Python: find closest string (from a list) to another string

Go To StackoverFlow.com

43

Let's say I have a string "Hello" and a list

words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo','question', 'Hallo', 'format']

How can I find the n words that are the closest to "Hello" and present in the list words ?

In this case, we would have ['hello', 'hallo', 'Hallo', 'hi', 'format'...]

So the strategy is to sort the list words from the closest word to the furthest.

I thought about something like this

word = 'Hello'
for i, item in enumerate(words):
    if lower(item) > lower(word):
      ...

but it's very slow in large lists.

UPDATE difflib works but it's very slow also. (words list has 630000+ words inside (sorted and one per line)). So checking the list takes 5 to 7 seconds for every search for closest word!

2012-04-04 20:22
by Laura
Maybe you are looking for something like editing distance or Levinshtein distance - tripleee 2012-04-04 20:24
Exactly, this largely depends on what your definition of 'closest' is - Gareth Latty 2012-04-04 20:27
Are the 630,000 words sorted? Are they in a file, one word per line - Peter Wood 2012-04-04 21:31
sorted and one per line ye - Laura 2012-04-04 21:32
How do you intend to define 'closest'? In your sample code, you're using a lexicographic comparison, but that ranks 'hermitage' as a better match for 'hello' than 'jello' is - Nick Johnson 2012-04-05 05:08
Did you find an efficient solution for 6M+ dictionary items? I'm stocked here as well - Nick 2016-01-08 22:19


68

Use difflib.get_close_matches.

>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> difflib.get_close_matches('Hello', words)
['hello', 'Hallo', 'hallo']

Please look at the documentation, because the function returns 3 or less closest matches by default.

2012-04-04 20:27
by Oleh Prypin
Just a quick FYI: difflib.get_close_matches("Hallo", words, len(words), 0) would give all matches : - Niklas B. 2012-04-04 20:30
The Levenshtein difference can be used as well. There is a good python implementation http://pypi.python.org/pypi/python-Levenshtein - Maksym Polshcha 2012-04-04 20:47
difflib works but it's very slow also. (words list has 630000+ words inside). So checking the list takes 5 to 7 seconds for every search for closest word - Laura 2012-04-04 21:02
@Laura There is a difference between being slow and taking a long time. 7 seconds might be a long time but it might not be slow - Peter Wood 2012-04-05 10:32
@Oleh, I was wondering if you happen to know if there is an alternative of difflib.get_close_matches that return the list indexes, not the strings. I posted another question heretoto_tico 2018-06-14 15:40


21

There is an awesome article with a complete source code (21 lines) provided by Peter Norvig on spelling correction.

http://norvig.com/spell-correct.html

The idea is to build all possible edits of your word,

hello - helo   - deletes    
hello - helol  - transpose    
hello - hallo  - replaces    
hello - heallo - inserts    


def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

Now, look up each of these edits in your list.

Peter's article is a great read and worth reading.

2012-04-04 22:34
by Amjith
Thanks, that is a great find. I am writing an integrated dictionary and this might be useful - laurasia 2012-11-27 01:59


1

Create a sorted list of your words and use the bisect module to identify the point in the sorted list where your word would fit according to the sorting order. Based on that position you can give the k nearest neighbours above and below to find the 2k closest words.

2012-04-04 20:34
by user1308520
Are you sure this will work? I don't think lexicographical order is what OP wants.. - Oleh Prypin 2012-04-04 20:40
Considering the code snippet from the question, such a simple solution could be all that is required - user1308520 2012-04-04 20:58
This won't work if possible spelling mistakes are to be considered, especially if mistakes are made at the beginning of the word. But a good solution for correct words indeed - laurasia 2012-11-27 02:03


0

maybe heap can help you .

you have a heap named Heap that until it's size is less than n , you insert words into the Heap using function close [shows this string is closer than another string or not] .

this method can help you when n is small :)

Heap = []
for word in words:
    if len(Heap)<n:
       Heap.insert(word)
    else
       if close(word,Heap[0]): # it means Heap[0] is the nth farthest word until now
             Heap.pop():
             Heap.insert(word)
2012-04-05 08:38
by Divuneh
Ads