Let's say I have a string
"Hello"
and a list
words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo','question', 'Hallo', 'format']
How can I find the n words
that are the closest to "Hello"
and present in the list words
?
In this case, we would have ['hello', 'hallo', 'Hallo', 'hi', 'format'...]
So the strategy is to sort the list words from the closest word to the furthest.
I thought about something like this
word = 'Hello'
for i, item in enumerate(words):
if lower(item) > lower(word):
...
but it's very slow in large lists.
UPDATE
difflib
works but it's very slow also. (words list
has 630000+ words inside (sorted and one per line)). So checking the list takes 5 to 7 seconds for every search for closest word!
Use difflib.get_close_matches
.
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> difflib.get_close_matches('Hello', words)
['hello', 'Hallo', 'hallo']
Please look at the documentation, because the function returns 3 or less closest matches by default.
difflib.get_close_matches("Hallo", words, len(words), 0)
would give all matches : - Niklas B. 2012-04-04 20:30
difflib.get_close_matches
that return the list indexes, not the strings. I posted another question heretoto_tico 2018-06-14 15:40
There is an awesome article with a complete source code (21 lines) provided by Peter Norvig on spelling correction.
http://norvig.com/spell-correct.html
The idea is to build all possible edits of your word,
hello - helo - deletes
hello - helol - transpose
hello - hallo - replaces
hello - heallo - inserts
def edits1(word):
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes = [a + b[1:] for a, b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
inserts = [a + c + b for a, b in splits for c in alphabet]
return set(deletes + transposes + replaces + inserts)
Now, look up each of these edits in your list.
Peter's article is a great read and worth reading.
Create a sorted list of your words and use the bisect module to identify the point in the sorted list where your word would fit according to the sorting order. Based on that position you can give the k nearest neighbours above and below to find the 2k closest words.
maybe heap can help you .
you have a heap named Heap
that until it's size is less than n
, you insert words into the Heap
using function close
[shows this string is closer than another string or not] .
this method can help you when n
is small :)
Heap = []
for word in words:
if len(Heap)<n:
Heap.insert(word)
else
if close(word,Heap[0]): # it means Heap[0] is the nth farthest word until now
Heap.pop():
Heap.insert(word)