tot*_*ico 9 python string similarity difflib python-3.x
I want to use something like difflib.get_close_matches, but instead of the most similar strings I want to get their indexes (i.e., their positions in the list).
Indexes are more flexible than the strings themselves, because an index can be associated with other data structures (related to the matched string).
For example, instead of:
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> difflib.get_close_matches('Hello', words)
['hello', 'hallo', 'Hallo']
I would like:
>>> difflib.get_close_matches('Hello', words)
[0, 1, 6]
There does not seem to be a parameter that produces this result. Is there an alternative to difflib.get_close_matches() that returns the indexes?
I know I could use difflib.SequenceMatcher and then compare the strings one by one with ratio (or quick_ratio). However, I am afraid this would be very inefficient, because:
I would have to create thousands of SequenceMatcher objects and compare them (I was hoping get_close_matches avoided using that class):
Edit: wrong. I checked the source code of get_close_matches, and it actually uses SequenceMatcher.
There is no cutoff (I guessed there was an optimization that avoids computing the ratio for all the strings).
Edit: partially wrong. The code of get_close_matches has no major optimization, except that it computes real_quick_ratio and quick_ratio before the full ratio for each candidate. In any case, I can easily copy that optimization into my own function. I also had not taken into account that SequenceMatcher has methods to set the sequences, set_seq1 and set_seq2, so at least I do not have to create a new object every time.
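That cascade works because real_quick_ratio and quick_ratio are progressively cheaper upper bounds on ratio, so a candidate can often be rejected before the expensive ratio() call; and SequenceMatcher caches information about the second sequence, so setting the query once with set_seq2 and swapping candidates in with set_seq1 avoids recomputation. A small sketch (the example strings are my own):

```python
from difflib import SequenceMatcher

s = SequenceMatcher()
s.set_seq2('hello')        # the query is fixed once; info about it is cached
s.set_seq1('question')     # each candidate is swapped in with set_seq1
print(s.real_quick_ratio())  # length-based upper bound, cheapest
print(s.quick_ratio())       # character-multiset upper bound
print(s.ratio())             # true similarity, most expensive
```

Each value is guaranteed to be greater than or equal to the next, which is exactly what makes the early-rejection chain in get_close_matches valid.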
As far as I understand, these Python libraries are compiled C, which improves performance.
Edit: I am fairly sure that is the case. The function is in a folder called cpython.
Edit: there is a subtle difference (p-value 0.030198) between running the function directly from difflib and copying it into a file mydifflib.py:
ipdb> timeit.repeat("gcm('hello', _vals)", setup="from difflib import get_close_matches as gcm; _vals=['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']", number=100000, repeat=10)
[13.230449825001415, 13.126462900007027, 12.965455356999882, 12.955717618009658, 13.066136312991148, 12.935014379996574, 13.082025538009475, 12.943519036009093, 13.149949093989562, 12.970130036002956]
ipdb> timeit.repeat("gcm('hello', _vals)", setup="from mydifflib import get_close_matches as gcm; _vals=['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']", number=100000, repeat=10)
[13.363269686000422, 13.087718107010005, 13.112324478992377, 13.358293497993145, 13.283965317998081, 13.056695280989516, 13.021098569995956, 13.04310674899898, 13.024205000008806, 13.152750282009947]
Still, it is not as bad as I expected, so unless somebody knows of another library or an alternative, I will go ahead with this.
tot*_*ico 12
I took the source code of get_close_matches and modified it to return the indexes instead of the string values.
# mydifflib.py
from difflib import SequenceMatcher
from heapq import nlargest as _nlargest


def get_close_matches_indexes(word, possibilities, n=3, cutoff=0.6):
    """Use SequenceMatcher to return a list of the indexes of the best
    "good enough" matches. word is a sequence for which close matches
    are desired (typically a string).

    possibilities is a list of sequences against which to match word
    (typically a list of strings).

    Optional arg n (default 3) is the maximum number of close matches to
    return. n must be > 0.

    Optional arg cutoff (default 0.6) is a float in [0, 1]. Possibilities
    that don't score at least that similar to word are ignored.
    """
    if not n > 0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    result = []
    s = SequenceMatcher()
    s.set_seq2(word)
    for idx, x in enumerate(possibilities):
        s.set_seq1(x)
        if s.real_quick_ratio() >= cutoff and \
           s.quick_ratio() >= cutoff and \
           s.ratio() >= cutoff:
            result.append((s.ratio(), idx))

    # Move the best scorers to head of list
    result = _nlargest(n, result)

    # Strip scores for the best n matches
    return [x for score, x in result]
>>> from mydifflib import get_close_matches_indexes
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> get_close_matches_indexes('hello', words)
[0, 6, 1]
Now I can relate these indexes to the data associated with each string, without having to look the strings up again.
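For instance, the indexes can select records from a parallel structure directly. A minimal sketch; the langs list of associated data is invented for illustration:

```python
words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
# Hypothetical associated data: one record per entry in `words`.
langs = ['en', 'de', 'en', 'en', 'en', 'en', 'de', 'en', 'en']

# e.g. the result of get_close_matches_indexes('hello', words)
indexes = [0, 6, 1]

# Each index selects both the matched string and its associated record.
matches = [(words[i], langs[i]) for i in indexes]
print(matches)  # [('hello', 'en'), ('hallo', 'de'), ('Hallo', 'de')]
```

With plain get_close_matches this lookup would require searching for each returned string again, which is ambiguous when the list contains duplicates.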