有效地建立具有给定汉明距离的单词图

lau*_*ids 19 python algorithm hamming-distance graph-algorithm

我想从汉明距离为(例如)1 的单词列表中构建一个图形,或者换句话说,如果它们只与一个字母(lo l - > lo t)不同,则连接两个单词.

所以给定

words = [ lol, lot, bot ]

图表将是

{
  'lol' : [ 'lot' ],
  'lot' : [ 'lol', 'bot' ],
  'bot' : [ 'lot' ]
}
Run Code Online (Sandbox Code Playgroud)

简单的方法是将列表中的每个单词与其他单词进行比较并计算不同的字符; 遗憾的是,这是一种O(N^2)算法.

我可以使用哪种algo/ds /策略来获得更好的性能?

另外,我们假设只有拉丁字符,并且所有单词都具有相同的长度.

ffe*_*rri 21

假设您将字典存储在a中set(),因此查找的平均值为O(1)(最差情况为O(n)).

您可以从一个单词生成汉明距离1处的所有有效单词:

>>> def neighbours(word):
...     for j in range(len(word)):
...         for d in string.ascii_lowercase:
...             word1 = ''.join(d if i==j else c for i,c in enumerate(word))
...             if word1 != word and word1 in words: yield word1
...
>>> {word: list(neighbours(word)) for word in words}
{'bot': ['lot'], 'lol': ['lot'], 'lot': ['bot', 'lol']}
Run Code Online (Sandbox Code Playgroud)

如果M是单词的长度,L是字母表的长度(即26),则用这种方法找到相邻单词的最坏情况时间复杂度是O(L*M*N).

"简单方法"方法的时间复杂度为O(N ^ 2).

当这种方法更好?什么时候L*M < N,即如果只考虑小写字母,何时M < N/26.(我认为这里只是最糟糕的情况)

注意:英文单词的平均长度为5.1个字母.因此,如果您的字典大小超过132个单词,您应该考虑这种方法.

可能有可能实现比这更好的性能.然而,这实现起来非常简单.

实验基准:

"简单方法"算法(A1):

from itertools import zip_longest
def hammingdist(w1,w2): return sum(1 if c1!=c2 else 0 for c1,c2 in zip_longest(w1,w2))
def graph1(words): return {word: [n for n in words if hammingdist(word,n) == 1] for word in words}
Run Code Online (Sandbox Code Playgroud)

该算法(A2):

def graph2(words): return {word: list(neighbours(word)) for word in words}
Run Code Online (Sandbox Code Playgroud)

基准代码:

for dict_size in range(100,6000,100):
    words = set([''.join(random.choice(string.ascii_lowercase) for x in range(3)) for _ in range(dict_size)])
    t1 = Timer(lambda: graph1()).timeit(10)
    t2 = Timer(lambda: graph2()).timeit(10)
    print('%d,%f,%f' % (dict_size,t1,t2))
Run Code Online (Sandbox Code Playgroud)

输出:

100,0.119276,0.136940
200,0.459325,0.233766
300,0.958735,0.325848
400,1.706914,0.446965
500,2.744136,0.545569
600,3.748029,0.682245
700,5.443656,0.773449
800,6.773326,0.874296
900,8.535195,0.996929
1000,10.445875,1.126241
1100,12.510936,1.179570
...
Run Code Online (Sandbox Code Playgroud)

数据图

我用N的较小步骤运行了另一个基准测试,看得更接近:

10,0.002243,0.026343
20,0.010982,0.070572
30,0.023949,0.073169
40,0.035697,0.090908
50,0.057658,0.114725
60,0.079863,0.135462
70,0.107428,0.159410
80,0.142211,0.176512
90,0.182526,0.210243
100,0.217721,0.218544
110,0.268710,0.256711
120,0.334201,0.268040
130,0.383052,0.291999
140,0.427078,0.312975
150,0.501833,0.338531
160,0.637434,0.355136
170,0.635296,0.369626
180,0.698631,0.400146
190,0.904568,0.444710
200,1.024610,0.486549
210,1.008412,0.459280
220,1.056356,0.501408
...
Run Code Online (Sandbox Code Playgroud)

数据图2

您会看到权衡非常低(长度为3的单词词典为100).对于小字典,O(N ^ 2)算法表现稍好,但随着N的增长,O(LMN)算法很容易击败.

对于具有较长单词的字典,O(LMN)算法在N中保持线性,它只是具有不同的斜率,因此权衡略微向右移动(130表示长度= 5).


Dav*_*tat 6

没有必要依赖于字母表大小.bot例如,给出一个单词,将其插入到键下的单词列表字典中?ot, b?t, bo?.然后,对于每个单词列表,连接所有对.

import collections


d = collections.defaultdict(list)
with open('/usr/share/dict/words') as f:
    for line in f:
        for word in line.split():
            if len(word) == 6:
                for i in range(len(word)):
                    d[word[:i] + ' ' + word[i + 1:]].append(word)
pairs = [(word1, word2) for s in d.values() for word1 in s for word2 in s if word1 < word2]
print(len(pairs))
Run Code Online (Sandbox Code Playgroud)


c-s*_*ile 5

三元搜索Trie支持近邻搜索.

如果您的字典存储为TST,那么,我相信,在构建图形时,查找的平均复杂性将接近O(N*log(N))现实世界的单词字典.

使用三元搜索树文章检查Efficient auto-complete.