What is a good strategy for grouping similar words?

Question

Suppose I have a list of movie titles with misspellings and similar small variations:

 "Pirates of the Caribbean: The Curse of the Black Pearl"
 "Pirates of the carribean"
 "Pirates of the Caribbean: Dead Man's Chest"
 "Pirates of the Caribbean trilogy"
 "Pirates of the Caribbean"
 "Pirates Of The Carribean"

How can I group or find such sets of words, preferably using Python and/or Redis?

Answer 1

Fre*_*ihl 16

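Have a look at "fuzzy matching". The thread linked below has some great tools for computing the similarity between strings.

I'm especially fond of the difflib module: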

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

/sf/ask/47765721/
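
As a rough illustration of how `get_close_matches` could be applied to the titles in the question, here is a minimal sketch. The helper name `group_titles`, the lowercasing step, and the `cutoff=0.8` value are assumptions made for this example, not part of the answer above. Each title is compared against one representative per existing group and starts a new group when nothing is close enough.

```python
from difflib import get_close_matches


def group_titles(titles, cutoff=0.8):
    """Greedily group titles whose lowercased forms are close matches.

    Hypothetical helper: `cutoff` is the difflib similarity ratio (0..1)
    a title must reach against a group's first member to join that group.
    """
    groups = []  # each entry: (lowercased representative, [original titles])
    for title in titles:
        key = title.lower()
        reps = [rep for rep, _ in groups]
        # Ask difflib for the single best representative above the cutoff.
        match = get_close_matches(key, reps, n=1, cutoff=cutoff)
        if match:
            groups[reps.index(match[0])][1].append(title)
        else:
            groups.append((key, [title]))
    return [members for _, members in groups]


titles = [
    "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Pirates of the carribean",
    "Pirates of the Caribbean: Dead Man's Chest",
    "Pirates of the Caribbean trilogy",
    "Pirates of the Caribbean",
    "Pirates Of The Carribean",
]
groups = group_titles(titles)
for members in groups:
    print(members)
```

With this cutoff the short variants (including both misspellings) end up in one group, while the two subtitled films stay in groups of their own. The result depends on input order and on the cutoff, so treat both as knobs to tune rather than fixed values.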

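@FredrikPihl, could you post the definition of `get_close_matches` here (or edit it into the answer) for those of us unworthy low-reputation peasants? (2 upvotes)
Looks like I asked too soon. It is simply a function that is part of difflib: https://docs.python.org/2/library/difflib.html#difflib.get_close_matches (2 upvotes)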

Answer 2

ste*_*emm 5

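You may notice that similar strings share a large common part, for example:

"Bla bla bLa" and "Bla bla bRa" => the common part is "Bla bla ba" (note the third word; strictly speaking this is a common subsequence, because the matched characters are not all contiguous)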
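
To make the example concrete, the following sketch uses the standard library's `difflib.SequenceMatcher` to extract both the longest contiguous run and the joined matching blocks for these two strings. The variable names are illustrative, and `SequenceMatcher` is a stand-in chosen for convenience; the answer does not say which tool it had in mind.

```python
from difflib import SequenceMatcher

a, b = "Bla bla bLa", "Bla bla bRa"
matcher = SequenceMatcher(None, a, b)

# Longest contiguous run shared by both strings (a true common substring).
m = matcher.find_longest_match(0, len(a), 0, len(b))
longest_substring = a[m.a:m.a + m.size]

# All matching blocks joined together (a common subsequence of both strings).
common_part = "".join(
    a[blk.a:blk.a + blk.size] for blk in matcher.get_matching_blocks()
)

print(repr(longest_substring))  # 'Bla bla b'
print(repr(common_part))        # 'Bla bla ba'
```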

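To find the common part, you can use a dynamic programming algorithm. A closely related measure is the Levenshtein distance (very similar strings are a small distance apart, while more different strings are farther apart): http://en.wikipedia.org/wiki/Levenshtein_distance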
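
For reference, a minimal sketch of the textbook dynamic programming formulation of Levenshtein distance is shown below. The function name `levenshtein` is chosen for this example; it keeps only two rows of the table, so memory use grows with the shorter string.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("Caribbean", "Carribean"))  # 2
print(levenshtein("kitten", "sitting"))       # 3
```

Because raw distances grow with string length, it may help to lowercase the titles first and to divide by the length of the longer string before comparing against a threshold.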

Also, for fast performance, you can try adapting the Soundex algorithm: http://en.wikipedia.org/wiki/Soundex
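
Below is a minimal sketch of the common American Soundex rules for a single word, to show why it can be fast: each word collapses to a four-character key that can be compared or indexed directly. The function name `soundex` and the handling of non-letters (they are dropped) are assumptions of this example. Soundex was designed for single names, so for movie titles one option would be to encode each word separately.

```python
# Map consonants to Soundex digits; vowels, h, w, and y have no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex key of `word` ("" if it has no letters)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (first.upper() + "".join(encoded) + "000")[:4]


print(soundex("Caribbean"), soundex("Carribean"))  # C615 C615
```

Keys like these could serve as cheap bucket identifiers (for example as dictionary or Redis keys), with a more precise distance applied only within a bucket.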

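So, after computing the distances between all of your strings, you have to cluster them. The simplest approach is k-means (but it requires you to define the number of clusters, and it works on feature vectors rather than on a raw distance matrix). If you do not actually know the number of clusters, you have to use hierarchical clustering. Note that the number of clusters in your case is the number of distinct movie titles + 1 (for totally misspelled strings).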
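
The clustering step can be sketched with a small single-linkage agglomerative procedure in plain Python. Everything named here is an assumption for illustration: the `distance` function uses `1 - difflib ratio` on lowercased strings instead of Levenshtein distance to keep the block short, the `threshold=0.2` stopping value is arbitrary, and the quadratic search is only suitable for small lists.

```python
from difflib import SequenceMatcher


def distance(a, b):
    """Dissimilarity in [0, 1]; 0 means identical after lowercasing."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(strings, threshold=0.2):
    """Single-linkage agglomerative clustering: repeatedly merge the two
    closest clusters until the closest pair is farther than `threshold`."""
    clusters = [[s] for s in strings]
    while len(clusters) > 1:
        best = None  # (distance, index of first cluster, index of second)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters


titles = [
    "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Pirates of the carribean",
    "Pirates of the Caribbean: Dead Man's Chest",
    "Pirates of the Caribbean trilogy",
    "Pirates of the Caribbean",
    "Pirates Of The Carribean",
]
clusters = cluster(titles)
for members in clusters:
    print(members)
```

Stopping at a distance threshold avoids having to choose the number of clusters up front, which is the reason the answer gives for preferring hierarchical clustering. Raising the threshold merges the subtitled films into the same cluster as the plain title.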
