Fuzzy string search with Whoosh in Python

cif*_*key 11 python information-retrieval fuzzy-search whoosh

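I've built a large database of banks in MongoDB. I can easily pull this information out and build an index from it. For example, I'd like to be able to match the bank names 'Eagle Bank & Trust Co of Missouri' and 'Eagle Bank and Trust Company of Missouri'. The following code works for simple fuzzy matches, but fails to match the two names above: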

from whoosh.index import create_in
from whoosh.fields import *

schema = Schema(name=TEXT(stored=True))
ix = create_in("indexdir", schema)
writer = ix.writer()

test_items = [u"Eagle Bank and Trust Company of Missouri"]

for item in test_items:
    writer.add_document(name=item)
writer.commit()

from whoosh.qparser import QueryParser
from whoosh.query import FuzzyTerm

with ix.searcher() as s:
    qp = QueryParser("name", schema=ix.schema, termclass=FuzzyTerm)
    q = qp.parse(u"Eagle Bank & Trust Co of Missouri")
    results = s.search(q)
    print(results)

This gives me:

<Top 0 Results for And([FuzzyTerm('name', u'eagle', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'bank', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'trust', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'co', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'missouri', boost=1.000000, minsimilarity=0.500000, prefixlength=1)]) runtime=0.00166392326355>

Is it possible to achieve what I want with Whoosh? If not, what other Python-based solutions are available?

Big*_*her 7

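You could match 'Co' with 'Company' using Whoosh's fuzzy search, but you shouldn't, because the difference between 'Co' and 'Company' is large. 'Co' is as similar to 'Company' as 'Be' is to 'Beast' or 'ny' is to 'Company', so you can imagine how large and how poor the result set would be.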

However, if you want to match 'Compan', 'compani', or 'Companee' to 'Company', you can do so with a custom subclass of FuzzyTerm whose default maxdist is 2 or more:

maxdist - The maximum edit distance from the given text.

from whoosh.query import FuzzyTerm

class MyFuzzyTerm(FuzzyTerm):
    # Same as FuzzyTerm, but the default maxdist is raised from 1 to 2.
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)

Then:

 qp = QueryParser("name", schema=ix.schema, termclass=MyFuzzyTerm)

You could match 'Co' with 'Company' by setting maxdist to 5, but as I said, that gives bad search results. My suggestion is to keep maxdist between 1 and 3.
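To see where these numbers come from, the edit distances can be checked by hand. The sketch below is a plain Levenshtein implementation written for illustration, not Whoosh's own code, and it assumes the terms are compared in lowercase, as the lowercased terms in the question's query output suggest.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


# Word pairs taken from the answer, lowercased.
pairs = [
    ("co", "company"),
    ("be", "beast"),
    ("ny", "company"),
    ("compan", "company"),
    ("compani", "company"),
    ("companee", "company"),
]
distances = {pair: edit_distance(*pair) for pair in pairs}
for (query, term), dist in distances.items():
    print("%-9s -> %-8s distance %d" % (query, term, dist))
```

The near-misspellings sit at distance 1 or 2, which is why a small maxdist covers them, whereas 'co' only reaches 'company' at distance 5, a radius that would also pull in many unrelated short words.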

If you are looking to match linguistic variations of a word, you are better off using whoosh.query.Variations.

Note: older Whoosh versions use minsimilarity instead of maxdist.