cif*_*key 11 python information-retrieval fuzzy-search whoosh
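I have built a large database of banks in MongoDB. I can easily pull this information out and build a Whoosh index from it. For example, I would like to be able to match the bank names 'Eagle Bank & Trust Co of Missouri' and 'Eagle Bank and Trust Company of Missouri'. The following code works for simple fuzzy matches like this, but it fails to match the two names above: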
from whoosh.index import create_in
from whoosh.fields import *

schema = Schema(name=TEXT(stored=True))
# The "indexdir" directory must already exist
ix = create_in("indexdir", schema)
writer = ix.writer()

test_items = [u"Eagle Bank and Trust Company of Missouri"]

# Index every test item under the "name" field
for item in test_items:
    writer.add_document(name=item)
writer.commit()

from whoosh.qparser import QueryParser
from whoosh.query import FuzzyTerm

with ix.searcher() as s:
    # Parse every query word as a FuzzyTerm instead of an exact Term
    qp = QueryParser("name", schema=ix.schema, termclass=FuzzyTerm)
    q = qp.parse(u"Eagle Bank & Trust Co of Missouri")
    results = s.search(q)
    print(results)
This gives me:
<Top 0 Results for And([FuzzyTerm('name', u'eagle', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'bank', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'trust', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'co', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'missouri', boost=1.000000, minsimilarity=0.500000, prefixlength=1)]) runtime=0.00166392326355>
Is it possible to achieve what I want with Whoosh? If not, what other Python-based solutions do I have?
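You could match Co with Company using Whoosh's fuzzy search, but you shouldn't, because the difference between Co and Company is large. Co is as similar to Company as Be is to Beast, or as ny is to Company, so you can imagine how large and how poor the search results would be.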
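To make "the difference is large" concrete, here is a minimal sketch that computes the plain Levenshtein edit distance for the word pairs discussed in this answer. It is a standalone illustration, not Whoosh's internal code, and Whoosh's own distance computation may differ in detail; the helper name `edit_distance` is made up for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep)
            ))
        previous = current
    return previous[-1]


# Word pairs taken from the answer text (lowercased, as an analyzer would do)
pairs = [
    ("co", "company"),
    ("be", "beast"),
    ("ny", "company"),
    ("compan", "company"),
    ("compani", "company"),
    ("companee", "company"),
]

for left, right in pairs:
    print(left, "->", right, ":", edit_distance(left, right))
```

The abbreviation pairs sit several edits apart, while the misspellings of Company stay within one or two edits, which is the range fuzzy matching is meant for.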
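However, if you want to match Compan, compani, or Companee to Company, you can do so with a custom subclass of FuzzyTerm whose default maxdist is 2 or more:
maxdist - The maximum edit distance from the given text.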
class MyFuzzyTerm(FuzzyTerm):
    # Same as FuzzyTerm, but with maxdist defaulting to 2
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)
Then:
qp = QueryParser("name", schema=ix.schema, termclass=MyFuzzyTerm)
You can match Co with Company by setting maxdist to 5, but as I said, this gives bad search results. My suggestion is to keep maxdist between 1 and 3.
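A rough way to see why a large maxdist hurts is to model the matching rule in plain Python: a term is a candidate if it shares the first prefixlength characters with the query word and lies within maxdist edits of it. The sketch below is only a model under that assumption, not Whoosh's implementation; the function names and the small vocabulary of indexed terms are hypothetical.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_candidates(text, vocabulary, maxdist, prefixlength=1):
    """Terms that share the first prefixlength characters with text
    and are at most maxdist edits away from it (a simplified model)."""
    prefix = text[:prefixlength]
    return sorted(
        term for term in vocabulary
        if term.startswith(prefix) and edit_distance(text, term) <= maxdist
    )


# Hypothetical terms that might appear in an index of bank names
vocabulary = ["co", "com", "cost", "company", "county",
              "credit", "capital", "bank"]

for maxdist in (1, 3, 5):
    print(maxdist, fuzzy_candidates("co", vocabulary, maxdist))
```

In this toy vocabulary, the smallest maxdist that admits company for the query word co also admits unrelated terms such as county and credit.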
If you want to match linguistic variations of a word, you are better off using whoosh.query.Variations.
Note: older Whoosh versions use minsimilarity instead of maxdist.