Efficiently searching for keywords when the keywords are multi-word

suz*_*zee 5 python string pattern-matching string-matching keyword-search

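I need to efficiently match a very large list of keywords (> 1000000) against strings using Python. I found some very good libraries that try to do this fast: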

1) FlashText (https://github.com/vi3k6i5/flashtext)

2) The Aho-Corasick algorithm, etc.

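However, I have one peculiar requirement: in my context, the keyword 'XXXX YYYY' should return a match if my string is 'XXXX is a very good indication of YYYY'. Note that 'XXXX YYYY' does not occur as a substring here, but both XXXX and YYYY are present in the string, and that is good enough for me.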
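To make the requirement concrete, here is a minimal sketch of the matching rule for a single keyword. The function name `keyword_matches` is hypothetical, and splitting on whitespace is an assumption; real text would likely need lowercasing and punctuation handling.

```python
def keyword_matches(keyword, text):
    """Return True if every word of `keyword` occurs somewhere in `text`.

    Assumption: words are separated by whitespace and compared exactly.
    """
    return set(keyword.split()) <= set(text.split())


text = "XXXX is a very good indication of YYYY"
print(keyword_matches("XXXX YYYY", text))  # both words present, order ignored
print(keyword_matches("XXXX ZZZZ", text))  # ZZZZ is missing
```

Running this check for each of a million keywords against every string is the naive approach; it scales with the number of keywords rather than with the length of the string.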

I know how to do this naively. What I am looking for is efficiency: are there any good libraries for this?

saa*_*aaj 1

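What you are asking for sounds like a full-text search task. There is a Python search package called Whoosh. @derek's corpus can be indexed and searched in memory as follows.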

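from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser
from whoosh import fields


texts = [
    "Here's a sentence with dog and apple in it",
    "Here's a sentence with dog and poodle in it",
    "Here's a sentence with poodle and apple in it",
    "Here's a dog with and apple and a poodle in it",
    "Here's an apple with a dog to show that order is irrelevant"
]

# A single stored TEXT field; stored=True lets each hit return the original text.
schema = fields.Schema(text=fields.TEXT(stored=True))

# Keep the index in memory; create_index() already returns the index object.
storage = RamStorage()
index = storage.create_index(schema)

# Add every text as its own document, then commit to make them searchable.
writer = index.writer()
for t in texts:
    writer.add_document(text=t)
writer.commit()

# The default query parser ANDs the terms, so both words must be present.
query = QueryParser('text', schema).parse('dog apple')

# Use the searcher as a context manager so it is closed after use.
with index.searcher() as searcher:
    results = searcher.search(query)
    for r in results:
        print(r)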

This produces:

<Hit {'text': "Here's a sentence with dog and apple in it"}>
<Hit {'text': "Here's a dog with and apple and a poodle in it"}>
<Hit {'text': "Here's an apple with a dog to show that order is irrelevant"}>

You can also persist the index using FileStorage, as described in How to index documents.
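For the case in the question, where the keyword list is the large side, the same inverted-index idea can be turned around: index the keywords by word, then look up only the words that occur in each incoming string. The following is a rough pure-Python sketch of that idea, not Whoosh's implementation. The names `build_keyword_index` and `match_keywords` are hypothetical, and lowercasing plus whitespace tokenization is an assumption.

```python
from collections import defaultdict


def build_keyword_index(keywords):
    """Map each word to the ids of the keywords that contain it.

    Also returns, per keyword, how many distinct words it has.
    Assumption: words are lowercased and split on whitespace.
    """
    word_to_ids = defaultdict(list)
    sizes = []
    for kid, keyword in enumerate(keywords):
        words = set(keyword.lower().split())
        sizes.append(len(words))
        for word in words:
            word_to_ids[word].append(kid)
    return word_to_ids, sizes


def match_keywords(text, keywords, word_to_ids, sizes):
    """Return the keywords whose words all occur in `text`, in any order."""
    hits = defaultdict(int)
    for word in set(text.lower().split()):
        for kid in word_to_ids.get(word, ()):
            hits[kid] += 1
    # A keyword matches when every one of its distinct words was seen.
    return [keywords[kid] for kid, n in hits.items() if n == sizes[kid]]


keywords = ["dog apple", "poodle", "cat dog"]
word_to_ids, sizes = build_keyword_index(keywords)
text = "Here's an apple with a dog to show that order is irrelevant"
print(match_keywords(text, keywords, word_to_ids, sizes))
```

The work per string depends on how many keywords share a word with it, not on the total number of keywords, although very common words can still produce long candidate lists.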