suz*_*zee 5 python string pattern-matching string-matching keyword-search
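I need to efficiently match a very large list of keywords (> 1,000,000) against a string using Python. I found some very good libraries that try to do this fast:
1) FlashText (https://github.com/vi3k6i5/flashtext)
2) The Aho-Corasick algorithm, etc.
However, I have a peculiar requirement: in my context, the keyword 'XXXX YYYY' should return a match if my string is 'XXXX is a very good indication of YYYY'. Note that 'XXXX YYYY' does not occur as a substring, but both XXXX and YYYY are present in the string, and that is good enough for me to count as a match.
I know how to do this naively. What I am looking for is efficiency: are there any better libraries for this?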
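What you are asking for sounds like a full-text search task. There is a Python search package called whoosh. @derek's corpus can be indexed and searched in memory as follows.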
from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser
from whoosh import fields
texts = [
    "Here's a sentence with dog and apple in it",
    "Here's a sentence with dog and poodle in it",
    "Here's a sentence with poodle and apple in it",
    "Here's a dog with and apple and a poodle in it",
    "Here's an apple with a dog to show that order is irrelevant",
]
schema = fields.Schema(text=fields.TEXT(stored=True))
storage = RamStorage()
index = storage.create_index(schema)
storage.open_index()
writer = index.writer()
for t in texts:
    writer.add_document(text=t)
writer.commit()
query = QueryParser('text', schema).parse('dog apple')
results = index.searcher().search(query)
for r in results:
    print(r)
This produces:
<Hit {'text': "Here's a sentence with dog and apple in it"}>
<Hit {'text': "Here's a dog with and apple and a poodle in it"}>
<Hit {'text': "Here's an apple with a dog to show that order is irrelevant"}>
You can also persist the index using FileStorage, as described in How to index documents.
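The whoosh example indexes texts and queries them one keyword at a time. For the direction described in the question (many keywords, one incoming string), the same order-independent idea can be sketched in plain Python with an inverted index from tokens to keywords. This is only an illustrative sketch, not part of whoosh or FlashText: `KeywordIndex` and `tokenize` are hypothetical names, and it assumes that lowercasing and splitting on whitespace is an acceptable tokenization (punctuation attached to a word would prevent a match).

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase, whitespace-separated tokens are good enough.
    return text.lower().split()


class KeywordIndex:
    """Hypothetical helper: maps each token to the keywords that require it."""

    def __init__(self, keywords):
        self.keywords = list(keywords)
        self.needed = []                   # distinct tokens each keyword requires
        self.by_token = defaultdict(list)  # token -> ids of keywords containing it
        for kid, keyword in enumerate(self.keywords):
            tokens = set(tokenize(keyword))
            self.needed.append(len(tokens))
            for tok in tokens:
                self.by_token[tok].append(kid)

    def match(self, text):
        # Count, per keyword, how many of its tokens occur in the text.
        seen = defaultdict(int)
        for tok in set(tokenize(text)):
            for kid in self.by_token.get(tok, ()):
                seen[kid] += 1
        # A keyword matches when all of its tokens were found, in any order.
        return [self.keywords[kid] for kid, n in seen.items()
                if n == self.needed[kid]]


index = KeywordIndex(["XXXX YYYY", "dog apple", "poodle"])
matches = index.match("XXXX is a very good indication of YYYY")
print(matches)
```

Each lookup touches only the keywords that share at least one token with the input string, so the work per string depends on how common its tokens are among the keywords rather than on the total number of keywords.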