Fast way to match a fuzzy text property in Neo4j

Fir*_*ger 6 indexing alias neo4j cypher

I have a reasonable number of nodes (around 60,000):

(:Document {title:"A title"})

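Given a title, I want to find the matching node, if one exists. The problem is that the titles I am given are inconsistent: sometimes each new word starts with a capital letter, sometimes everything is lowercase. Sometimes Key-Words are joined in kebab case, sometimes they are written normally as key words.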

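To compensate, I use apoc to compute the Levenshtein distance between the given title and each node's title, and accept a node as a match only if the distance is below some threshold: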

MATCH (a:Document)
WHERE apoc.text.distance(a.title, "A title") < 10
RETURN a

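This does not scale well. A single lookup currently takes about 700 ms, which is too slow given that the data set will probably grow to around 150,000 nodes.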
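To make the cost concrete, here is a minimal Python sketch of the classic dynamic-programming Levenshtein distance. It is only an illustration of the metric, not the APOC implementation; the function name `levenshtein` is made up for this example. Because the distance has to be computed against every node's title, the work grows linearly with the number of nodes, and a threshold of 10 accepts any title within nine character edits.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


# Case and separator differences each cost one edit.
print(levenshtein("A title", "a-title"))  # 2
```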

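I was thinking of storing/caching the alternative titles seen so far in an alias:[...] property on the node and building an index over all aliases, but I don't know whether that is possible in Neo4j.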
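A related idea that avoids enumerating aliases is to reduce every title to one canonical form and index that instead. The sketch below is an assumption, not something Neo4j provides: `normalize_title` is a hypothetical helper, and it presumes the result is stored in an extra, hypothetical property (say `norm_title`) that has an ordinary index, so that a lookup becomes an exact match on the normalized incoming title.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase the title and collapse punctuation/whitespace runs to one space."""
    # [\W_]+ matches any run of characters that are not letters or digits,
    # so "Key-Words", "key words" and "KEY_WORDS" all normalize identically.
    return re.sub(r"[\W_]+", " ", title.lower()).strip()


print(normalize_title("Key-Words"))     # key words
print(normalize_title("  A   Title "))  # a title
```

This only covers differences in case and separators; real typos would still need a fuzzy search.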

What is the fastest way to "fuzzy find" a title in a large database of nodes?

Chr*_*sen 17

Neo4j 3.5 (currently at beta03) has an FTS (full-text search) feature.

Edit: I wrote a detailed blog post about FTS in Neo4j: https://graphaware.com/neo4j/2019/01/11/neo4j-full-text-search-deep-dive.html

You can query your documents using the Lucene Classic Query Parser syntax.

Create the index:

CALL db.index.fulltext.createNodeIndex('documents', ['Document'], ['title','text'])

Import some documents:

LOAD CSV WITH HEADERS FROM "file:///docs.csv" AS row
CREATE (n:Document) SET n = row

Query for documents whose title contains "heavy toll":

CALL db.index.fulltext.queryNodes('documents', 'title: "heavy toll"')
YIELD node, score
RETURN node.title, score

+----------------------------------------------------------------------+------------------+
|"node.title"                                                          |"score"           |
+----------------------------------------------------------------------+------------------+
|"Among Deaths in 2016, a Heavy Toll in Pop Music - The New York Times"|3.7325966358184814|
+----------------------------------------------------------------------+------------------+

Query for the same title with typos:

CALL db.index.fulltext.queryNodes('documents', 'title: \\"heavy~ tall~\\"')
YIELD node, score
RETURN node.title, score

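Note the escaped quotes => \": the string passed to the underlying parser should contain the quotes so that a phrase query is performed rather than a boolean query.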

Also, the tilde next to a term indicates a fuzzy search using the Damerau-Levenshtein algorithm.

+----------------------------------------------------------------------+---------------------+
|"node.title"                                                          |"score"              |
+----------------------------------------------------------------------+---------------------+
|"Among Deaths in 2016, a Heavy Toll in Pop Music - The New York Times"|0.868073046207428    |
+----------------------------------------------------------------------+---------------------+
|"Prisons Run by C.E.O.s? Privatization Under Trump Could Carry a Heavy|0.4014900326728821   |
| Price - The New York Times"                                          |                     |
+----------------------------------------------------------------------+---------------------+
|"‘All Talk,’ ‘No Action,’ Says Trump in Twitter Attack on Civil Rights|0.28181418776512146  |
| Icon - The New York Times"                                           |                     |
+----------------------------------------------------------------------+---------------------+
|"Immigrants Head to Washington to Rally While Obama Is Still There - T|0.24634429812431335  |
|he New York Times"                                                    |                     |
+----------------------------------------------------------------------+---------------------+
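When the incoming title arrives from application code, the fuzzy query string can be built on the client. The Python sketch below is one possible way to do that, under stated assumptions: `fuzzy_title_query` is a hypothetical helper, not part of any Neo4j driver, and it leaves out the surrounding quotes and appends `~` to each bare term, because the classic parser documents the fuzzy operator for single terms. Tokens are lowercased so that words such as AND or OR in a title are not read as operators, and everything that is not a letter or digit is dropped, so no Lucene special character needs escaping.

```python
import re


def fuzzy_title_query(title, field="title", max_edits=None):
    """Build a Lucene classic-syntax query with a fuzzy operator on every term."""
    # Keep only runs of letters/digits; hyphens, quotes, colons etc. vanish.
    tokens = re.findall(r"[^\W_]+", title.lower())
    if not tokens:
        raise ValueError("title contains no searchable terms")
    # "~" alone uses the parser's default edit distance; "~1" restricts it to 1.
    suffix = "~" if max_edits is None else f"~{max_edits}"
    terms = " ".join(token + suffix for token in tokens)
    return f"{field}:({terms})"


print(fuzzy_title_query("Heavy-Tall"))               # title:(heavy~ tall~)
print(fuzzy_title_query("heavy tall", max_edits=1))  # title:(heavy~1 tall~1)
```

The resulting string would then be passed as a query parameter, for example CALL db.index.fulltext.queryNodes('documents', $q), rather than concatenated into the Cypher text.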