Enn*_*oji 5 java lucene solr fuzzy-search string-matching
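Suppose I have a few billion lines of text and a few million "keywords". The task is to go through these lines and see which line contains which keywords. In other words, given a map (K1 -> V1) and a map (K2 -> V2), create a map (K2 -> K1), where K1=lineID, V1=text, K2=keywordID and V2=keyword. Also note:
So far, my initial idea for solving this problem is as follows: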
    1) Chop up all my keywords into single words and
       create a large set of single words (K3)
    2) Construct a BK-Tree out of these chopped-up keywords,
       using Levenshtein distance
    3) For each line of data (V1),
       3.1) Chop up the text (V1) into words
       3.2) For each said word,
            3.2.1) Retrieve words (K3) from the BK-Tree that
                   are close enough to said word
       3.3) Since at this point we still have false positives
            (e.g. we would have matched "clean" from "clean water" against
            keyword "clean towel"), we check all possible combinations
            using a trie of keywords (V2) to filter such false
            positives out. We construct this trie so that at the
            end of a successful match, the keywordID (K2) can be retrieved.
       3.4) Return the correct set of keywordIDs (K2) for this line (V1)!
    4) Profit!
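To make steps 2 and 3.2.1 concrete, here is a minimal sketch of a BK-tree over single words using Levenshtein distance. The class name `BKTree`, the methods `add` and `search`, and the `maxDistance` parameter are hypothetical names chosen for illustration; this is not tuned for billions of lines and is only meant to show the lookup idea.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative BK-tree over single words, keyed by Levenshtein distance. */
class BKTree {
    private static final class Node {
        final String word;
        // Child edges are labelled with the distance between child and this node.
        final Map<Integer, Node> children = new HashMap<>();
        Node(String word) { this.word = word; }
    }

    private Node root;

    /** Two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Step 2: insert one chopped-up keyword word (K3) into the tree. */
    void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node node = root;
        while (true) {
            int d = levenshtein(word, node.word);
            if (d == 0) return;                      // word already stored
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(word)); return; }
            node = child;
        }
    }

    /** Step 3.2.1: all stored words within maxDistance of the query word. */
    List<String> search(String query, int maxDistance) {
        List<String> result = new ArrayList<>();
        if (root == null) return result;
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            int d = levenshtein(query, node.word);
            if (d <= maxDistance) result.add(node.word);
            // Triangle inequality: only edges in [d - max, d + max] can hold matches.
            for (Map.Entry<Integer, Node> e : node.children.entrySet()) {
                if (e.getKey() >= d - maxDistance && e.getKey() <= d + maxDistance) {
                    stack.push(e.getValue());
                }
            }
        }
        return result;
    }
}
```

With the words `clean`, `clear`, `towel` and `water` added, `search("cleam", 1)` would return `clean` and `clear` in some order, since each differs from the query by one substitution. The order of results is not specified because child edges are kept in a `HashMap`.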
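The false-positive filter in step 3.3 could look roughly like the following sketch: a word-level trie of the full keywords (V2) that is walked from every position in the line, branching over the fuzzy candidates found for each word. The names `KeywordTrie`, `addKeyword` and `match` are hypothetical, and the input format (one candidate set per word of the line) is an assumption about how the output of step 3.2.1 would be passed along.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative word-level trie mapping whole keywords (V2) to keywordIDs (K2). */
class KeywordTrie {
    private static final class Node {
        final Map<String, Node> next = new HashMap<>();
        Integer keywordId;                 // non-null when a keyword ends at this node
    }

    private final Node root = new Node();

    /** Register one keyword, split on whitespace into its words. */
    void addKeyword(int keywordId, String keyword) {
        Node node = root;
        for (String word : keyword.trim().split("\\s+")) {
            node = node.next.computeIfAbsent(word, k -> new Node());
        }
        node.keywordId = keywordId;
    }

    /**
     * candidates.get(i) holds the keyword words judged close enough to the
     * i-th word of the line (an assumed hand-off from the fuzzy lookup).
     * Returns the IDs of keywords whose words all appear consecutively.
     */
    Set<Integer> match(List<Set<String>> candidates) {
        Set<Integer> found = new TreeSet<>();
        for (int start = 0; start < candidates.size(); start++) {
            walk(root, candidates, start, found);
        }
        return found;
    }

    private void walk(Node node, List<Set<String>> candidates, int pos, Set<Integer> found) {
        if (node.keywordId != null) found.add(node.keywordId);
        if (pos == candidates.size()) return;
        for (String word : candidates.get(pos)) {
            Node child = node.next.get(word);
            if (child != null) walk(child, candidates, pos + 1, found);
        }
    }
}
```

For a line such as "claen water", where the first word fuzzily maps to `clean` and the second to `water`, the trie rejects the keyword "clean towel" because no `towel` candidate follows `clean`, while "clean water" is accepted. Requiring the keyword's words to be adjacent and in order is a design choice of this sketch, not something the plan above specifies.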
My questions
Thanks in advance!