Most efficient way/library to detect predefined keywords in billions of lines?

Enn*_*oji 5 java lucene solr fuzzy-search string-matching

Say I have a few billion lines of text and a few million "keywords". The task is to go through those lines and see which lines contain which keywords. In other words, given a map (K1 -> V1) and a map (K2 -> V2), create a map (K2 -> K1), where K1 = lineID, V1 = text, K2 = keywordID, and V2 = keyword. Note also that:

  • All text/keywords are in English
  • The text (V1) may contain spelling mistakes.
  • Most keywords (V2) are single words, but some consist of more than one English word (e.g. "clean towel")
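For concreteness, a naive exact-match baseline might look like the sketch below (the line/keyword IDs and texts are made up; brute force like this is hopeless at billions-of-lines scale, but it pins down the desired K2 -> K1 output shape):

```java
import java.util.*;

public class NaiveKeywordMatch {
    /** Brute-force scan of (K1 -> V1) against (K2 -> V2), producing K2 -> set of K1. */
    static Map<Integer, Set<Integer>> match(Map<Integer, String> lines,
                                            Map<Integer, String> keywords) {
        Map<Integer, Set<Integer>> result = new TreeMap<>();
        for (Map.Entry<Integer, String> kw : keywords.entrySet())
            for (Map.Entry<Integer, String> line : lines.entrySet())
                if (line.getValue().contains(kw.getValue()))
                    result.computeIfAbsent(kw.getKey(), k -> new TreeSet<>())
                          .add(line.getKey());
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> lines = Map.of(       // K1 -> V1 (hypothetical)
                1, "please bring a clean towel",
                2, "the water here is clean");
        Map<Integer, String> keywords = Map.of(    // K2 -> V2 (hypothetical)
                10, "clean towel",
                11, "water");
        System.out.println(match(lines, keywords)); // {10=[1], 11=[2]}
    }
}
```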

So far, my initial idea for tackling this is as follows:

1) Chop up all my keywords into single words and 
   create a large set of single words (K3)
2) Construct a BK-Tree out of these chopped up keywords,
   using Levenshtein distance
3) For each line of data (V1), 
    3.1) Chop up the text (V1) into words
    3.2) For each said word,
        3.2.1) Retrieve words (K3) from the BK-Tree that
               are close enough to said word
    3.3) Since at this point we still have false positives,
        (e.g. we would have matched "clean" from "clean water" against
         keyword "clean towel"), we check all possible combinations
          using a trie of keywords (V2) to filter such false
          positives out. We construct this trie so that at the
          end of a successful match, the keywordID (K2) can be retrieved.
    3.4) Return the correct set of keywordID (K2) for this line (V1)!
4) Profit! 
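Steps 2 and 3.2.1 above (a BK-tree over the chopped-up keywords, queried with a Levenshtein tolerance) might be sketched like this; the word list and tolerance are made up for illustration:

```java
import java.util.*;

public class BKTree {
    private final String word;
    private final Map<Integer, BKTree> children = new HashMap<>();

    public BKTree(String root) { this.word = root; }

    public void add(String w) {
        int d = levenshtein(word, w);
        if (d == 0) return;                          // duplicate word, ignore
        BKTree child = children.get(d);
        if (child == null) children.put(d, new BKTree(w));
        else child.add(w);
    }

    /** All stored words within `tol` edits of `query` (step 3.2.1). */
    public List<String> search(String query, int tol) {
        List<String> out = new ArrayList<>();
        collect(query, tol, out);
        return out;
    }

    private void collect(String query, int tol, List<String> out) {
        int d = levenshtein(word, query);
        if (d <= tol) out.add(word);
        // Triangle inequality: only subtrees keyed by a distance in
        // [d - tol, d + tol] can contain candidates.
        for (int k = Math.max(1, d - tol); k <= d + tol; k++) {
            BKTree child = children.get(k);
            if (child != null) child.collect(query, tol, out);
        }
    }

    /** Plain Levenshtein distance via the two-row dynamic program. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1], curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] t = prev; prev = curr; curr = t;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        BKTree tree = new BKTree("clean");           // K3: chopped-up keyword words
        for (String w : new String[]{"towel", "water", "clear"}) tree.add(w);
        System.out.println(tree.search("cleen", 1)); // misspelled "clean" -> [clean]
    }
}
```

Only subtrees whose edge label lies within the tolerance window are visited, which is what makes the BK-tree cheaper than comparing the query against every word in K3.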

My questions

  • Is this a good approach? Efficiency matters a lot; is there a better way, or anything here that could be improved?
  • Are there any libraries I could use? Preferably something that works well with Java.

Thanks in advance!

pio*_*rek 0

There are optimized multi-pattern / 2D search algorithms. Don't reinvent the wheel. You should also consider distributing your computation. Maybe Hadoop and map/reduce?
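One classic multi-pattern algorithm is Aho-Corasick, which matches every keyword against a line in a single pass over its characters. Note it does exact matching only, so it would complement rather than replace the fuzzy BK-tree step. A minimal sketch, with made-up patterns:

```java
import java.util.*;

public class AhoCorasick {
    static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;                                   // longest proper suffix link
        List<String> output = new ArrayList<>();     // patterns ending here
    }

    private final Node root = new Node();

    public AhoCorasick(Collection<String> patterns) {
        // Build the keyword trie.
        for (String p : patterns) {
            Node cur = root;
            for (char c : p.toCharArray())
                cur = cur.next.computeIfAbsent(c, k -> new Node());
            cur.output.add(p);
        }
        // Set failure links breadth-first.
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) { child.fail = root; queue.add(child); }
        while (!queue.isEmpty()) {
            Node cur = queue.poll();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != null && !f.next.containsKey(c)) f = f.fail;
                child.fail = (f == null) ? root : f.next.get(c);
                child.output.addAll(child.fail.output); // inherit suffix matches
                queue.add(child);
            }
        }
    }

    /** Every pattern occurrence in `text`, found in one left-to-right pass. */
    public List<String> matches(String text) {
        List<String> found = new ArrayList<>();
        Node cur = root;
        for (char c : text.toCharArray()) {
            while (cur != root && !cur.next.containsKey(c)) cur = cur.fail;
            cur = cur.next.getOrDefault(c, root);
            found.addAll(cur.output);
        }
        return found;
    }

    public static void main(String[] args) {
        AhoCorasick ac = new AhoCorasick(List.of("clean", "clean towel", "water"));
        System.out.println(ac.matches("a clean towel and clean water"));
    }
}
```

The one-pass property is what makes it attractive here: the cost per line is proportional to the line length plus the number of matches, independent of how many millions of keywords are loaded.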