Efficient partial string matching in Scala/Spark

Gam*_*ows 2 string scala apache-spark

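I'm writing a small program in Spark using Scala and have run into a problem. I have a List/RDD of single-word strings and a List/RDD of sentences, which may or may not contain words from the single-word list. For example: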

val singles = Array("this", "is")
val sentence = Array("this Date", "is there something", "where are something", "this is a string")

I want to select the sentences that contain one or more of the single words. The result should look like this:

output[(this, Array(this Date, this is a string)),(is, Array(is there something, this is a string))]

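I have considered two approaches. One is to split each sentence and filter with .contains. The other is to split the sentences, format them into an RDD, and use .join to intersect the RDDs. I'm looking at about 50 single words and 5 million sentences; which approach would be faster? Are there other solutions? Could you also help me code it? My code produces no results (although it compiles and runs without errors).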

Shy*_*nki 5

You can create a set of the required keys, look up the keys in each sentence, and group by key.

val singles = Array("this", "is")

val sentences = Array("this Date", 
                      "is there something", 
                      "where are something", 
                      "this is a string")

val rdd = sc.parallelize(sentences) // create RDD

val keys = singles.toSet            // words required as keys.

val result = rdd.flatMap { sen =>
                    val words = sen.split(" ").toSet
                    val common = keys & words        // intersect
                    common.map(x => (x, sen))        // map as key -> sen
                }
                .groupByKey().mapValues(_.toArray)   // group values for a key
                .collect()                           // get rdd contents as array

// result (order of keys and of sentences within a group may vary):
// Array((this, Array(this Date, this is a string)),
//       (is,   Array(is there something, this is a string)))
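To try the matching logic without a cluster, here is a minimal sketch of the same idea on plain Scala collections. The names `KeywordMatch`, `matchKeys`, and `groupSentences` are hypothetical helpers introduced for illustration, not part of Spark. The sketch assumes that "contains a word" means a whole whitespace-separated token, so "is" does not match inside "island".

```scala
object KeywordMatch {
  // Hypothetical helper: returns one (keyword, sentence) pair for every
  // keyword that appears as a whole whitespace-separated token.
  def matchKeys(keys: Set[String], sentence: String): Set[(String, String)] = {
    val words = sentence.split("\\s+").toSet   // tokens of this sentence
    (keys & words).map(k => (k, sentence))     // keep only the required keys
  }

  // Local stand-in for flatMap + groupByKey on an RDD:
  // groups the matching sentences under each keyword.
  def groupSentences(keys: Set[String],
                     sentences: Seq[String]): Map[String, Seq[String]] =
    sentences
      .flatMap(s => matchKeys(keys, s))
      .groupBy(_._1)
      .map { case (k, pairs) => k -> pairs.map(_._2) }
}
```

On an RDD, the same function could be passed to flatMap, for example `rdd.flatMap(s => KeywordMatch.matchKeys(keys, s))`. With only about 50 keywords, the key set is small enough to ship to every executor (optionally through `sc.broadcast(keys)`), so the matching itself needs no join. Only the grouping step shuffles data.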