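I have been trying to convert an RDD to a DataFrame. For that, the element types need to be defined instead of being Any. I am using Spark MLlib's PrefixSpan, which is where freqSequence.sequence comes from. I start with a DataFrame that contains a session ID, plus views and purchases as string arrays: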
viewsPurchasesGrouped: org.apache.spark.sql.DataFrame =
[session_id: decimal(29,0), view_product_ids: array[string], purchase_product_ids: array[string]]
Then I compute the frequent patterns, and I need them in a DataFrame so that I can write them to a Hive table.
val viewsPurchasesRddString = viewsPurchasesGrouped.map( row => Array(Array(row(1)), Array(row(2)) ))
val prefixSpan = new PrefixSpan()
  .setMinSupport(0.001)
  .setMaxPatternLength(2)
val model = prefixSpan.run(viewsPurchasesRddString)
val freqSequencesRdd = sc.parallelize(model.freqSequences.collect())
case class FreqSequences(views: Array[String], purchases: Array[String], support: Long)
val viewsPurchasesDf = freqSequencesRdd.map( fs =>
  {
    val views = fs.sequence(0)(0)
    val purchases = fs.sequence(1)(0)
    val freq = fs.freq
    FreqSequences(views, purchases, freq)
  }
)
viewsPurchasesDf.toDF() // …
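For reference, here is a minimal sketch of one way the types could be kept as String all the way through. It is not taken from the code above, and it rests on several assumptions: it uses SparkSession (Spark 2.x or later) with a toy stand-in for viewsPurchasesGrouped, it simplifies session_id to a Long, it treats each session as a sequence of two itemsets (views, then purchases), and the FreqPattern case class is a hypothetical name. The key point is that row(i) returns Any, whereas row.getAs[Seq[String]](i) recovers the element type, so PrefixSpan is run with String items.

```scala
import org.apache.spark.mllib.fpm.PrefixSpan
import org.apache.spark.sql.SparkSession

// Hypothetical result type; the name and fields are assumptions for this sketch.
case class FreqPattern(views: Array[String], purchases: Array[String], support: Long)

object PrefixSpanToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("prefixspan-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for viewsPurchasesGrouped (session_id simplified to Long).
    val viewsPurchasesGrouped = Seq(
      (1L, Seq("a", "b"), Seq("x")),
      (2L, Seq("a"), Seq("x")),
      (3L, Seq("b"), Seq("y"))
    ).toDF("session_id", "view_product_ids", "purchase_product_ids")

    // getAs[Seq[String]] yields typed arrays, so this is an
    // RDD[Array[Array[String]]] rather than an RDD[Array[Array[Any]]].
    // distinct is applied because each inner array is meant to be an itemset.
    val sequences = viewsPurchasesGrouped.rdd.map { row =>
      Array(
        row.getAs[Seq[String]](1).distinct.toArray,
        row.getAs[Seq[String]](2).distinct.toArray
      )
    }.cache()

    val model = new PrefixSpan()
      .setMinSupport(0.5)        // high threshold only because the toy data is tiny
      .setMaxPatternLength(2)
      .run(sequences)

    // Assumption: only patterns with exactly two itemsets are kept, so that
    // sequence(0) and sequence(1) both exist. freqSequences is already an RDD,
    // so no collect-to-driver and re-parallelize step is used here.
    val patternsDf = model.freqSequences.collect {
      case fs if fs.sequence.length == 2 =>
        FreqPattern(fs.sequence(0), fs.sequence(1), fs.freq)
    }.toDF()

    patternsDf.printSchema()
    patternsDf.show(false)
    spark.stop()
  }
}
```

Whether two itemsets per session is the right encoding depends on what the patterns should mean; the code in the question instead wraps each whole array as a single item, which is why the item type there ends up as Any.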