Sam*_*m88 -1 scala dot-product apache-spark
I have two DataFrames in Spark Scala, and the second column of each one is an array of numbers.
val data22= Seq((1,List(0.693147,0.6931471)),(2,List(0.69314, 0.0)),(3,List(0.0, 0.693147))).toDF("ID","tf_idf")
data22.show(truncate=false)
+---+---------------------+
|ID |tf_idf |
+---+---------------------+
|1 |[0.693, 0.702] |
|2 |[0.69314, 0.0] |
|3 |[0.0, 0.693147] |
+---+---------------------+
val data12= Seq((1,List(0.693,0.805))).toDF("ID","tf_idf")
data12.show(truncate=false)
+---+--------------------+
|ID |tf_idf |
+---+--------------------+
|1 |[0.693, 0.805] |
+---+--------------------+
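I need to compute the dot product between the rows of these two DataFrames. That is, I need to multiply the tf_idf array in data12 with the tf_idf array of each row in data22.
(For example, the dot product for the first row should be: 0.693*0.693 + 0.702*0.805
Second row: 0.69314*0.693 + 0.0*0.805
Third row: 0.0*0.693 + 0.693147*0.805)
Basically I want something like a matrix multiplication: data22*transpose(data12)
I would appreciate it if someone could suggest a way to do this in Spark Scala.
Thanks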
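For Spark 2.4+: use the built-in array functions such as zip_with and aggregate, which make the code simpler. Based on your description, I changed the join to a crossJoin so that every row of data22 is paired with the single row of data12.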
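import org.apache.spark.sql.functions.expr
import spark.implicits._  // already in scope in spark-shell; needed elsewhere for toDF
val data22= Seq((1,List(0.693147,0.6931471)),(2,List(0.69314, 0.0)),(3,List(0.0, 0.693147))).toDF("ID","tf_idf")
val data12= Seq((1,List(0.693,0.805))).toDF("ID2","tf_idf2")
// Pair every row of data22 with the single row of data12.
val df = data22.crossJoin(data12).drop("ID2")
// Multiply the two arrays element-wise, then sum the products starting from 0.0.
df.withColumn("DotProduct", expr("aggregate(zip_with(tf_idf, tf_idf2, (x, y) -> x * y), 0D, (sum, x) -> sum + x)")).show(false)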
Here is the result.
+---+---------------------+--------------+-------------------+
|ID |tf_idf |tf_idf2 |DotProduct |
+---+---------------------+--------------+-------------------+
|1 |[0.693147, 0.6931471]|[0.693, 0.805]|1.0383342865 |
|2 |[0.69314, 0.0] |[0.693, 0.805]|0.48034601999999993|
|3 |[0.0, 0.693147] |[0.693, 0.805]|0.557983335 |
+---+---------------------+--------------+-------------------+
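To make it easier to see what the zip_with / aggregate expression computes, here is a minimal plain-Scala sketch of the same arithmetic on ordinary collections, with no Spark involved. The helper name dotProduct and the value names tfIdf2, rows and results are made up for this illustration; the input numbers are the ones used in the answer above. It is meant as a way to check the DotProduct column by hand in the Scala REPL or spark-shell, not as a replacement for the DataFrame code.

```scala
// Plain-Scala sketch of: aggregate(zip_with(a, b, (x, y) -> x * y), 0D, (sum, x) -> sum + x)
// `dotProduct` is a hypothetical helper name, not a Spark function.
def dotProduct(a: Seq[Double], b: Seq[Double]): Double =
  a.zip(b)                          // pair elements by position (the role of zip_with)
    .map { case (x, y) => x * y }   // multiply each pair
    .foldLeft(0.0)(_ + _)           // sum the products from 0.0 (the role of aggregate)

// The single vector from data12 and the three rows from data22.
val tfIdf2 = Seq(0.693, 0.805)
val rows = Seq(
  1 -> Seq(0.693147, 0.6931471),
  2 -> Seq(0.69314, 0.0),
  3 -> Seq(0.0, 0.693147)
)

// One dot product per row, keyed by ID (the role of the crossJoin).
val results = rows.map { case (id, tfIdf) => id -> dotProduct(tfIdf, tfIdf2) }
results.foreach { case (id, dp) => println(s"$id -> $dp") }
```

One difference to keep in mind: Scala's zip silently truncates to the shorter sequence, whereas Spark's zip_with pads the shorter array with nulls, so the two only agree when both arrays have the same length. If you are on a Spark version older than 2.4, wrapping a function like this with org.apache.spark.sql.functions.udf and applying it to the two array columns after the crossJoin should give the same column, though I have not tested that variant here.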