Computing cosine similarity on a Spark DataFrame

I am using Spark with Scala to compute the cosine similarity between the rows of a DataFrame.

The DataFrame schema is as follows:

root
    |-- SKU: double (nullable = true)
    |-- Features: vector (nullable = true)

A sample of the DataFrame:

    +-------+--------------------+
    |    SKU|            Features|
    +-------+--------------------+
    | 9970.0|[4.7143,0.0,5.785...|
    |19676.0|[5.5,0.0,6.4286,4...|
    | 3296.0|[4.7143,1.4286,6....|
    |13658.0|[6.2857,0.7143,4....|
    |    1.0|[4.2308,0.7692,5....|
    |  513.0|[3.0,0.0,4.9091,5...|
    | 3753.0|[5.9231,0.0,4.846...|
    |14967.0|[4.5833,0.8333,5....|
    | 2803.0|[4.2308,0.0,4.846...|
    |11879.0|[3.1429,0.0,4.5,4...|
    +-------+--------------------+

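I tried transposing the matrix and looked at the following links: "Apache Spark Python Cosine Similarity over DataFrames" and "calculating-cosine-similarity-by-text-into-vector-using-tf-idf". However, I believe there is a better solution.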

I tried the sample code below:

val irm = new IndexedRowMatrix(inClusters.rdd.map {
  case (v,i:Vector) => IndexedRow(v, i)
}).toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities

But I got the following error:

Error:(80, 12) constructor cannot be instantiated to expected type;
 found   : (T1, T2)
 required: …
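A likely cause of this error is that `DataFrame.rdd` yields `Row` objects, so a tuple pattern such as `case (v, i: Vector)` cannot match them. The sketch below shows one possible way around that by matching on `Row` instead. It rests on several assumptions that are not in the question: Spark 2.x or later (where a `vector` column holds `org.apache.spark.ml.linalg.Vector` values that must be converted for `IndexedRow`), SKU values that are whole numbers and can safely be cast to `Long`, and no null values in either column. The object name `CosineSimilaritySketch` and the three-row stand-in for `inClusters` are made up for illustration.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical wrapper object, used only to make the sketch self-contained.
object CosineSimilaritySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cosine-similarity-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up stand-in for the inClusters DataFrame from the question.
    val inClusters = Seq(
      (1.0, MLVectors.dense(1.0, 0.0, 2.0)),
      (2.0, MLVectors.dense(2.0, 0.0, 4.0)),
      (3.0, MLVectors.dense(0.0, 3.0, 0.0))
    ).toDF("SKU", "Features")

    val irm = new IndexedRowMatrix(inClusters.rdd.map {
      // Match on Row rather than on a tuple, because the RDD contains Row objects.
      // Assumption: neither column is null; a null would raise a MatchError here.
      case Row(sku: Double, features: MLVector) =>
        // Assumption: SKU is a whole number, so it can serve as the Long row index.
        // Vectors.fromML converts the ml vector to the mllib type IndexedRow expects.
        IndexedRow(sku.toLong, Vectors.fromML(features))
    })

    // columnSimilarities compares columns, so the matrix is transposed first:
    // after the transpose each column corresponds to one SKU.
    val similarities = irm.toCoordinateMatrix
      .transpose()
      .toRowMatrix()
      .columnSimilarities()

    // Each entry is MatrixEntry(i, j, value), where i and j are SKU indices.
    similarities.entries.collect().foreach(println)

    spark.stop()
  }
}
```

Because the SKU is used as the matrix index, the transposed matrix has as many columns as the largest SKU value plus one. If the SKU values are sparse or very large, mapping them to a dense index first may be a better choice.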

scala apache-spark apache-spark-sql apache-spark-mllib

Votes: 11 | Answers: 1 | Views: 3732