PySpark Array&lt;double&gt; is not Array&lt;double&gt;

Asked by Rya*_*yan · Tags: apache-spark, pyspark, apache-spark-ml

I'm running a very simple Spark (2.4.0 on Databricks) ML script:

from pyspark.ml.clustering import LDA

lda = LDA(k=10, maxIter=100).setFeaturesCol('features')
model = lda.fit(dataset)

but I get the following error:

IllegalArgumentException: 'requirement failed: Column features must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type array<double>.'

Why is my array&lt;double&gt; not an array&lt;double&gt;?

Here is the schema:

root
 |-- BagOfWords: struct (nullable = true)
 |    |-- indices: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- size: long (nullable = true)
 |    |-- type: long (nullable = true)
 |    |-- values: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: array (nullable = true)
 |    |-- element: double (containsNull = true)

Answered by 小智

You likely need to convert the column into Spark ML's vector form (`from pyspark.ml.feature import VectorAssembler` is the usual starting point). The confusing error message is most likely a nullability mismatch: LDA accepts `array<double>` with non-nullable elements (`containsNull = false`), while your `features` column has `containsNull = true`, and both print as `array<double>`.