Eva*_*mir 11 python apache-spark pyspark apache-spark-ml apache-spark-mllib
I get the following error when trying to build an ML Pipeline:
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType(DoubleType,true).'
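My features column contains an array of floating-point values. It sounds like I need to convert them to some type of vector (it is not sparse, so a DenseVector?). Is there a way to do this directly on the DataFrame, or do I need to convert to an RDD?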
zer*_*323 22
You can use a UDF:
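from pyspark.sql.functions import udf
# Vectors and VectorUDT come from the version-specific import shown next.
list_to_vector_udf = udf(lambda vs: Vectors.dense(vs), VectorUDT())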
With Spark < 2.0, import:
from pyspark.mllib.linalg import Vectors, VectorUDT
With Spark 2.0+, import:
from pyspark.ml.linalg import Vectors, VectorUDT
Note that these classes are not compatible with each other, even though their implementations are identical.
It is also possible to extract the individual features and combine them with VectorAssembler. Assuming the input column is called features:
from pyspark.ml.feature import VectorAssembler

n = ... # Size of features

assembler = VectorAssembler(
    inputCols=["features[{0}]".format(i) for i in range(n)],
    outputCol="features_vector")

assembler.transform(df.select(
    "*", *(df["features"].getItem(i) for i in range(n))
))