Creating LabeledPoints from a Spark DataFrame using PySpark


I have a Spark DataFrame with two columns, "label" and "sparse vector", obtained after applying CountVectorizer to a corpus of tweets.

When trying to train a random forest regression model, I found that it only accepts the LabeledPoint type.

Does anyone know how to convert my Spark DataFrame to LabeledPoint?


Which Spark version are you using? Recent versions of Spark use spark.ml rather than mllib, so you don't need LabeledPoint at all:

from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.sql import functions as F

# Input data: each row is a bag of words with an ID.
df = sqlContext.createDataFrame([
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
], ["id", "words"])

# fit a CountVectorizerModel from the corpus.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)

model = cv.fit(df)

result = model.transform(df).withColumn('label', F.lit(0))
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
rf.fit(result)

If you insist on using mllib:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

rdd = result \
          .rdd \
          .map(lambda row: LabeledPoint(row['label'], row['features'].toArray()))
RandomForest.trainClassifier(rdd, 2, {}, 3)