将数据帧拟合到randomForest pyspark中

ABK*_*ABK 3 python apache-spark pyspark apache-spark-ml

我有一个DataFrame看起来像这样:

+--------------------+------------------+
|            features|           labels |
+--------------------+------------------+
|[-0.38475, 0.568...]|          label1  |
|[0.645734, 0.699...]|          label2  |
|     .....          |          ...     |
+--------------------+------------------+
Run Code Online (Sandbox Code Playgroud)

两列都是String类型(StringType()),我想把它装入spark ml randomForest.为此,我需要将features列转换为包含浮点数的向量.有没有人知道怎么做?

eli*_*sah 6

如果您使用的是Spark 2.x,我相信这就是您所需要的:

from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors
from pyspark.ml.linalg import VectorUDT
from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame([("[-0.38475, 0.568]", "label1"), ("[0.645734, 0.699]", "label2")], ("features", "label"))

def parse(s):
  try:
    return Vectors.parse(s).asML()
  except:
    return None

parse_ = udf(parse, VectorUDT())

parsed = df.withColumn("features", parse_("features"))

indexer = StringIndexer(inputCol="label", outputCol="label_indexed")

indexer.fit(parsed).transform(parsed).show()
## +----------------+------+-------------+
## |        features| label|label_indexed|
## +----------------+------+-------------+
## |[-0.38475,0.568]|label1|          0.0|
## |[0.645734,0.699]|label2|          1.0|
## +----------------+------+-------------+
Run Code Online (Sandbox Code Playgroud)

使用Spark 1.6,它没有太大区别:

from pyspark.sql.functions import udf
from pyspark.ml.feature import StringIndexer
from pyspark.mllib.linalg import Vectors, VectorUDT

df = sqlContext.createDataFrame([("[-0.38475, 0.568]", "label1"), ("[0.645734, 0.699]", "label2")], ("features", "label"))

parse_ = udf(Vectors.parse, VectorUDT())

parsed = df.withColumn("features", parse_("features"))

indexer = StringIndexer(inputCol="label", outputCol="label_indexed")

indexer.fit(parsed).transform(parsed).show()
## +----------------+------+-------------+
## |        features| label|label_indexed|
## +----------------+------+-------------+
## |[-0.38475,0.568]|label1|          0.0|
## |[0.645734,0.699]|label2|          1.0|
## +----------------+------+-------------+
Run Code Online (Sandbox Code Playgroud)

Vectors有一个parse功能,可以帮助您实现您想要做的事情.