Apply StringIndexer to several columns in a PySpark Dataframe

Iva*_*van 35 python apache-spark pyspark

I have a PySpark DataFrame:

+-------+--------------+----+----+
|address|          date|name|food|
+-------+--------------+----+----+
|1111111|20151122045510| Yin|gre |
|1111111|20151122045501| Yin|gre |
|1111111|20151122045500| Yln|gra |
|1111112|20151122065832| Yun|ddd |
|1111113|20160101003221| Yan|fdf |
|1111111|20160703045231| Yin|gre |
|1111114|20150419134543| Yin|fdf |
|1111115|20151123174302| Yen|ddd |
|2111115|      20123192| Yen|gre |
+-------+--------------+----+----+

I want to convert it for use with pyspark.ml. I can use StringIndexer to convert the name column to a numeric category:

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="name", outputCol="name_index").fit(df)
df_ind = indexer.transform(df)
df_ind.show()
+-------+--------------+----+----------+----+
|address|          date|name|name_index|food|
+-------+--------------+----+----------+----+
|1111111|20151122045510| Yin|       0.0|gre |
|1111111|20151122045501| Yin|       0.0|gre |
|1111111|20151122045500| Yln|       2.0|gra |
|1111112|20151122065832| Yun|       4.0|ddd |
|1111113|20160101003221| Yan|       3.0|fdf |
|1111111|20160703045231| Yin|       0.0|gre |
|1111114|20150419134543| Yin|       0.0|fdf |
|1111115|20151123174302| Yen|       1.0|ddd |
|2111115|      20123192| Yen|       1.0|gre |
+-------+--------------+----+----------+----+

How do I transform several columns with StringIndexer (for example, name and food, each with its own StringIndexer) and then use VectorAssembler to generate the feature vector? Or do I have to create a StringIndexer for each column?

**EDIT**: This is not a dupe, because I need this done programmatically for several dataframes with different column names. I can't use VectorIndexer or VectorAssembler because the columns are not numeric.

**EDIT 2**: A tentative solution is

indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df).transform(df) for column in df.columns ]

This creates a list of three dataframes, each identical to the original plus the transformed column. Now I would need to join them to form the final dataframe, but that's very inefficient.

Iva*_*van 61

The best way I found to do this is to combine several StringIndexers in a list and use a Pipeline to execute them all:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df)
            for column in list(set(df.columns) - set(['date']))]

pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)

df_r.show()
+-------+--------------+----+----+----------+----------+-------------+
|address|          date|food|name|food_index|name_index|address_index|
+-------+--------------+----+----+----------+----------+-------------+
|1111111|20151122045510| gre| Yin|       0.0|       0.0|          0.0|
|1111111|20151122045501| gra| Yin|       2.0|       0.0|          0.0|
|1111111|20151122045500| gre| Yln|       0.0|       2.0|          0.0|
|1111112|20151122065832| gre| Yun|       0.0|       4.0|          3.0|
|1111113|20160101003221| gre| Yan|       0.0|       3.0|          1.0|
|1111111|20160703045231| gre| Yin|       0.0|       0.0|          0.0|
|1111114|20150419134543| gre| Yin|       0.0|       0.0|          5.0|
|1111115|20151123174302| ddd| Yen|       1.0|       1.0|          2.0|
|2111115|      20123192| ddd| Yen|       1.0|       1.0|          4.0|
+-------+--------------+----+----+----------+----------+-------------+
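To get the feature vector the question also asks about, the indexed columns can then be fed to a VectorAssembler. A minimal sketch, assuming the df_r produced by the pipeline above:

from pyspark.ml.feature import VectorAssembler

# assemble the indexed columns into a single "features" vector column
index_cols = ["address_index", "food_index", "name_index"]
assembler = VectorAssembler(inputCols=index_cols, outputCol="features")
df_features = assembler.transform(df_r)
df_features.select("features").show(truncate=False)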

  • Do you really need to `fit` the `indexers`? You run `fit` on the `Pipeline` anyway. (18 upvotes)
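As the comment suggests, the per-column fit is redundant; a sketch passing unfitted indexers straight to the Pipeline, which fits each stage once:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# unfitted StringIndexers as stages; Pipeline.fit() fits them all
indexers = [StringIndexer(inputCol=column, outputCol=column + "_index")
            for column in set(df.columns) - {"date"}]
df_r = Pipeline(stages=indexers).fit(df).transform(df)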

Nic*_*aro 6

With PySpark 3.0+ this is now easier and you can use the inputCols and outputCols options: https://spark.apache.org/docs/latest/ml-features#stringindexer

class pyspark.ml.feature.StringIndexer(
    *,
    inputCol=None,
    outputCol=None,
    inputCols=None,
    outputCols=None,
    handleInvalid='error',
    stringOrderType='frequencyDesc'
)
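A minimal usage sketch of the multi-column API, assuming the question's df and column names:

from pyspark.ml.feature import StringIndexer

# one StringIndexer handles all three columns at once (Spark 3.0+)
cols = ["name", "food", "address"]
indexer = StringIndexer(inputCols=cols,
                        outputCols=[c + "_index" for c in cols])
df_indexed = indexer.fit(df).transform(df)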