Tags: python, scala, apache-spark, apache-spark-ml, apache-spark-mllib
I have two separate DataFrames, each with several different processing stages that I apply with mllib transformers in a pipeline.
I now want to join these two pipelines together, keeping the features (columns) from each DataFrame.
Scikit-learn has the FeatureUnion class for this, but I can't seem to find an equivalent in mllib.
I could add a custom transformer stage at the end of one pipeline that takes the DataFrame produced by the other pipeline as an attribute and joins it in the transform method, but that seems messy.
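For concreteness, the workaround I have in mind would look roughly like the sketch below; DataFrameJoiner and the "id" join key are just placeholder names, not anything from the library.

from pyspark.ml import Transformer

class DataFrameJoiner(Transformer):
    # Hypothetical transformer: holds the DataFrame produced by the other
    # pipeline as an attribute and joins it onto the input in _transform.
    def __init__(self, other_df, on):
        super().__init__()
        self.other_df = other_df  # output of the other pipeline
        self.on = on              # assumed shared join key, e.g. "id"

    def _transform(self, dataset):
        return dataset.join(self.other_df, on=self.on, how="inner")

# e.g. DataFrameJoiner(other_df=pipeline2_output, on="id").transform(pipeline1_output)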
A Pipeline or PipelineModel is a valid PipelineStage, so they can be combined into a single Pipeline. For example:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
df = spark.createDataFrame([
    (1.0, 0, 1, 1, 0),
    (0.0, 1, 0, 0, 1)
], ("label", "x1", "x2", "x3", "x4"))

pipeline1 = Pipeline(stages=[
    VectorAssembler(inputCols=["x1", "x2"], outputCol="features1")
])

pipeline2 = Pipeline(stages=[
    VectorAssembler(inputCols=["x3", "x4"], outputCol="features2")
])
You can combine the Pipelines:
Pipeline(stages=[
    pipeline1, pipeline2,
    VectorAssembler(inputCols=["features1", "features2"], outputCol="features")
]).fit(df).transform(df)
+-----+---+---+---+---+---------+---------+-----------------+
|label|x1 |x2 |x3 |x4 |features1|features2|features |
+-----+---+---+---+---+---------+---------+-----------------+
|1.0 |0 |1 |1 |0 |[0.0,1.0]|[1.0,0.0]|[0.0,1.0,1.0,0.0]|
|0.0 |1 |0 |0 |1 |[1.0,0.0]|[0.0,1.0]|[1.0,0.0,0.0,1.0]|
+-----+---+---+---+---+---------+---------+-----------------+
or pre-fitted PipelineModels:
model1 = pipeline1.fit(df)
model2 = pipeline2.fit(df)

Pipeline(stages=[
    model1, model2,
    VectorAssembler(inputCols=["features1", "features2"], outputCol="features")
]).fit(df).transform(df)
+-----+---+---+---+---+---------+---------+-----------------+
|label| x1| x2| x3| x4|features1|features2| features|
+-----+---+---+---+---+---------+---------+-----------------+
| 1.0| 0| 1| 1| 0|[0.0,1.0]|[1.0,0.0]|[0.0,1.0,1.0,0.0]|
| 0.0| 1| 0| 0| 1|[1.0,0.0]|[0.0,1.0]|[1.0,0.0,0.0,1.0]|
+-----+---+---+---+---+---------+---------+-----------------+
So the approach I would recommend is to join the data up front, and then fit and transform the whole DataFrame.
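A minimal sketch of that recommendation, assuming the two source DataFrames (called df_a and df_b here) share an "id" column to join on:

# Join the raw data first, then fit and transform a single Pipeline over the
# combined DataFrame. df_a, df_b and the "id" key are assumptions for illustration.
combined = df_a.join(df_b, on="id", how="inner")

full_model = Pipeline(stages=[
    VectorAssembler(inputCols=["x1", "x2"], outputCol="features1"),
    VectorAssembler(inputCols=["x3", "x4"], outputCol="features2"),
    VectorAssembler(inputCols=["features1", "features2"], outputCol="features")
]).fit(combined)

result = full_model.transform(combined)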