Posts by Dav*_*bii

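How to perform one-hot encoding with PySpark

I am having trouble converting multiple columns from categorical to numerical values. I am using PySpark, but I am sure the problem is not the Spark version I am using. Converting a single column works without problems, but converting multiple columns fails. Here is the code; the data has no missing values: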

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

categorical_columns = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'poutcome', 'y']

# One StringIndexer per categorical column
indexers = [
    StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
    for c in categorical_columns
]

# One OneHotEncoder per indexed column
encoders = [
    OneHotEncoder(dropLast=False, inputCol=indexer.getOutputCol(),
                  outputCol="{0}_encoded".format(indexer.getOutputCol()))
    for indexer in indexers
]

# Vectorizing encoded values
assembler = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders], outputCol="features")

pipeline = Pipeline(stages=indexers + encoders + [assembler])
model = pipeline.fit(df2)
transformed = model.transform(df2)
transformed.show(5)

The output is:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-48-452b475faf1a> in <module>
     20 
     21 pipeline = Pipeline(stages=indexers + encoders+[assembler])
---> 22 model=pipeline.fit(df2)
     23 transformed = model.transform(df2)
     24 transformed.show(5)

E:\spark-2.4.2-bin-hadoop2.7\python\pyspark\ml\base.py in …
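For reference, here is a minimal plain-Python sketch of what the indexing and encoding stages in the question are meant to compute across several columns. It is not Spark's implementation and does not diagnose the truncated error above. The helper names (`fit_string_index`, `one_hot`, `encode_rows`) and the sample rows are hypothetical. It assumes labels are indexed by descending frequency, which is `StringIndexer`'s default ordering, with ties broken alphabetically here only to keep the sketch deterministic. As with `dropLast=False`, every category keeps its own slot.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Descending frequency; alphabetical tie-break is an assumption of this sketch.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, size):
    """Return a dense one-hot vector with one slot per category (no slot dropped)."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_rows(rows, columns):
    """Fit one index per column, then concatenate the one-hot vectors per row."""
    mappings = {c: fit_string_index([r[c] for r in rows]) for c in columns}
    features = []
    for r in rows:
        vec = []
        for c in columns:
            mapping = mappings[c]
            vec.extend(one_hot(mapping[r[c]], len(mapping)))
        features.append(vec)
    return mappings, features


# Hypothetical sample data using two of the column names from the question.
rows = [
    {"job": "admin", "marital": "married"},
    {"job": "admin", "marital": "single"},
    {"job": "technician", "marital": "married"},
]
mappings, features = encode_rows(rows, ["job", "marital"])
for row_features in features:
    print(row_features)
```

Each row ends up with one block of slots per column, concatenated in column order, which is the role `VectorAssembler` plays at the end of the pipeline.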

python apache-spark

2
votes
1
answer
9309
views

Tag statistics

apache-spark ×1

python ×1