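I am trying to plot the feature importances of some tree-based models together with the column names. I am using PySpark.
Since I have both text categorical variables and numeric ones, I had to use a pipeline approach, something like this:
Use VectorAssembler to create the features column containing the feature vector
Some sample code from the docs for steps 1, 2, 3: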
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categoricalColumns = ["workclass", "education", "marital_status",
                      "occupation", "relationship", "race", "sex", "native_country"]
stages = []  # stages in our Pipeline
for categoricalCol in categoricalColumns:
    # Category Indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + "Index")
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    # encoder = OneHotEncoderEstimator(inputCol=categoricalCol + "Index",
    #                                  outputCol=categoricalCol + "classVec")
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    # …
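
To make the goal concrete, here is a minimal sketch of the name-mapping step the question is about. It is plain Python and does not call Spark. It assumes the assembled features column carries metadata shaped like the `ml_attr` dictionary that VectorAssembler usually attaches (attribute groups such as "binary" and "numeric", each a list of `idx`/`name` entries). The helper `name_importances` and the sample values are hypothetical, for illustration only.

```python
def name_importances(ml_attr, importances):
    """Pair each feature index with its name and score, highest score first."""
    idx_to_name = {}
    # Attributes are grouped by type (e.g. "binary", "numeric"); flatten them.
    for group in ml_attr.get("attrs", {}).values():
        for attr in group:
            idx_to_name[attr["idx"]] = attr["name"]
    # Fall back to a positional placeholder name when an index has no metadata.
    pairs = [(idx_to_name.get(i, f"feature_{i}"), float(score))
             for i, score in enumerate(importances)]
    return sorted(pairs, key=lambda p: p[1], reverse=True)


# Hypothetical metadata and scores, made up for illustration (not real output).
example_attrs = {
    "attrs": {
        "binary": [{"idx": 0, "name": "raceclassVec_White"},
                   {"idx": 1, "name": "raceclassVec_Black"}],
        "numeric": [{"idx": 2, "name": "age"}],
    },
    "num_attrs": 3,
}
example_importances = [0.2, 0.1, 0.7]

ranked = name_importances(example_attrs, example_importances)
for name, score in ranked:
    print(name, score)
```

With a fitted tree model, the scores would presumably come from `model.featureImportances.toArray()` and the metadata from `df.schema["features"].metadata["ml_attr"]` on the transformed DataFrame. The column name "features" and the presence of that metadata are assumptions here.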