I am trying to run a linear regression in PySpark, and I want to build a table of summary statistics such as the coefficient, p-value, and t-value for every column in my dataset. To train the linear regression model, however, I had to use Spark's VectorAssembler to create a feature vector, so each row now has a single feature vector plus the target column. When I try to access Spark's built-in regression summary statistics, each one comes back as a raw list of numbers with no indication of which attribute corresponds to which value, which is very hard to work out by hand with a large number of columns. How can I map these values back to the column names?
For example, my current output looks like this:
Coefficients: [-187.807832407, -187.058926726, 85.1716641376, 10595.3352802, -127.258892837, -39.2827730493, -1206.47228704, 33.7078197705, 99.9956812528]
P-values: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.18589731365614548, 0.275173571416679, 0.0]
t-statistics: [-23.348593508995318, -44.72813283953004, 19.836508234714472, 144.49248881747755, -16.547272230754242, -9.560681351483941, -19.563547400189073, 1.3232383890822680, 1.0912415361190977, 20.383256127350474]
Coefficient standard errors: [8.043646497811427, 4.182131353367049, 4.293682291754585, 73.32793120907755, 7.690626652102948, 4.108783841348964, 61.669402913526625, 25.481445101737247, 91.63478289909655, 609.7007361468519]
These numbers are meaningless unless I know which attribute they correspond to. But as far as I can see, my DataFrame only has a single column called "features" containing rows of sparse vectors.
This is an even bigger problem when I have one-hot encoded features, because for an encoded variable of length n I get n corresponding coefficients/p-values/t-values, and so on.
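One way to recover the names (a sketch, not Spark's only mechanism for this): VectorAssembler writes an "ml_attr" attribute group into the metadata of the assembled output column, recording the name and vector position of every feature, including each slot of a one-hot encoded variable. Assuming the assembled training DataFrame is called train_df, the fitted model lr_model, and the assembled column "features":

import pandas as pd

def feature_names(df, features_col="features"):
    # VectorAssembler groups the attributes by type ("numeric", "binary",
    # "nominal"); each entry carries its vector index and its name
    attrs = df.schema[features_col].metadata["ml_attr"]["attrs"]
    pairs = [(a["idx"], a["name"]) for group in attrs.values() for a in group]
    return [name for _, name in sorted(pairs)]

names = feature_names(train_df)
summary = lr_model.summary

# pValues, tValues, and the standard errors carry one extra trailing entry
# for the intercept, which is why there are 10 of them for 9 coefficients
stats = pd.DataFrame({
    "feature": names + ["intercept"],
    "coefficient": list(lr_model.coefficients) + [lr_model.intercept],
    "p_value": summary.pValues,
    "t_value": summary.tValues,
    "std_error": summary.coefficientStandardErrors,
})
print(stats)

For one-hot encoded variables, the recovered names include the category level (the exact form depends on your encoder's output column names), so each of the n slots of an n-length encoding gets a readable label.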
I am trying to extract the feature importances from a random forest classifier that I trained using PySpark. I referred to the article below to get the feature importance scores for the random forest model I trained.

However, when I use the method described in that article, I get the following error:

'CrossValidatorModel' object has no attribute 'featureImportances'

Here is the code I used to train the model:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

cols = new_data.columns
stages = []

# Index the string label column
label_stringIdx = StringIndexer(inputCol='Bought_Fibre', outputCol='label')
stages += [label_stringIdx]

# Assemble the numeric columns into a single feature vector
numericCols = new_data.schema.names[1:-1]
assembler = VectorAssembler(inputCols=numericCols, outputCol="features")
stages += [assembler]

pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(new_data)

# fillna returns a new DataFrame, so the result has to be assigned back
new_data = new_data.fillna(0, subset=cols)
new_data = pipelineModel.transform(new_data)
new_data = new_data.fillna(0, subset=cols)
new_data.printSchema()

train_initial, test = new_data.randomSplit([0.7, 0.3], seed=1045)
train_initial.groupby('label').count().toPandas()
test.groupby('label').count().toPandas()

# Downsample the majority class before training
train_sampled = train_initial.sampleBy("label", fractions={0: 0.1, 1: 1.0}, seed=0)
# … (rest of the training code truncated in the original)
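The error itself has a direct fix: featureImportances is an attribute of the fitted random forest, not of the CrossValidatorModel that wraps it. A minimal sketch, assuming the fitted cross-validator is called cv_model and its estimator was either a bare RandomForestClassifier or a Pipeline whose last stage is the forest:

from pyspark.ml import PipelineModel

# CrossValidatorModel only wraps the winning model; unwrap it first
best_model = cv_model.bestModel

# If the estimator was a Pipeline, the forest is its last stage
if isinstance(best_model, PipelineModel):
    best_model = best_model.stages[-1]

print(best_model.featureImportances)  # vector aligned with "features"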
I am trying to plot the feature importances of some tree-based models together with the column names. I am using PySpark.
Since I have both textual categorical variables and numeric variables, I had to use a pipeline approach along these lines:

1. Index the categorical columns with StringIndexer
2. One-hot encode the indexed columns
3. Create the column containing the feature vector with VectorAssembler
Some sample code from the documentation for steps 1, 2, and 3:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categoricalColumns = ["workclass", "education", "marital_status",
                      "occupation", "relationship", "race", "sex", "native_country"]
stages = []  # stages in our Pipeline
for categoricalCol in categoricalColumns:
    # Category indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + "Index")
    # Use OneHotEncoderEstimator to convert categorical variables into
    # binary SparseVectors
    # encoder = OneHotEncoderEstimator(inputCol=categoricalCol + "Index",
    #                                  outputCol=categoricalCol + "classVec")
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    # … (rest of the loop truncated in the original)
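From here, the same metadata trick as in the first question turns the importances into a labelled plot: after the pipeline runs, the "features" column carries an "ml_attr" attribute group naming every slot of the vector, including each one-hot encoded level. A sketch, assuming dataset is the transformed DataFrame and model is the fitted tree-based model:

import pandas as pd
import matplotlib.pyplot as plt

# Recover feature names, in vector order, from the assembled column's metadata
attrs = dataset.schema["features"].metadata["ml_attr"]["attrs"]
pairs = [(a["idx"], a["name"]) for group in attrs.values() for a in group]
names = [name for _, name in sorted(pairs)]

# featureImportances is a vector aligned with the "features" column
importances = pd.DataFrame({
    "feature": names,
    "importance": model.featureImportances.toArray(),
}).sort_values("importance")

importances.plot.barh(x="feature", y="importance", figsize=(8, 10), legend=False)
plt.tight_layout()
plt.show()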