Related solutions (0)

Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they explain?

I am reducing the dimensionality of a Spark DataFrame with a PCA model using pyspark (the spark ml library), as follows:

from pyspark.ml.feature import PCA

pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)

where data is a labeled Spark DataFrame whose features column is a 3-dimensional DenseVector:

data.take(1)
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1')

After fitting, I transform the data:

transformed = model.transform(data)
transformed.first()
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1', pca_features=DenseVector([-0.33256, 0.8668, 0.625]))

My question is: how can I extract the eigenvectors of this PCA? How can I calculate how much variance they explain?
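
For reference, here is a minimal sketch of one way to get both quantities, assuming Spark 2.0 or later, where PCAModel exposes the principal components and the explained variance directly (the 1.5-era PySpark wrapper in the question did not yet surface either attribute):

from pyspark.ml.feature import PCA

pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)

# Each column of `pc` is one principal component, i.e. an eigenvector
# of the covariance matrix of the input features.
print(model.pc)                 # DenseMatrix, numFeatures x k

# Fraction of the total variance captured by each component.
print(model.explainedVariance)  # DenseVector of length k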

pca apache-spark apache-spark-sql pyspark apache-spark-ml

21 votes · 4 answers · 10k views

How to extract model hyper-parameters from spark.ml in PySpark?

I've been tinkering with some cross-validation code from the PySpark documentation, trying to get PySpark to tell me which model was selected:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)

Running this in the PySpark shell, I can get the logistic regression model's coefficients, but I can't seem to find the value of lr.regParam chosen by the cross-validation procedure. Any ideas?

In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: [] …
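One way out, sketched below, relies on the fact that CrossValidatorModel.avgMetrics lines up index-for-index with the param grid, so the winning grid entry can be recovered by position; on recent PySpark versions the fitted best model also answers getRegParam() directly (it did not on 1.x, hence the empty output above):

# BinaryClassificationEvaluator's default metric (areaUnderROC) is
# larger-is-better, hence max(); check evaluator.isLargerBetter() in
# the general case.
best_idx = max(range(len(cvModel.avgMetrics)),
               key=lambda i: cvModel.avgMetrics[i])
print(grid[best_idx])  # the winning ParamMap, including lr.regParam

# On recent PySpark versions the params are copied onto the fitted
# model, so this works as well:
print(cvModel.bestModel.getRegParam())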

modeling cross-validation pyspark apache-spark-ml apache-spark-mllib

21 votes · 2 answers · 10k views

How can I access the parameters of the underlying model in an ML Pipeline?

I have a DataFrame that I process with LinearRegression. If I work with it directly, as below, I can display the model's details:

val lr = new LinearRegression()
val lrModel = lr.fit(df)

lrModel: org.apache.spark.ml.regression.LinearRegressionModel = linReg_b22a7bb88404

println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
Coefficients: [0.9705748115939526] Intercept: 0.31041486689532866

However, if I use it inside a pipeline (as in the simplified example below),

val pipeline = new Pipeline().setStages(Array(lr))
val lrModel = pipeline.fit(df)

then I get the following error:

scala> lrModel
res9: org.apache.spark.ml.PipelineModel = pipeline_99ca9cba48f8

scala> println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
<console>:68: error: value coefficients is not a member of org.apache.spark.ml.PipelineModel
       println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
                                         ^
<console>:68: error: value intercept is not a member of org.apache.spark.ml.PipelineModel
       println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

I understand what it means (obviously I have a different class because of the pipeline), but I don't know how to get at the real underlying model.
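
For what it's worth, a PipelineModel simply holds its fitted stages in order, so the regression model can be pulled back out of stages and downcast; in Scala that is lrModel.stages(0).asInstanceOf[LinearRegressionModel]. A minimal sketch of the same idea in PySpark, assuming df has the usual features/label columns:

from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

lr = LinearRegression()
pipeline = Pipeline(stages=[lr])
model = pipeline.fit(df)  # df assumed to carry "features" and "label"

# PipelineModel.stages lists the fitted stages in declaration order;
# the fitted LinearRegressionModel is the last (and only) one here.
lr_model = model.stages[-1]
print(lr_model.coefficients, lr_model.intercept)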

scala apache-spark apache-spark-mllib

2 votes · 1 answer · 1,317 views