I am using PySpark (with the spark ml library) to reduce the dimensionality of a Spark DataFrame with a PCA model, as follows:
from pyspark.ml.feature import PCA
pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)
where data is a Spark DataFrame with a column named features, which is a DenseVector of 3 dimensions:
data.take(1)
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1')
After fitting, I transform the data:
transformed = model.transform(data)
transformed.first()
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1', pca_features=DenseVector([-0.33256, 0.8668, 0.625]))
My question is: how do I extract the eigenvectors of this PCA? And how can I calculate how much variance they explain?
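As far as I know, in Spark 2.0 and later the fitted PCAModel exposes the principal components as model.pc (a matrix with one component per column) and the proportion of variance explained as model.explainedVariance. To make clear what those two quantities are, here is a minimal NumPy sketch that computes them from the covariance matrix. It is an illustration, not Spark's implementation: only the first row of X comes from the question, and the other rows are made-up values.

```python
import numpy as np

# Hypothetical sample standing in for the `features` column.
# Only the first row is taken from the question; the rest are invented.
X = np.array([
    [0.4536, -0.43218, 0.9876],
    [0.1200,  0.56000, 0.3300],
    [0.9800, -0.12000, 0.4500],
    [0.3300,  0.77000, 0.1000],
    [0.6100, -0.25000, 0.8200],
])

# Sample covariance of the columns (rowvar=False: each row is an observation).
cov = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the direction with the largest variance comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # each column is one eigenvector

# Fraction of the total variance explained by each eigenvector.
explained_variance_ratio = eigenvalues / eigenvalues.sum()

print(eigenvectors)
print(explained_variance_ratio)
```

The signs of the eigenvectors are arbitrary, so a column may come out negated relative to what another library reports.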
I am tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me which model was selected:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
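Running this in the PySpark shell, I can get the coefficients of the logistic regression model, but I can't seem to find the value of lr.regParam that the cross-validation procedure selected. Any ideas?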
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])
In [4]: cvModel.bestModel.explainParams()
Out[4]: ''
In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}
In [15]: cvModel.params
Out[15]: []
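One possible workaround, assuming a Spark version in which the fitted CrossValidatorModel exposes avgMetrics (one averaged metric per entry of estimatorParamMaps, in the same order), is to pair the grid with those metrics and pick the best pair. The sketch below shows only that pairing logic in plain Python. The dicts stand in for the real param maps, the metric values are invented for illustration, and best_param_map is a hypothetical helper, not a PySpark function.

```python
# Stand-ins for the objects in the question: in PySpark these would be
# `grid` (built by ParamGridBuilder) and `cvModel.avgMetrics`.
reg_params = [0.1, 0.01, 0.001, 0.0001]
param_grid = [{"regParam": value} for value in reg_params]

# Hypothetical averaged metrics, one per parameter map, in grid order.
avg_metrics = [0.71, 0.74, 0.78, 0.77]


def best_param_map(param_maps, metrics, larger_is_better=True):
    """Return the (param_map, metric) pair with the best averaged metric."""
    pairs = list(zip(param_maps, metrics))
    pick = max if larger_is_better else min
    return pick(pairs, key=lambda pair: pair[1])


best_map, best_metric = best_param_map(param_grid, avg_metrics)
print(best_map, best_metric)
```

With the real objects, the call would presumably look like best_param_map(grid, cvModel.avgMetrics, evaluator.isLargerBetter()), since whether a larger metric is better depends on the evaluator.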
I have a DataFrame that I process with LinearRegression. If I fit it directly, as below, I can display the model's details:
val lr = new LinearRegression()
val lrModel = lr.fit(df)
lrModel: org.apache.spark.ml.regression.LinearRegressionModel = linReg_b22a7bb88404
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
Coefficients: [0.9705748115939526] Intercept: 0.31041486689532866
However, if I use it inside a pipeline (as in the simplified example below),
val pipeline = new Pipeline().setStages(Array(lr))
val lrModel = pipeline.fit(df)
then I get the following error.
scala> lrModel
res9: org.apache.spark.ml.PipelineModel = pipeline_99ca9cba48f8
scala> println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
<console>:68: error: value coefficients is not a member of org.apache.spark.ml.PipelineModel
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
^
<console>:68: error: value intercept is not a member of org.apache.spark.ml.PipelineModel
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
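I understand what it means (obviously I have a different class because of the pipeline), but I don't know how to get to the real underlying model.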