'SparkSession' object has no attribute 'serializer' when evaluating a classifier in PySpark

Question

Cob*_*bra 3 python apache-spark apache-spark-sql pyspark

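I am using Apache Spark in batch mode. I have set up a complete pipeline that converts text into TF-IDF vectors and then predicts a boolean class using logistic regression: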

from pyspark.ml import Pipeline

# Chain the previously created feature transformers, indexers and regression in a Pipeline
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf,
                            labelIndexer, featureIndexer, lr])

# Fit the full model to the training data
model = pipeline.fit(trainingData)

# Predict on the test data
predictions = model.transform(testData)

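I can inspect predictions, which is a Spark DataFrame, and it is exactly what I expect. Next, I want to look at the confusion matrix, so I convert the scores and labels to an RDD in order to pass it to BinaryClassificationMetrics():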

predictionAndLabels = predictions.select('prediction','label').rdd

Finally, I pass it to BinaryClassificationMetrics:

from pyspark.mllib.evaluation import BinaryClassificationMetrics

metrics = BinaryClassificationMetrics(predictionAndLabels)  # this line raises the error

Here is the error:

AttributeError: 'SparkSession' object has no attribute 'serializer'

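The error message is not helpful, and searching for it turns up a wide range of unrelated problems. The only thing I found that looks similar is this unanswered post: How to resolve the error "AttributeError: 'SparkSession' object has no attribute 'serializer'"?

Any help is appreciated!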
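
A side note on the stated goal: as far as I know, PySpark's BinaryClassificationMetrics exposes areaUnderROC and areaUnderPR rather than a confusion matrix. The counts themselves are simple to derive from (prediction, label) pairs. The following is a minimal plain-Python sketch, not Spark code: confusion_counts and sample_pairs are hypothetical names, and the sample rows are made up to stand in for collected rows of predictions.select('prediction','label').

```python
from collections import Counter


def confusion_counts(pairs):
    """Count binary classification outcomes.

    `pairs` is an iterable of (prediction, label) tuples holding 0.0 or 1.0,
    in the same column order as select('prediction', 'label') in the question.
    """
    # Key each outcome by (label, prediction) so the four cells are explicit.
    counts = Counter((float(label), float(pred)) for pred, label in pairs)
    return {
        "tn": counts[(0.0, 0.0)],  # label 0, predicted 0
        "fp": counts[(0.0, 1.0)],  # label 0, predicted 1
        "fn": counts[(1.0, 0.0)],  # label 1, predicted 0
        "tp": counts[(1.0, 1.0)],  # label 1, predicted 1
    }


# Hypothetical (prediction, label) rows, invented for illustration only.
sample_pairs = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
result = confusion_counts(sample_pairs)
```

On a real DataFrame the same tally would normally be done in Spark itself, for example by grouping on the label and prediction columns and counting, so that nothing has to be collected to the driver.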

Answer 1

Cob*_*bra 8

For posterity, here is what I did to fix this. When initializing the Spark session and the SQL context, I was doing the following, which is wrong:

sc = SparkSession.builder.appName('App Name').master("local[*]").getOrCreate()
sqlContext = SQLContext(sc)

The problem was solved by doing this instead:

sc = SparkSession.builder.appName('App Name').master("local[*]").getOrCreate()
sqlContext = SQLContext(sparkContext=sc.sparkContext, sparkSession=sc)

I am not sure why this has to be stated explicitly. If anyone knows, clarification from the community is welcome.
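
One plausible explanation, offered as an assumption rather than a confirmed reading of the PySpark source: the first positional parameter of SQLContext is the SparkContext, and the variable sc above actually holds a SparkSession. If the constructor stores whatever it is given without checking its type, later code that expects a context and reads its serializer attribute would fail in exactly this way. The toy classes below are hypothetical stand-ins written only to illustrate that mechanism; they are not PySpark classes.

```python
class ToyContext:
    """Stand-in for a SparkContext: it carries a serializer."""
    serializer = "pickle"


class ToySession:
    """Stand-in for a SparkSession: wraps a context, has no serializer itself."""
    def __init__(self):
        self.sparkContext = ToyContext()


class ToySQLContext:
    """Stand-in for SQLContext: stores its first argument as the context, unchecked."""
    def __init__(self, sparkContext, sparkSession=None):
        self._sc = sparkContext
        self.sparkSession = sparkSession


def serializer_of(sql_ctx):
    # Downstream code assumes _sc is a context and reads .serializer from it.
    return sql_ctx._sc.serializer


session = ToySession()

# Mirrors the failing setup: a session is passed where a context is expected.
try:
    serializer_of(ToySQLContext(session))
    failed = False
    message = ""
except AttributeError as exc:
    failed = True
    message = str(exc)

# Mirrors the fix: pass the underlying context and the session explicitly.
fixed = ToySQLContext(sparkContext=session.sparkContext, sparkSession=session)
ok = serializer_of(fixed)
```

If this reading is right, the explicit keyword arguments matter only because they hand SQLContext the real context via sc.sparkContext. In Spark 2.x and later the SparkSession is itself the usual entry point for DataFrame work, so a separate SQLContext is often unnecessary.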
