woo*_*r13 4 python apache-spark pyspark
I'm trying to check that pyspark runs correctly on my system, but when I call fit on my data I get the error "requirement failed: Nothing has been added to this summarizer."
import findspark
import os
spark_location='/usr/local/spark/'
java8_location= '/usr/lib/jvm/java-8-openjdk-amd64'
os.environ['JAVA_HOME'] = java8_location
findspark.init(spark_home=spark_location)
import pyspark, itertools, string, datetime, math
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession
from pyspark.mllib.evaluation import RegressionMetrics
from pyspark.sql.functions import isnan, isnull, when, count, col
def main():
    spark = pyspark.sql.SparkSession.builder.appName("test").getOrCreate()
    sc = spark.sparkContext
    #data = spark.read.option("inferSchema", True).option("header", True).csv("ml-20m/ratings.csv").drop("timestamp")
    data = spark.read.option("inferSchema", True).option("header", True).csv("ml-20m/ratings_test.csv").drop("timestamp")
    train,test= data.randomSplit([0.8, 0.2])
    print("before als")
    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop", nonnegative=True)
    print("before param_grid")
    param_grid = ParamGridBuilder().addGrid(als.rank, [12,13,14]).addGrid(als.maxIter, [18,19,20]).addGrid(als.regParam, [.17,.18,.19]).build()
    #################### RMSE ######################
    print("before evaluator")
    evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
    print("before cv")
    cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=3)
    print("before fit")
    model = cv.fit(train)
    model = model.bestModel
    print("before transform")
    predictions = model.transform(test)
    print("before rmse")
    rmse = evaluator.evaluate(predictions)
    print("RMSE", rmse)
    print("rank", model.rank)
    print("MaxIter", model._java_obj.parent().getMaxIter())
    print("RegParam", model._java_obj.parent().getRegParam())
main()
I checked the DataFrame to make sure it contains no Null or NaN values.
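I had the same error, and then realized that my test set was empty (the split had not worked the way I expected).
Make sure that both your training set and your test set actually contain rows.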
train,test= data.randomSplit([0.8, 0.2])
After running that line, call train.show() and test.show() to confirm that both DataFrames contain rows.
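To make that check fail fast rather than relying on eyeballing `show()` output, something like the sketch below might help. The helper name `check_splits` is hypothetical (it is not part of PySpark); it only assumes each split has a `.count()` method that returns an integer, which a PySpark DataFrame does. Note that `randomSplit` samples rows at random, so the weights are only approximate and a very small input file can leave one side empty.

```python
def check_splits(**splits):
    """Count the rows in each named split and raise if any split is empty.

    Hypothetical helper: each value only needs a .count() method that
    returns an int, as a PySpark DataFrame does.
    """
    sizes = {name: df.count() for name, df in splits.items()}
    empty = [name for name, n in sizes.items() if n == 0]
    if empty:
        raise ValueError(f"empty split(s): {', '.join(empty)}; sizes: {sizes}")
    return sizes


# Intended use inside the question's main(), before cv.fit(train):
#   train, test = data.randomSplit([0.8, 0.2], seed=42)  # seed makes the split repeatable
#   print(check_splits(train=train, test=test))
```

If both splits are non-empty and the error persists, it may be worth running the same check on `model.transform(test)`: with `coldStartStrategy="drop"`, rows for users or movies that were not seen during training are dropped from the predictions, so on a tiny dataset the evaluator can still end up with nothing to score.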