I have the following Python test code (the arguments to ALS.train are defined elsewhere):
r1 = (2, 1)
r2 = (3, 1)
test = sc.parallelize([r1, r2])
model = ALS.train(ratings, rank, numIter, lmbda)
predictions = model.predictAll(test)
print test.take(1)
print predictions.count()
print predictions
This works, in that the predictions RDD has a count of 1, and it outputs:
[(2, 1)]
1
ParallelCollectionRDD[2691] at parallelize at PythonRDD.scala:423
However, when I try to use an RDD that I created myself with the following code, it no longer seems to work:
model = ALS.train(ratings, rank, numIter, lmbda)
validation_data = validation.map(lambda xs: tuple(int(x) for x in xs))
predictions = model.predictAll(validation_data)
print validation_data.take(1)
print predictions.count()
print validation_data
Which outputs:
[(61, 3864)]
0
PythonRDD[4018] at RDD at PythonRDD.scala:43
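As you can see, predictAll returns an empty RDD when it is passed the mapped RDD. The values going in are all in the same format. The only noticeable difference I can see is that the first example uses parallelize and produces a …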
machine-learning apache-spark rdd pyspark apache-spark-mllib
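One cause worth ruling out for the empty `predictAll` result above: as far as I know, `MatrixFactorizationModel` looks up factors by joining the requested pairs against the user and product IDs seen during training, so a pair whose user or product never appeared in `ratings` yields no row at all. The sketch below mimics that check on plain Python lists. It is not Spark code: `split_predictable` and the toy data are hypothetical, and the question does not show how `ratings` or `validation` were built, so this is an assumption about the cause rather than a confirmed diagnosis.

```python
def split_predictable(train_ratings, pairs):
    """Split (user, product) pairs by whether both IDs occur in the training data."""
    known_users = {user for user, _product, _rating in train_ratings}
    known_products = {product for _user, product, _rating in train_ratings}
    predictable, dropped = [], []
    for user, product in pairs:
        # A pair survives only if BOTH IDs were seen in training,
        # which is what an inner join on each ID amounts to.
        if user in known_users and product in known_products:
            predictable.append((user, product))
        else:
            dropped.append((user, product))
    return predictable, dropped


# Hypothetical toy data: (user, product, rating) triples used for training.
train_ratings = [(2, 1, 5.0), (3, 1, 4.0), (2, 7, 3.0)]
# Pairs to score; (61, 3864) uses IDs that are absent from train_ratings.
pairs = [(2, 1), (3, 1), (61, 3864)]

predictable, dropped = split_predictable(train_ratings, pairs)
print(predictable)
print(dropped)
```

If every validation pair lands in `dropped`, a count of 0 would be expected. On the real RDDs, the same comparison can be done by collecting the distinct user and product IDs from `ratings` and filtering `validation_data` against them.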
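I am using the ALS algorithm (implicitPrefs = True) in Spark for a recommender system. Normally, after running this algorithm, the predicted values should lie between 0 and 1. But I am getting values greater than 1: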
"usn" : 72164,
"recommendations" : [
{
"item_code" : "C1346",
"rating" : 0.756096363067627
},
{
"item_code" : "C0117",
"rating" : 0.966064214706421
},
{
"item_code" : "I0009",
"rating" : 1.00000607967377
},
{
"item_code" : "C0102",
"rating" : 0.974934458732605
},
{
"item_code" : "I0853",
"rating" : 1.03272235393524
},
{
"item_code" : "C0103",
"rating" : 0.928574025630951
}
]
I don't understand why some of the rating values are greater than 1 ("rating": 1.00000607967377 and "rating": 1.03272235393524).
There are some similar questions, but I still don't understand them: MLLib spark -ALStrainImplicit value …
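On the implicit-feedback question above: in matrix factorization the predicted score is the dot product of a user factor vector and an item factor vector. With implicit preferences the factors are fit by regularized least squares toward 0/1 targets, and to my knowledge nothing clamps the result to [0, 1], so scores slightly above 1 or below 0 are possible. The sketch below shows only the arithmetic. The factor values are invented for illustration and do not come from any trained model, and `predict_score` is a hypothetical helper, not a Spark API.

```python
def predict_score(user_factors, item_factors):
    """Return the dot product of two equal-length factor vectors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))


# Hypothetical rank-2 factors, chosen by hand for illustration only.
user = [0.9, 0.6]
items = {
    "A": [0.8, 0.4],   # lands inside [0, 1]
    "B": [1.0, 0.3],   # lands above 1
    "C": [-0.2, 0.1],  # lands below 0
}

scores = {code: predict_score(user, factors) for code, factors in items.items()}
for code, score in sorted(scores.items()):
    print(code, score)
```

Since these scores are normally used to rank items for a user, a value just above 1 does not change the ordering. If a bounded number is needed downstream, clipping to [0, 1] after prediction is one option.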