Asked by Cur*_*ead

Spark error: Total size of serialized results of XXXX tasks (2.0 GB) is bigger than spark.driver.maxResultSize (2.0 GB)

Error:

ERROR TaskSetManager: Total size of serialized results of XXXX tasks (2.0 GB) is bigger than spark.driver.maxResultSize (2.0 GB)
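This error means the combined size of task results being sent back to the driver exceeded the `spark.driver.maxResultSize` limit. A common first step (a hedged sketch, not from the question itself; the application name and the `4g` value are illustrative) is to raise that limit when constructing the context:

```scala
// Sketch: raising the driver result-size cap at context creation.
// "4g" is an illustrative value; the default is 1g, and "0" disables the limit.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("RecOverlap") // hypothetical app name
  .set("spark.driver.maxResultSize", "4g")
val sc = new SparkContext(conf)
```

Raising the limit only buys headroom; if the job fundamentally ships too much data to the driver, the aggregation itself should stay distributed.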

Goal: get recommendations for every user from the model, intersect them with each user's test data, and produce an overlap ratio.

I built a recommendation model with Spark MLlib. For each user I compare the recommended items against that user's test data, and then compute the average overlap ratio across users.
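As a tiny, Spark-free illustration of the metric described above (the flag values are made up, not from the question's data): the average overlap ratio is the fraction of users whose recommendations hit at least one of their test items.

```scala
// Per-user 0/1 hit flags, averaged over all users.
// 1 = at least one recommended item appeared in that user's test set.
val perUserHit = Seq(1, 0, 1, 1) // hypothetical flags for four users
val avgOverlapRatio = perUserHit.sum.toDouble / perUserHit.size
println(avgOverlapRatio) // prints 0.75
```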

  def overlapRatio(model: MatrixFactorizationModel, test_data: org.apache.spark.rdd.RDD[Rating]): Double = {

    // Group the held-out test items by user: userId -> itemIds
    val testData: RDD[(Int, Iterable[Int])] = test_data.map(r => (r.user, r.product)).groupByKey
    val n = testData.count

    // Top-20 recommendations per user from the trained model
    val recommendations: RDD[(Int, Array[Int])] = model.recommendProductsForUsers(20)
      .mapValues(_.map(r => r.product))

    // Emit 1 if at least one recommended item appears in the user's test set, else 0
    val overlaps = testData.join(recommendations).map(x => {
      val moviesPerUserInRecs = x._2._2.toSet
      val moviesPerUserInTest = x._2._1.toSet
      val localHitRatio = moviesPerUserInRecs.intersect(moviesPerUserInTest)
      if (localHitRatio.size > 0)
        1
      else
        0
    }).filter(x => x != …
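The per-user hit test inside the `map` above can be sketched on plain Scala collections, without Spark (the item ids below are hypothetical):

```scala
// Minimal, Spark-free sketch of the per-user hit logic:
// a user counts as a "hit" if any recommended item is in their test set.
val recs = Set(10, 20, 30) // top-N recommended item ids for one user
val test = Set(30, 40)     // held-out test item ids for the same user
val hit  = if (recs.intersect(test).nonEmpty) 1 else 0
println(hit) // prints 1: item 30 appears in both sets
```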

scala apache-spark apache-spark-mllib

7 votes · 1 answer · 2590 views
