相关疑难解决方法(0)

从任务中调用Java/Scala函数

背景

我原来的问题是为什么使用DecisionTreeModel.predict内部地图功能会引发异常？并且与如何使用MLlib在Spark上生成(原始标签,预测标签)的元组有关？

当我们使用Scala API时,推荐RDD[LabeledPoint]使用预测的方法DecisionTreeModel是简单地映射RDD:

val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

Run Code Online (Sandbox Code Playgroud)

遗憾的是,PySpark中的类似方法效果不佳:

labelsAndPredictions = testData.map(
    lambda lp: (lp.label, model.predict(lp.features))
labelsAndPredictions.first()

Run Code Online (Sandbox Code Playgroud)

例外:您似乎尝试从广播变量,操作或转换引用SparkContext.SparkContext只能在驱动程序上使用,而不能在工作程序上运行的代码中使用.有关更多信息,请参阅SPARK-5063.

而不是官方文档推荐这样的东西:

predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)

Run Code Online (Sandbox Code Playgroud)

那么这里发生了什么？此处没有广播变量,Scala API定义predict如下:

/**
 * Predict values for a single data point using the model trained.
 *
 * @param features array representing …

Run Code Online (Sandbox Code Playgroud)

python scala apache-spark pyspark apache-spark-mllib

zer*_*323

2017 08-31

37
推荐指数

1
解决办法

9913
查看次数

UDF的Pyspark错误：py4j.Py4JException：方法getnewargs （[]）不存在错误

我正在尝试解决以下错误（我正在使用Databricks平台和Spark 2.0）

tweets_cleaned.createOrReplaceTempView("tweets_cleanedSQL")
def Occ(keyword):
  occurences = spark.sql("SELECT * \
                                FROM tweets_cleanedSQL \
                                WHERE LOWER(text) LIKE '%" + keyword + "%' \
                            ")
  return occurences.count()


occurences_udf = udf(Occ)

Run Code Online (Sandbox Code Playgroud)

如果运行此代码，则会收到以下错误：

py4j.Py4JException：方法getnewargs（[]）不存在==>仅在尝试定义udf时发生错误。

python apache-spark pyspark databricks

Yor*_*ghe

2016 11-29

2
推荐指数

1
解决办法

3304
查看次数