How do I use TwitterUtils in the Spark shell?

hel*_*elm 9 apache-spark

I am trying to use TwitterUtils in the Spark shell (they are not available there by default).

I added the following to spark-env.sh:

SPARK_CLASSPATH="/disk.b/spark-master-2014-07-28/external/twitter/target/spark-streaming-twitter_2.10-1.1.0-SNAPSHOT.jar"

I can now run

import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._

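in the shell without errors, which would not be possible if the JAR had not been added to the classpath ("error: object twitter is not a member of package org.apache.spark.streaming"). However, I get an error when I try to create a stream in the Spark shell: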

scala> val ssc = new StreamingContext(sc, Seconds(1))
ssc: org.apache.spark.streaming.StreamingContext =
org.apache.spark.streaming.StreamingContext@6e78177b

scala> val tweets = TwitterUtils.createStream(ssc, "twitter.txt")
error: bad symbolic reference. A signature in TwitterUtils.class refers to
term twitter4j in package <root> which is not available.
It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling
TwitterUtils.class.

What am I missing? Do I have to add another JAR?

Nic*_*mas 7

Yes, in addition to the spark-streaming-twitter JAR you already have, you need the Twitter4J JARs. Specifically, the Spark developers suggest using Twitter4J version 3.0.3.

Once you have downloaded the right JARs, pass them to the shell with the --jars flag. I think you could also do this through SPARK_CLASSPATH, as you did.

Here is how I did it on a Spark EC2 cluster:

#!/bin/bash
cd /root/spark/lib
mkdir twitter4j

# Get the Spark Streaming JAR.
curl -O "http://search.maven.org/remotecontent?filepath=org/apache/spark/spark-streaming-twitter_2.10/1.0.0/spark-streaming-twitter_2.10-1.0.0.jar"

# Get the Twitter4J JARs. Check out http://twitter4j.org/archive/ for other versions.
TWITTER4J_SOURCE=twitter4j-3.0.3.zip
curl -O "http://twitter4j.org/archive/$TWITTER4J_SOURCE"
unzip -j ./$TWITTER4J_SOURCE "lib/*.jar" -d twitter4j/
rm $TWITTER4J_SOURCE

cd
# Point the shell to these JARs and go!
# ls -m separates entries with ", " and wraps long lines, so strip spaces as well as newlines.
TWITTER4J_JARS=$(ls -m /root/spark/lib/twitter4j/*.jar | tr -d ' \n')
/root/spark/bin/spark-shell --jars /root/spark/lib/spark-streaming-twitter_2.10-1.0.0.jar,$TWITTER4J_JARS
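The last step of the script builds the comma-separated JAR list that --jars expects. As a minimal sketch, the same list can be built with a plain Bash loop instead of parsing ls output. The helper name join_jars is hypothetical and not part of Spark; the paths in the usage comment are the ones from the script above.

```shell
#!/bin/bash
# join_jars is a hypothetical helper (not part of Spark): it prints every
# *.jar file in the given directory as one comma-separated string.
join_jars() {
  local dir="$1" out="" jar
  for jar in "$dir"/*.jar; do
    # If the directory has no JARs the glob stays unexpanded; skip it.
    [ -e "$jar" ] || continue
    # Prepend a comma only when the list already has an entry.
    out="${out:+$out,}$jar"
  done
  printf '%s\n' "$out"
}

# Example usage, assuming the directory layout from the script above:
# /root/spark/bin/spark-shell --jars "/root/spark/lib/spark-streaming-twitter_2.10-1.0.0.jar,$(join_jars /root/spark/lib/twitter4j)"
```

Quoting the whole --jars argument keeps it as a single word even if a path contains spaces.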