I am trying to run a Spark job with spark-submit. When I run it from Eclipse, the job runs without any issues. But when I copy the same jar file to a remote machine and run the job there, I get the following error:
17/08/09 10:19:15 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-10-50-70-180.ec2.internal): java.io.InvalidClassException: org.apache.spark.executor.TaskMetrics; local class incompatible: stream classdesc serialVersionUID = -2231953621568687904, local class serialVersionUID = -6966587383730940799
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1829)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1986)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:253)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I went through some other links on SO and tried the following:
Changed the Spark artifact versions from the Scala 2.10 builds I was using before to the 2.11 builds. The dependencies in the pom now look like this:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.0.2</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.0.2</version>
    <scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-yarn_2.10 -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-yarn_2.11</artifactId>
    <version>2.0.2</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.11</artifactId>
    <version>2.0.2</version>
    <scope>provided</scope>
</dependency>
I also checked that version 2.11-2.0.2 is present in Spark's jars folder, as suggested in some of the links.
I also set the dependencies to the provided scope, as suggested in several links.
None of the above helped. Any help would be much appreciated, as I am stuck on this issue. Thanks in advance. Cheers.
Edit 1: Here is the spark-submit command:
spark-submit --deploy-mode cluster --class "com.abc.ingestion.GenericDeviceIngestionSpark" /home/hadoop/sathiya/spark_driven_ingestion-0.0.1-SNAPSHOT-jar-with-dependencies.jar "s3n://input-bucket/input-file.csv" "SIT" "accessToken" "UNKNOWN" "bundleId" "[{"idType":"D_ID","idOrder":1,"isPrimary":true},{"idType":"HASH_DEVICE_ID","idOrder":2,"isPrimary":false}]"
Edit 2:
I also tried adding the field serialVersionUID = -2231953621568687904L; to the relevant class, but that did not solve the problem.
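For illustration only, the attempt in Edit 2 looked roughly like the sketch below; the class name and fields are hypothetical, since the actual class from my project is not shown here:

import java.io.Serializable;

// Hypothetical class standing in for the one modified in Edit 2.
public class IngestionRecord implements Serializable {

    // Pinning serialVersionUID to the value reported in the stream classdesc.
    // This only helps when the mismatched class is one of your own; it cannot
    // fix an incompatibility inside Spark's own classes such as
    // org.apache.spark.executor.TaskMetrics.
    private static final long serialVersionUID = -2231953621568687904L;

    private String deviceId;
    private int idOrder;
    private boolean primary;
}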
I finally solved the issue. I commented out all the dependencies and uncommented them one at a time. First I uncommented the spark-core dependency and the problem was gone. Then I uncommented another dependency in the project, and that brought the problem back. On investigating, I found that this second dependency in turn depended on a different version (2.10) of spark-core, which was causing the issue. I added an exclusion to that dependency as shown below:
<dependency>
    <groupId>com.data.utils</groupId>
    <artifactId>data-utils</artifactId>
    <version>1.0-SNAPSHOT</version>
    <exclusions>
        <exclusion>
            <groupId>javax.ws.rs</groupId>
            <artifactId>javax.ws.rs-api</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
        </exclusion>
    </exclusions>
</dependency>
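As a side note (not part of the original fix), this kind of conflicting transitive dependency can usually be spotted with Maven's dependency tree, filtering on the Spark group id, along these lines:

mvn dependency:tree -Dincludes=org.apache.spark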
That solved the issue. Posting this in case anyone else gets stuck on the same problem. Thanks to @JosePraveen for the valuable comment, which gave me the hint.