Ati*_*ska (-1) · tags: java, hadoop, amazon-s3, apache-spark
I am trying to read data from AWS S3 into a Dataset/RDD in Java. I run the Spark code from IntelliJ, so I have also added the Hadoop dependencies to pom.xml.
Below are my code and my pom.xml.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class SparkJava {
    public static void main(String[] args) {
        // AWS_KEY and AWS_SECRET_KEY stand in for the actual credential strings
        SparkSession spark = SparkSession
                .builder()
                .master("local")
                .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
                .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
                .config("fs.s3n.awsAccessKeyId", AWS_KEY)
                .config("fs.s3n.awsSecretAccessKey", AWS_SECRET_KEY)
                .getOrCreate();

        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
        String input_path = "s3a://bucket/2018/07/28";
        JavaRDD<String> s3aRdd = sc.textFile(input_path);
        long count = s3aRdd.count(); // THIS IS CAUSING EXCEPTION
        System.out.print(count);
        System.out.print("Finished");
    }
}
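A note not in the original post: the session above registers the S3AFileSystem but supplies credentials through the legacy fs.s3n.* keys, which the s3a connector does not read; s3a expects fs.s3a.access.key and fs.s3a.secret.key. Below is a minimal sketch of the same read with s3a-consistent credentials, using the Dataset API the question also mentions. The bucket path is a placeholder and the credentials are assumed to be exported in the standard AWS environment variables:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class SparkS3ADataset {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .master("local")
                // fs.s3a.access.key / fs.s3a.secret.key are the keys the
                // S3AFileSystem actually reads; values assumed to be in the
                // standard AWS environment variables
                .config("spark.hadoop.fs.s3a.access.key", System.getenv("AWS_ACCESS_KEY_ID"))
                .config("spark.hadoop.fs.s3a.secret.key", System.getenv("AWS_SECRET_ACCESS_KEY"))
                .getOrCreate();

        // textFile on the DataFrameReader returns Dataset<String>
        Dataset<String> lines = spark.read().textFile("s3a://bucket/2018/07/28");
        System.out.println(lines.count());
        spark.stop();
    }
}

Leaving credentials out of the code entirely and letting the s3a credential provider chain pick them up from the environment is the usual choice outside of quick local tests.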
Here are the dependencies from pom.xml:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>3.1.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>3.1.1</version>
    </dependency>
</dependencies>
In this case there is no version mismatch of the kind described in this question: NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities while reading S3 data with Spark.
Apart from the setup above, the problem was resolved by adding the following dependency to pom.xml:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.1.1</version>
</dependency>