Reading data from S3 on a local machine - pyspark


from pyspark.sql import SparkSession
import boto3
import os
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Configure the S3A connector on Spark's underlying Hadoop configuration
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", "myaccesskey")
hadoop_conf.set("fs.s3a.secret.key", "mysecretkey")
hadoop_conf.set("fs.s3a.endpoint", "s3.amazonaws.com")
# This provider class is what the error below complains about
hadoop_conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.connection.ssl.enabled", "true")

conn = boto3.resource("s3", region_name="us-east-1")

df = spark.read.csv("s3a://mani-test-1206/test/test.csv", header=True)
df.show()

spark.stop()

When I run the code above, I get the following error:

java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider not found

Hadoop and AWS jars the program is using:

Spark-Hadoop distribution: spark-3.2.0-bin-hadoop3.2

Hadoop jars:
hadoop-annotations-3.2.0.jar
hadoop-auth-3.2.0.jar
hadoop-aws-3.2.0.jar
hadoop-client-api-3.3.1.jar
hadoop-client-runtime-3.3.1.jar
hadoop-common-3.2.0.jar
hadoop-hdfs-3.2.0.jar

AWS jars:
aws-java-sdk-1.11.624.jar
aws-java-sdk-core-1.11.624.jar
aws-java-sdk-dynamodb-1.11.624.jar
aws-java-sdk-s3-1.11.624.jar
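Note that the jar list above mixes 3.2.0 and 3.3.1 Hadoop artifacts; version mismatches on the classpath are themselves a common source of S3A ClassNotFoundException errors. One way to avoid hand-picking jars is to let Spark resolve hadoop-aws (and, transitively, a matching AWS SDK bundle) at startup. A minimal sketch, assuming the driver can reach Maven Central, with the 3.2.2 version taken from the answer below:

from pyspark.sql import SparkSession

# Resolve hadoop-aws and its AWS SDK dependency from Maven Central
# instead of copying individual jars; the version should match the
# Hadoop build your Spark distribution was compiled against.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    .getOrCreate()
)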

Any help would be appreciated, thanks.

Answer (score 7):

I had the same problem. What helped me:

  • Update hadoop-aws from 3.2.0 to version 3.2.2
  • Set "fs.s3a.aws.credentials.provider" to "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider" (it looks like the class was renamed); see the sketch after this list