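I started a Hadoop cluster.
I get this warning message: $HADOOP_HOME is deprecated
I added export HADOOP_HOME_WARN_SUPPRESS="TRUE" to hadoop-env.sh.
When I start the cluster, I no longer see any warning message.
However, when I run hadoop dfsadmin -report, the warning shows up again.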

I am trying to have my Spark Streaming application read its input from an S3 directory, but I keep getting this exception after launching it with the spark-submit script:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.fs.s3native.$Proxy6.initialize(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.spark.streaming.StreamingContext.checkpoint(StreamingContext.scala:195)
at MainClass$.main(MainClass.scala:1190)
at MainClass.main(MainClass.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
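at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) …

I want to read an S3 file from my (local) machine through Spark (pyspark, really). Right now I keep getting authentication errors such as
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
I have looked everywhere here and on the web and tried many things, but apparently S3 has been changing over the last year or months, and every method failed except one: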
pyspark.SparkContext().textFile("s3n://user:password@bucket/key")
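(Note the s3n [s3 did not work].) Now, I don't want to use a URL with the user and password in it, because they can end up in logs, and I am also not sure how to get them from the ~/.aws/credentials file.
So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now standard ~/.aws/credentials file (ideally, without copying the credentials to yet another configuration file)?
PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …, and that did not work either.
PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). I did try many ways of setting them, though: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.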
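To illustrate the "use ~/.aws/credentials without copying it elsewhere" part of the question, here is a minimal sketch that reads the key pair from the shared credentials file with Python's standard configparser and maps it onto the two property names quoted in the error message. The helper names load_aws_credentials and s3n_properties are hypothetical, and the sketch assumes the file uses the usual INI layout with aws_access_key_id and aws_secret_access_key under a [default] profile.

```python
import configparser
import os


def load_aws_credentials(path=None, profile="default"):
    """Return (access_key, secret_key) read from an AWS shared credentials file.

    Assumes the INI layout normally found in ~/.aws/credentials.
    """
    path = path or os.path.expanduser("~/.aws/credentials")
    parser = configparser.ConfigParser()
    if not parser.read(path):  # read() returns the list of files it could parse
        raise FileNotFoundError(path)
    section = parser[profile]  # raises KeyError if the profile is missing
    return section["aws_access_key_id"], section["aws_secret_access_key"]


def s3n_properties(access_key, secret_key):
    """Map the key pair onto the property names quoted in the exception."""
    return {
        "fs.s3n.awsAccessKeyId": access_key,
        "fs.s3n.awsSecretAccessKey": secret_key,
    }
```

How the resulting properties reach Hadoop is a separate step and depends on the Spark version. One route that is often used from pyspark is to call sc._jsc.hadoopConfiguration().set(name, value) for each pair; note that _jsc is not a public attribute, so treat that as an assumption to verify rather than a guaranteed API.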
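
A stock hadoop2.6.0 install gives me no filesystem for scheme: s3n. Adding hadoop-aws.jar to the classpath now gives me ClassNotFoundException: org.apache.hadoop.fs.s3a.S3AFileSystem.
I have a mostly stock install of hadoop-2.6.0. I have only set up the directories and set the following environment variables: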
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
export HADOOP_COMMON_HOME=/opt/hadoop
export HADOOP_HOME=$HADOOP_COMMON_HOME
export HADOOP_HDFS_HOME=$HADOOP_COMMON_HOME
export HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME
export HADOOP_OPTS=-XX:-PrintWarnings
export PATH=$PATH:$HADOOP_COMMON_HOME/bin
The output of hadoop classpath is:
/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/opt/hadoop/share/hadoop/tools/lib/*
When I try to run hadoop distcp -update hdfs:///files/to/backup s3n://${S3KEY}:${S3SECRET}@bucket/files/to/backup, I get Error: java.io.Exception, no filesystem for scheme: s3n. If I use s3a, I get the same error complaining about s3a.
The internet tells me that hadoop-aws.jar is not part of the default classpath. I added the following line to /opt/hadoop/etc/hadoop/hadoop-env.sh:
HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_COMMON_HOME/share/hadoop/tools/lib/*
Now hadoop classpath has the following appended:
:/opt/hadoop/share/hadoop/tools/lib/*
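Whether a classpath string actually reaches a given jar can be checked mechanically. The sketch below is a rough approximation, assuming the standard JVM rule that an entry ending in /* picks up every .jar or .JAR file directly inside that directory (not recursively). The function name classpath_covers_jar is hypothetical, and the check only inspects the string; it does not touch the filesystem.

```python
import posixpath


def classpath_covers_jar(classpath, jar_path):
    """Return True if a colon-separated classpath string would include jar_path."""
    jar_dir, jar_name = posixpath.split(jar_path)
    for entry in classpath.split(":"):
        if not entry:
            continue  # skip empty segments such as a leading ':'
        if entry == jar_path:
            return True  # the jar is listed explicitly
        # "dir/*" is the JVM wildcard for the jar files directly inside dir
        if (entry.endswith("/*")
                and entry[:-2].rstrip("/") == jar_dir
                and jar_name.endswith((".jar", ".JAR"))):
            return True
    return False
```

An entry such as /contrib/capacity-scheduler/*.jar is deliberately not treated as a wildcard here, since only a bare * as the last path component is expanded by the JVM.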
This should cover /opt/hadoop/share/hadoop/tools/lib/hadoop-aws-2.6.0.jar. Now I get:
Caused by: java.lang.ClassNotFoundException:
Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
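With a ClassNotFoundException like this one, a first sanity check is whether the jar really ships the class. A minimal Python sketch of that check, assuming only that a jar is an ordinary zip archive whose entries mirror the package path (the helper name jar_contains_class is hypothetical):

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar has a .class entry for the fully qualified class."""
    # org.apache.Foo is stored inside the archive as org/apache/Foo.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

If this returns True and the class still cannot be loaded, the jar itself is not the problem, and the process that throws is most likely not seeing the jar on its classpath.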
The jar file does contain the class that cannot be found:
unzip -l /opt/hadoop/share/hadoop/tools/lib/hadoop-aws-2.6.0.jar |grep S3AFileSystem
28349 2014-11-13 …

I am using PyCharm 2018.1 with Python 3.4 and Spark 2.3, installed via pip in a virtualenv. There is no Hadoop installation on the local host, so there is no Spark installation either (and thus no SPARK_HOME, HADOOP_HOME, etc.).
When I try this:
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()\
    .setMaster("local")\
    .setAppName("pyspark-unittests")\
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf = conf)
inputFile = sc.textFile("s3://somebucket/file.csv")
I get:
py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
: java.io.IOException: No FileSystem for scheme: s3
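How can I read from S3 when running pyspark in local mode, without a full Hadoop installation on the local machine?
FWIW, this works well when I run it on an EMR node in non-local mode.
The following does not work (same error, although it does resolve and download the dependencies):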
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:3.1.0" pyspark-shell'
from pyspark import SparkConf
from pyspark import SparkContext
conf = …