在开发期间,我一直在"客户端"模式下运行我的火花作业.我使用"--file"与执行程序共享配置文件.Driver正在本地读取配置文件.现在我想以"集群"模式部署作业.我现在很难与驱动程序共享配置文件.
例如,我将配置文件名称作为extraJavaOptions传递给驱动程序和执行程序.我正在使用SparkFiles.get()读取文件
val configFile = org.apache.spark.SparkFiles.get(System.getProperty("config.file.name"))
Run Code Online (Sandbox Code Playgroud)
这在执行程序上运行良好但在驱动程序上失败.我认为文件只与执行程序共享,而不是与运行驱动程序的容器共享.一种选择是将配置文件保存在S3中.我想检查是否可以使用spark-submit实现这一点.
> spark-submit --deploy-mode cluster --master yarn --driver-cores 2
> --driver-memory 4g --num-executors 4 --executor-cores 4 --executor-memory 10g \
> --files /home/hadoop/Streaming.conf,/home/hadoop/log4j.properties \
> --conf **spark.driver.extraJavaOptions**="-Dlog4j.configuration=log4j.properties
> -Dconfig.file.name=Streaming.conf" \
> --conf **spark.executor.extraJavaOptions**="-Dlog4j.configuration=log4j.properties
> -Dconfig.file.name=Streaming.conf" \
> --class ....
Run Code Online (Sandbox Code Playgroud) 我有一个在EMR上运行的Spark工作需要很长时间.Spark任务本身运行速度很快.当我将结果保存到S3时,花费超过20分钟这样做...
16/02/05 01:44:44 INFO latency: StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 561CA7CD8C009E79), S3 Extended Request ID: B3dMnYkxE/QSZsD1VREBf5FR+uH8m5k2Tb8zZ+Y0+VFgQFSwRJjPEWV7wX61+9ZiJhY5nf35Rx8=], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[561CA7CD8C009E79], ServiceEndpoint=[https://my-bucket-name.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[12.766], HttpRequestTime=[12.494], HttpClientReceiveResponseTime=[11.067], RequestSigningTime=[0.103], CredentialsRequestTime=[0.001], HttpClientSendRequestTime=[0.071],
16/02/05 01:44:44 INFO latency: StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[F84316D0C1958276], ServiceEndpoint=[https://my-bucket-name.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[16.001], HttpRequestTime=[13.1], HttpClientReceiveResponseTime=[11.69], RequestSigningTime=[0.085], CredentialsRequestTime=[0.001], ResponseProcessingTime=[2.673], HttpClientSendRequestTime=[0.071],
16/02/05 01:44:44 INFO S3NativeFileSystem: rename s3://my-bucket-name/stati/data/output/bidder4/_temporary/0/task_201602050130_0014_m_000001/organization_id=100932/impression_date=2016-01-01/part-r-00001-0e84d8cb-4b43-4cc3-b95e-65b1b9c12f25.gz.parquet s3://my-bucket-name/stati/data/output/bidder4/organization_id=100932/impression_date=2016-01-01/part-r-00001-0e84d8cb-4b43-4cc3-b95e-65b1b9c12f25.gz.parquet
16/02/05 01:44:44 INFO latency: StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not …Run Code Online (Sandbox Code Playgroud)