Amazon EMR and YARN deploy mode

Net*_*cks 3 amazon-web-services amazon-emr hadoop-yarn pyspark

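I am learning the basics of Spark, and to test my PySpark application I created an EMR instance on AWS with Spark, YARN, Hadoop and Oozie. I was able to run a simple PySpark application successfully with spark-submit from the driver node. I have the default /etc/spark/conf/spark-defaults.conf file that AWS created, which uses the YARN ResourceManager. Everything runs fine and I can also monitor the tracking URL. But I cannot tell whether the Spark job is running in "client" mode or "cluster" mode. How do I determine that?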

Excerpt from /etc/spark/conf/spark-defaults.conf:

spark.master                     yarn                                                                                                            
spark.driver.extraLibraryPath    /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native                                                       
spark.executor.extraClassPath    :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar    
spark.executor.extraLibraryPath  /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///var/log/spark/apps
spark.history.fs.logDirectory    hdfs:///var/log/spark/apps
spark.sql.warehouse.dir          hdfs:///user/spark/warehouse
spark.sql.hive.metastore.sharedPrefixes com.amazonaws.services.dynamodbv2
spark.yarn.historyServer.address ip-xx-xx-xx-xx.ec2.internal:18080 
spark.history.ui.port            18080
spark.shuffle.service.enabled    true 
spark.driver.extraJavaOptions    -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
spark.sql.emr.internal.extensions com.amazonaws.emr.spark.EmrSparkSessionExtensions                                                              
spark.executor.memory            4743M                                                                                                           
spark.executor.cores             2                                                                                                               
spark.yarn.executor.memoryOverheadFactor 0.1875
spark.driver.memory              2048M

Excerpt from my PySpark job:

import os.path
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf   
from boto3.session import Session 

conf = SparkConf().setAppName('MyFirstPySparkApp')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext 
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", ACCESS_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", SECRET_KEY) 
spark._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
....# access S3 bucket
....
....

Is there a deploy mode called "yarn-client", or are there only "client" and "cluster"? Also, why doesn't AWS specify "num-executors" in the config file? Is that something I need to add?

Thanks

Lam*_*nus 5

It is determined by the options you pass when you submit the job; see the documentation.
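To make that concrete, here is a minimal sketch of how the deploy mode is chosen at submit time. It only builds the spark-submit command line and does not run it. The helper name spark_submit_command and the script name my_app.py are made up for illustration; --master and --deploy-mode are the standard spark-submit options. As far as I know, since Spark 2.0 the old "yarn-client" / "yarn-cluster" master values are deprecated in favor of --master yarn plus --deploy-mode client or cluster, and spark-submit falls back to client when --deploy-mode is omitted.

```python
import shlex


def spark_submit_command(script, deploy_mode="client"):
    """Build a spark-submit argv list for YARN (hypothetical helper)."""
    # Only these two deploy modes exist; "yarn-client" is not a deploy mode.
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        script,
    ]


# my_app.py is a placeholder script name.
cmd = spark_submit_command("my_app.py", "cluster")
print(shlex.join(cmd))
```

In client mode the driver runs in the process that called spark-submit (here, the node you are logged into); in cluster mode YARN starts the driver inside the application master container.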

After opening the Spark History Server from the EMR console or the web server, you can find the spark.submit.deployMode option in the Environment tab. In my case, it was client mode.
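If you would rather check from inside the job than through the History Server, something along these lines should work. This is a sketch, not tested on EMR: the helper name deploy_mode is made up, and it assumes the conf object exposes get(key, default), which PySpark's SparkConf does (a plain dict stands in for it below so the snippet runs without a cluster).

```python
def deploy_mode(conf):
    """Return the deploy mode recorded in a SparkConf-like object."""
    # spark-submit stores --deploy-mode under this key;
    # when it is unset, Spark defaults to client mode.
    return conf.get("spark.submit.deployMode", "client")


# Stand-in for a real SparkConf, for illustration only.
example_conf = {"spark.master": "yarn", "spark.submit.deployMode": "cluster"}
print(deploy_mode(example_conf))

# In a real job (requires pyspark and a running session):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   print(deploy_mode(spark.sparkContext.getConf()))
```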
