Mri*_*lla 5 gradle amazon-web-services amazon-emr apache-spark
I have a Spark application that runs locally. These are the dependencies I have:
dependencies {
    implementation "org.scala-lang:scala-library:${scalaVersion}"
    implementation "org.apache.spark:spark-sql_2.12:${sparkVersion}"
    implementation "org.apache.spark:spark-launcher_2.12:${sparkVersion}"
    implementation "org.apache.spark:spark-catalyst_2.12:${sparkVersion}"
    implementation "org.apache.spark:spark-streaming_2.12:${sparkVersion}"
    implementation "org.apache.spark:spark-core_2.12:${sparkVersion}"
    implementation group: 'org.apache.spark', name: 'spark-mllib_2.12', version: "${sparkVersion}"
    implementation group: 'org.apache.spark', name: 'spark-hive_2.12', version: "${sparkVersion}"
    implementation group: 'org.apache.spark', name: 'spark-yarn_2.12', version: "${sparkVersion}"
    testImplementation group: 'org.apache.spark', name: 'spark-catalyst_2.12', version: "${sparkVersion}"
    implementation group: 'org.apache.hadoop', name: 'hadoop-aws', version: hadoop_version
    implementation group: 'org.mongodb.spark', name: 'mongo-spark-connector_2.12', version: '3.0.1'
    testImplementation "com.holdenkarau:spark-testing-base_2.12:${sparkVersion}_1.1.0"
}
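The EMR release documentation lists a version matrix at https://docs.amazonaws.cn/en_us/emr/latest/ReleaseGuide/emr-release-6x.html that explains which dependencies are provided.
The documentation also says that a separate Maven-compatible repository should be used: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-artifact-repository.html. I am not sure whether I should add it like this (alongside Maven Central):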
repositories {
    maven {
        url "https://s3.us-east-1.amazonaws.com/us-east-1-emr-artifacts/emr-6.2.0/repos/maven/"
    }
    mavenCentral()
}
As far as I understand, EMR already provides certain dependencies, and those dependencies carry custom Spark enhancements.
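To make the idea concrete: Gradle's `compileOnly` configuration is one way to mark a library as "provided by the runtime", so that it is available at compile time but left out of the packaged artifact. The fragment below is only a sketch of that idea. Which artifacts EMR 6.2.0 actually provides is an assumption here and would need to be checked against the release matrix, and the choice of which lines to move is illustrative.

```groovy
dependencies {
    // Assumption: these Spark modules are already installed on the EMR cluster,
    // so they are needed for compilation only and should not be bundled.
    compileOnly "org.apache.spark:spark-core_2.12:${sparkVersion}"
    compileOnly "org.apache.spark:spark-sql_2.12:${sparkVersion}"

    // compileOnly dependencies are not on the test runtime classpath,
    // so local tests need the same modules declared again.
    testImplementation "org.apache.spark:spark-core_2.12:${sparkVersion}"
    testImplementation "org.apache.spark:spark-sql_2.12:${sparkVersion}"

    // Assumption: the MongoDB connector is not provided by EMR,
    // so it stays a regular dependency and ends up in the package.
    implementation "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1"
}
```

With a split like this, only the `implementation` dependencies would be candidates for the packaged output, while the cluster's own Spark build is used at runtime.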
My cluster was created with the following command:
aws emr create-cluster --auto-scaling-role myprod-emr-auto-scaling --applications Name=Hadoop Name=Hive Name=Pig Name=Hue Name=Spark --bootstrap-actions '[{"Path":"s3://my-emr-bootstrap/sshkeys.sh","Name":"Add ssh keys"}]' --ebs-root-volume-size 20 --ec2-attributes '{"KeyName":"mridang_test","InstanceProfile":"myprod-emr","ServiceAccessSecurityGroup":"sg-xxxxxxxx","SubnetId":"subnet-xxxxxxxx","EmrManagedSlaveSecurityGroup":"sg-xxxxxxxx","EmrManagedMasterSecurityGroup":"sg-xxxxxxxx"}' --service-role myprod-emr-service --enable-debugging --release-label emr-6.2.0 --log-uri 's3n://aws-logs-0123456789-us-east-1/elasticmapreduce/' --name 'MridangTest' --instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"CORE","InstanceType":"c5a.xlarge","Name":"Core - 2"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"MASTER","InstanceType":"c5a.xlarge","Name":"Master - 1"}]' --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-east-1
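How should my build be configured so that only the necessary libraries are included in my project ZIP? There is quite a bit of collateral material on this topic, but nothing that explains how to actually configure it.
Unfortunately, my setup is tied to Gradle.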