I'm using AWS Glue with PySpark and want to add some configuration settings to the SparkSession, for example:

"spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem"
"spark.hadoop.fs.s3a.multiobjectdelete.enable", "false"
"spark.serializer", "org.apache.spark.serializer.KryoSerializer"
"spark.hadoop.fs.s3a.fast.upload", "true"

The code I use to initialize the context is the following:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
From what I understand of the documentation, I should add these settings as job parameters when submitting the Glue job. Is that the only way, or can they also be added when initializing Spark?
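To make the question concrete, this is the alternative I have in mind — a minimal sketch that builds a SparkConf up front and passes it to the context before GlueContext wraps it (I don't know whether Glue honors settings applied this way, which is exactly what I'm asking):

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Assumption: these settings can be applied at context creation time
# rather than as --conf job parameters.
conf = SparkConf() \
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .set("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false") \
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .set("spark.hadoop.fs.s3a.fast.upload", "true")

glueContext = GlueContext(SparkContext.getOrCreate(conf=conf))
spark = glueContext.spark_session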
I have the following CSV:
field1;field2;field3;field4;field5;field6;field7;field8;field9;field10;field11;field12;
eu;4523;35353;01/09/1999; 741 ; 386 ; 412 ; 86 ; 1.624 ; 1.038 ; 469 ; 117 ;
I want to convert it to Avro. I created the following Avro schema:
{"namespace": "forecast.avro",
"type": "record",
"name": "forecast",
"fields": [
{"name": "field1", "type": "string"},
{"name": "field2", "type": "string"},
{"name": "field3", "type": "string"},
{"name": "field4", "type": "string"},
{"name": "field5", "type": "string"},
{"name": "field6", "type": "string"},
{"name": "field7", "type": "string"},
{"name": "field8", "type": "string"},
{"name": "field9", "type": "string"},
{"name": "field10", "type": "string"},
{"name": "field11", "type": "string"},
{"name": "field12", "type": …Run Code Online (Sandbox Code Playgroud)