The DataFrame has two columns (s3ObjectName, batchName) and tens of thousands of rows, for example:
| s3ObjectName | batchName |
|---|---|
| a1.json | 45 |
| b2.json | 45 |
| c3.json | 45 |
| d4.json | 46 |
| e5.json | 46 |
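The goal is to use foreachPartition() and foreach() to retrieve the objects from the S3 bucket and write them to the data lake in parallel, using the details in each row of the DataFrame.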
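For illustration only, the per-partition pattern described above might look roughly like the sketch below. It assumes a local SparkSession and string-typed columns. `FakeClient` and its `fetch` method are hypothetical stand-ins for a real S3 client, and the `println` stands in for the data lake write. It goes through `df.rdd.foreachPartition` to sidestep the overload ambiguity that `Dataset.foreachPartition` can hit on Scala 2.12.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object ForeachPartitionSketch {
  // Hypothetical stand-in for an S3 client; a real job would build one
  // with AmazonS3ClientBuilder inside the partition function instead.
  final class FakeClient {
    def fetch(name: String): String = s"contents of $name"
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("foreachPartition-sketch")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Sample rows mirroring the table in the question (types assumed to be strings).
    val df = Seq(
      ("a1.json", "45"), ("b2.json", "45"), ("c3.json", "45"),
      ("d4.json", "46"), ("e5.json", "46")
    ).toDF("s3ObjectName", "batchName")

    df.rdd.foreachPartition((rows: Iterator[Row]) => {
      // Created once per partition, on the executor, so the client
      // itself never has to be serialized from the driver.
      val client = new FakeClient
      rows.foreach { row =>
        val name  = row.getAs[String]("s3ObjectName")
        val batch = row.getAs[String]("batchName")
        val body  = client.fetch(name)
        // Placeholder for the write to the data lake.
        println(s"batch=$batch object=$name chars=${body.length}")
      }
    })

    spark.stop()
  }
}
```

Building the client inside the partition function keeps one client per partition rather than one per row, which is the usual reason to reach for foreachPartition() over a plain foreach().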
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// S3 connector details are defined in an object so that they can be serialized and made available on all executors in the cluster
object container {
  def getDataSource() = {
    // Credentials are read from a Databricks secret scope
    val AccessKey = dbutils.secrets.get(scope = "ADBTEL_Scope", key = "Telematics-TrueMotion-AccessKey-ID")
    val SecretKey = dbutils.secrets.get(scope = "ADBTEL_Scope", key = "Telematics-TrueMotion-AccessKey-Secret")
    val creds = new BasicAWSCredentials(AccessKey, SecretKey)
    val clientRegion: Regions = Regions.US_EAST_1
    AmazonS3ClientBuilder.standard()
      .withRegion(clientRegion)
      .withCredentials(new …