Adding partition projection to an AWS Athena table using CloudFormation

sil*_*ger 2 amazon-web-services aws-cloudformation amazon-athena

I have an Athena table defined by a template specified in CloudFormation:

CloudFormation definition

EventsTable:
  Type: AWS::Glue::Table
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: !Ref DatabaseName
    TableInput:
      Description: "My Table"
      Name: !Ref TableName
      TableType: EXTERNAL_TABLE
      StorageDescriptor:
        Compressed: True
        Columns:
          - Name: account_id
            Type: string
            Comment: "Account Id of the account making the request"
            ...
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        SerdeInfo:
          SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
        OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
        Location: !Sub "s3://${EventsBucketName}/events/"



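This works well and deploys without problems. I also found that I can create a partition projection, as described in this document and this document.

I can get this working by creating the table directly, roughly as follows:

SQL definition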

CREATE EXTERNAL TABLE `performance_data.events`
(
  `account_id`  string,
...
)
   PARTITIONED BY (
     `day` string)
    ROW FORMAT SERDE
        'org.openx.data.jsonserde.JsonSerDe'
    STORED AS INPUTFORMAT
        'org.apache.hadoop.mapred.TextInputFormat'
        OUTPUTFORMAT
          'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION
        's3://my-bucket/events/'
    TBLPROPERTIES (
        'has_encrypted_data' = 'false',
        'projection.enabled' = 'true',
        'projection.day.type' = 'date',
        'projection.day.format' = 'yyyy/MM/dd',
        'projection.day.range' = '2020/01/01,NOW',
        'projection.day.interval' = '1',
        'projection.day.interval.unit' = 'DAYS',
        'storage.location.template' = 's3://my-bucket/events/${day}/'
)

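However, I cannot find documentation on how to translate this into the CloudFormation structure. So my question is: how do I implement the partition projection shown in the SQL code in CloudFormation?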

sil*_*ger 6

I now have a working solution. The missing piece was in fact a missing parameter. The solution is as follows:


MyTableResource:
  Type: AWS::Glue::Table
  Properties:
    CatalogId: MyAccountId
    DatabaseName: MyDatabase
    TableInput:
      Description: "My Table"
      Name: mytable
      TableType: EXTERNAL_TABLE
      PartitionKeys:
        - Name: day
          Type: string
          Comment: Day partition
      Parameters:
        "projection.enabled": "true"
        "projection.day.type": "date"
        "projection.day.format": "yyyy/MM/dd"
        "projection.day.range": "2020/01/01,NOW"
        "projection.day.interval": "1"
        "projection.day.interval.unit": "DAYS"
        "storage.location.template": "s3://my-bucket/events/${day}/"
      StorageDescriptor:
        Compressed: True
        Columns:
          ...

        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        SerdeInfo:
          Parameters:
            serialization.format: '1'
          SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
        OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
        Location: "s3://my-bucket/events/"

The key addition is:

serialization.format: '1'

This now works fully, and the table can be queried using the partition.
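One caveat if the bucket name comes from a template parameter, as with EventsBucketName in the question, instead of being hard-coded. Inside !Sub, CloudFormation tries to resolve every ${...} itself, so the ${day} placeholder that Athena needs has to be escaped as ${!day}. The fragment below is a minimal sketch of that variant, not part of the tested solution above. It assumes the EventsBucketName parameter from the question and shows only the two lines that would change:

```yaml
# Fragment of the TableInput above; only the lines that change are shown.
# Assumption: EventsBucketName is a template parameter, as in the question.
TableInput:
  Parameters:
    # ${EventsBucketName} is substituted by CloudFormation at deploy time;
    # ${!day} is passed through as the literal ${day} for Athena to resolve.
    "storage.location.template": !Sub "s3://${EventsBucketName}/events/${!day}/"
  StorageDescriptor:
    Location: !Sub "s3://${EventsBucketName}/events/"
```

With a hard-coded bucket name and no !Sub, as in the solution above, the plain ${day} is already passed through unchanged and no escaping is needed.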