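While building a proof of concept for our new ETL pipeline, I ran into a problem with partition projection in AWS Athena. The following table was created in Glue: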
CREATE EXTERNAL TABLE `test_interactions`(
`id` string,
`created_at` timestamp,
`created_by` string,
`type` string,
`entity` string)
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'projection.dt.format'='yyyy-MM-dd-HH',
'projection.dt.interval'='1',
'projection.dt.interval.unit'='HOURS',
'projection.dt.range'='2020-12-01-00,NOW',
'projection.dt.type'='date',
'projection.enabled'='true',
'storage.location.template'='s3://test-aggs/test-interactions/dt=${dt}')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://test-aggs/test-interactions/'
TBLPROPERTIES (
'classification'='parquet')
On S3, there are matching .parquet files written by Kinesis Data Firehose:
test-aggs/test-interactions/dt=2020-12-03-22/file1.parquet
test-aggs/test-interactions/dt=2020-12-03-22/file2.parquet
Attempting to query the data via:
SELECT * FROM "test_aggs"."test_interactions"
WHERE dt >= '2020-12-02-00'
AND dt < '2020-12-04-01'
returns zero results.
Running
SELECT * FROM "test_aggs"."test_interactions"
WHERE dt = '2020-12-03-22'
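makes the data queryable, but if I have to rely on this slow command, there is no point in enabling partition projection.
Any ideas why this isn't working?
Cheers!
Solved it. The problem was that I had added the projection configuration under WITH SERDEPROPERTIES instead of under TBLPROPERTIES.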
CREATE EXTERNAL TABLE `test_interactions`(
`id` string,
`created_at` timestamp,
`created_by` string,
`type` string,
`entity` string)
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://test-aggs/test-interactions/'
TBLPROPERTIES (
'classification'='parquet',
'projection.dt.format'='yyyy-MM-dd-HH',
'projection.dt.interval'='1',
'projection.dt.interval.unit'='HOURS',
'projection.dt.range'='2020-12-01-00,NOW',
'projection.dt.type'='date',
'projection.enabled'='true',
'storage.location.template'='s3://test-aggs/test-interactions/dt=${dt}')
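If the table already exists, dropping and recreating it may not be necessary. The following is a minimal sketch, not a tested recipe: it assumes the table lives in the test_aggs database used in the queries above, that test_aggs is selected as the current database, and that Athena's ALTER TABLE ... SET TBLPROPERTIES and SHOW TBLPROPERTIES statements are acceptable in your setup. It sets the same projection keys as table properties and then lists what is actually stored on the table. It does not remove the stray keys from the SerDe properties; going by the behaviour described in the question, those appear to have no effect on projection.

```sql
-- Sketch only. Athena runs one statement per query, so run these separately.

-- 1. Put the projection settings where Athena reads them: the table properties.
--    Keys and values are copied from the corrected DDL above.
ALTER TABLE test_interactions SET TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.dt.type'='date',
  'projection.dt.format'='yyyy-MM-dd-HH',
  'projection.dt.range'='2020-12-01-00,NOW',
  'projection.dt.interval'='1',
  'projection.dt.interval.unit'='HOURS',
  'storage.location.template'='s3://test-aggs/test-interactions/dt=${dt}'
);

-- 2. Check that the projection.* keys now show up as table properties.
SHOW TBLPROPERTIES test_interactions;

-- 3. Re-run the range query from the question; with projection active,
--    the dt=2020-12-03-22 partition should be found without loading
--    any partition metadata first.
SELECT dt, count(*) AS row_count
FROM "test_aggs"."test_interactions"
WHERE dt >= '2020-12-02-00'
  AND dt < '2020-12-04-01'
GROUP BY dt
ORDER BY dt;
```

If step 2 lists only 'classification' and no projection.* keys, the settings are still sitting in the wrong place.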