While building a proof of concept for our new ETL pipeline, I ran into a problem with partition projection in AWS Athena. I created the following table in Glue:
CREATE EXTERNAL TABLE `test_interactions`(
`id` string,
`created_at` timestamp,
`created_by` string,
`type` string,
`entity` string)
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'projection.dt.format'='yyyy-MM-dd-HH',
'projection.dt.interval'='1',
'projection.dt.interval.unit'='HOURS',
'projection.dt.range'='2020-12-01-00,NOW',
'projection.dt.type'='date',
'projection.enabled'='true',
'storage.location.template'='s3://test-aggs/test-interactions/dt=${dt}')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://test-aggs/test-interactions/'
TBLPROPERTIES (
'classification'='parquet')
On S3, there are matching .parquet files written by Kinesis Data Firehose:
test-aggs/test-interactions/dt=2020-12-03-22/file1.parquet
test-aggs/test-interactions/dt=2020-12-03-22/file2.parquet
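To see which S3 prefixes the projection settings in the table definition would map to, the expansion can be sketched in JavaScript. This is only an illustration of the idea, not Athena's implementation: formatDt, parseDt and projectedLocations are hypothetical helper names, and the sketch assumes the dt values are interpreted as UTC hours.

```javascript
// Sketch only: mimics how an hourly date projection could expand into S3 prefixes.
// formatDt, parseDt and projectedLocations are hypothetical helpers, not an AWS API.

// Same template as 'storage.location.template' in the table definition.
const TEMPLATE = "s3://test-aggs/test-interactions/dt=${dt}";

// Format a Date as yyyy-MM-dd-HH (assumption: UTC).
function formatDt(date) {
  const pad = (n) => String(n).padStart(2, "0");
  return [
    date.getUTCFullYear(),
    pad(date.getUTCMonth() + 1),
    pad(date.getUTCDate()),
    pad(date.getUTCHours()),
  ].join("-");
}

// Parse a yyyy-MM-dd-HH string into a UTC Date.
function parseDt(text) {
  const [year, month, day, hour] = text.split("-").map(Number);
  return new Date(Date.UTC(year, month - 1, day, hour));
}

// List one location per hour in the half-open range [start, end).
function projectedLocations(start, end, template = TEMPLATE) {
  const hourMs = 60 * 60 * 1000;
  const endMs = parseDt(end).getTime();
  const locations = [];
  for (let t = parseDt(start).getTime(); t < endMs; t += hourMs) {
    locations.push(template.replace("${dt}", formatDt(new Date(t))));
  }
  return locations;
}

const locations = projectedLocations("2020-12-02-00", "2020-12-04-01");
console.log(locations.length);
console.log(locations.includes("s3://test-aggs/test-interactions/dt=2020-12-03-22"));
```

For the range 2020-12-02-00 (inclusive) to 2020-12-04-01 (exclusive), the sketch yields 49 hourly prefixes, including the dt=2020-12-03-22 prefix that holds the files listed above.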
Trying to query the data, for example via
SELECT * FROM "test_aggs"."test_interactions"
WHERE dt >= '2020-12-02-00'
AND dt < '2020-12-04-01'
returns zero results.
Running
SELECT …
Tags: hadoop hive amazon-web-services amazon-athena amazon-kinesis-firehose
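I am trying to deploy a Lambda function to AWS with the Serverless Framework. The deployment itself works, but the function fails at runtime because two files cannot be found (that is what fs.readFileSync reports). I included them in serverless.yml with the following lines: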
provider:
  name: aws
  runtime: nodejs10.x
  stage: dev
  region: eu-central-1
package:
  exclude:
    - .env
  include:
    - src/config/push-cert.pem
    - src/config/push-key.pem
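A quick way to confirm at runtime which of the two files are actually present is to check the resolved paths before reading them. The following is a minimal sketch, not a fix: findMissingFiles and readBundledFile are hypothetical helper names, and using process.cwd() as the base directory is an assumption that may not match where the deployed bundle resolves its files.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: return the relative paths that do not exist under baseDir.
function findMissingFiles(baseDir, relativePaths) {
  return relativePaths.filter(
    (relativePath) => !fs.existsSync(path.join(baseDir, relativePath))
  );
}

// Hypothetical helper: read a file, or fail with the fully resolved path in the
// error message so the logs show where the function really looked.
function readBundledFile(baseDir, relativePath) {
  const fullPath = path.join(baseDir, relativePath);
  if (!fs.existsSync(fullPath)) {
    throw new Error(`File not found in deployment package: ${fullPath}`);
  }
  return fs.readFileSync(fullPath);
}

// The two files listed under package.include in serverless.yml.
const expectedFiles = ["src/config/push-cert.pem", "src/config/push-key.pem"];

// Assumption: the current working directory is the root of the unpacked package.
console.log("missing files:", findMissingFiles(process.cwd(), expectedFiles));
```

Logging the missing paths from inside the handler shows whether the files are absent from the package or merely looked up under a different base directory.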
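When I inspect the .zip file uploaded to S3, neither of the two .pem files is included. I have already tried using __dirname in the Lambda function to build the full file path. My webpack.config.js looks like this: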
const path = require("path");
const nodeExternals = require("webpack-node-externals");
const slsw = require("serverless-webpack");

module.exports = {
  entry: slsw.lib.entries,
  target: "node",
  node: {
    __dirname: true
  },
  mode: slsw.lib.webpack.isLocal ? "development" : "production",
  externals: [nodeExternals()],
  output: {
    libraryTarget: "commonjs",
    // pay attention to this
    path: path.join(__dirname, ".webpack"),
    filename: "[name].js" …
Tags: amazon-web-services webpack serverless-framework serverless aws-serverless