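I have been using AWS Glue for some quick analysis, following the tutorial here.
Although I was able to create the crawler and discover the data in Athena, I ran into a problem with the data types the crawler created: the date and timestamp columns are read as string.
I then created an ETL job in Glue, using the data source created by the crawler as input and a target table in Amazon S3.
As part of the mapping transformation I convert the date and timestamp columns from string to timestamp, but unfortunately the ETL turns the values in these columns into NULLs. I considered using a classifier with a GROK expression, but then decided to convert them as part of the ETL in Glue.
The timestamp format is 1/08/2010 6:15:00 PM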
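One likely reason for the NULLs is that a plain string-to-timestamp cast only understands ISO-style values such as `2010-08-01 18:15:00`, so a value like `1/08/2010 6:15:00 PM` needs an explicit format. The following is a minimal Python sketch of that parsing step, not the Glue job itself. The name `parse_source_timestamp` is hypothetical, and the day-first reading of `1/08/2010` is an assumption, since the sample alone does not say whether it means 1 August or 8 January.

```python
from datetime import datetime
from typing import Optional

# Assumed layout: day/month/year, 12-hour clock with an AM/PM marker.
# Swap %d and %m if the source data turns out to be month-first.
SOURCE_FORMAT = "%d/%m/%Y %I:%M:%S %p"


def parse_source_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse a string such as '1/08/2010 6:15:00 PM'; return None if it does not match."""
    if value is None:
        return None
    try:
        return datetime.strptime(value.strip(), SOURCE_FORMAT)
    except ValueError:
        # Mirror the NULL-on-failure behaviour instead of raising
        return None


if __name__ == "__main__":
    parsed = parse_source_timestamp("1/08/2010 6:15:00 PM")
    # isoformat yields ISO-style text that a plain timestamp cast can usually read
    print(parsed.isoformat(sep=" "))
```

In a Glue job the same idea would sit in a custom step between the crawler-backed source and the S3 target. In PySpark, `to_timestamp` with an explicit pattern plays a comparable role.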

I am looking to load a JSON file containing array items into my database. The JSON file with the array items looks like this:
{
"campaignId": "11067182",
"campaignName": "11067182",
"channelId": "%pxbid_universal_site_id=!;",
"channelName": "%pxbid_universal_site_id=!;",
"placementId": "%epid!",
"placementName": "%epid!",
"publisherId": "%esid!",
"publisherName": "%esid!",
"hitDate": "2017-03-23",
"lowRiskImpressions": "61485",
"lowRiskPct": "64.5295",
"moderateRiskImpressions": "1887",
"moderateRiskPct": "1.9804",
"highRiskImpressions": "43",
"highRiskPct": "0.0451",
"veryHighRiskImpressions": "860",
"veryHighRiskPct": "0.9026",
"totalRated": "95274",
"unrated": "8",
"unratedPct": "0.0084",
"visibleCount": "64283",
"pctVisible": "67.4660",
"invisibleCount": "30999",
"totalImpressions": "95282"
}
{
"campaignId": "11067182",
"campaignName": "11067182",
"channelId": "%pxbid_universal_site_id=!;",
"channelName": "%pxbid_universal_site_id=!;",
"placementId": "%epid!",
"placementName": "%epid!",
"publisherId": "%esid!",
"publisherName": "%esid!",
"hitDate": "2017-03-22",
"lowRiskImpressions": "17929",
"lowRiskPct": "52.9379",
"moderateRiskImpressions": "1872",
"moderateRiskPct": "5.5273",
"highRiskImpressions": …

I am looking to generate a manifest file for Redshift COPY using aws s3api list-objects and jq, as follows:
aws s3api list-objects --bucket annalects3 --prefix "DFA/20160926/394007-OMD-Coles/dcm_account394007_impression" --output json --query '{"entries": Contents[].{"url":"Key"}}' | jq '.entries[].mandatory = true'
It generates the following output:
{ "entries": [
{
"mandatory": true,
"url": "DFA/20160926/394007-OMD-Coles/dcm_account394007_impression_2016092507_20160926_002328_292527438.csv.gz"
},
{
"mandatory": true,
"url": "DFA/20160926/394007-OMD-Coles/dcm_account394007_impression_2016092508_20160926_020131_292592736.csv.gz"
},
{
"mandatory": true,
"url": "DFA/20160926/394007-OMD-Coles/dcm_account394007_impression_2016092509_20160926_030312_292502379.csv.gz"
},
{
"mandatory": true,
"url": "DFA/20160926/394007-OMD-Coles/dcm_account394007_impression_2016092510_20160926_033656_292590227.csv.gz"
}
]
}
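The `url` values in a listing like the one above are bare object keys, whereas a Redshift COPY manifest expects full `s3://bucket/key` URLs. Below is a minimal Python sketch of adding that prefix after the fact, offered as one possible approach and swapped in for a pure jq filter. The function name `prefix_manifest_urls`, the bucket name `mybucket`, and the shortened sample key are all placeholders.

```python
import json


def prefix_manifest_urls(manifest: dict, bucket: str) -> dict:
    """Return a copy of the manifest with each url rewritten as s3://<bucket>/<key>."""
    prefix = f"s3://{bucket}/"
    entries = []
    for entry in manifest.get("entries", []):
        url = entry["url"]
        # Leave entries alone if they already carry the s3:// scheme
        if not url.startswith("s3://"):
            url = prefix + url.lstrip("/")
        entries.append({**entry, "url": url})
    return {"entries": entries}


if __name__ == "__main__":
    # Hypothetical, shortened sample in the same shape as the aws s3api | jq output
    listing = {
        "entries": [
            {"mandatory": True, "url": "DFA/20160926/sample_file.csv.gz"},
        ]
    }
    print(json.dumps(prefix_manifest_urls(listing, "mybucket"), indent=2))
```

The function copies each entry rather than mutating it, so the original listing stays available for comparison.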
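However, the manifest file needs each url entry to be prefixed with the bucket name, which I have not been able to achieve. The output needs to look like: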
{ "entries": [
{
"mandatory": true,
"url": "s3://mybucket/DFA/20160926/394007-OMD-Coles/dcm_account394007_impression_2016092507_20160926_002328_292527438.csv.gz"
},
{
"mandatory": true,
"url": "s3://mybucket/DFA/20160926/394007-OMD-Coles/dcm_account394007_impression_2016092508_20160926_020131_292592736.csv.gz"
},
{
"mandatory": true,
"url": …

I am trying to use jq to form a JSON construct that should ideally look like this:
{
"api_key": "XXXXXXXXXX-7AC9-D655F83B4825",
"app_guid": "XXXXXXXXXXXXXX",
"time_start": 1508677200,
"time_end": 1508763600,
"traffic": [
"event"
],
"traffic_including": [
"unattributed_traffic"
],
"time_zone": "Australia/NSW",
"delivery_format": "csv",
"columns_order": [
"attribution_attribution_action",
"attribution_campaign",
"attribution_campaign_id",
"attribution_creative",
"attribution_date_adjusted",
"attribution_date_utc",
"attribution_matched_by",
"attribution_matched_to",
"attribution_network",
"attribution_network_id",
"attribution_seconds_since",
"attribution_site_id",
"attribution_site_id",
"attribution_tier",
"attribution_timestamp",
"attribution_timestamp_adjusted",
"attribution_tracker",
"attribution_tracker_id",
"attribution_tracker_name",
"count",
"custom_dimensions",
"device_id_adid",
"device_id_android_id",
"device_id_custom",
"device_id_idfa",
"device_id_idfv",
"device_id_kochava",
"device_os",
"device_type",
"device_version",
"dimension_count",
"dimension_data",
"dimension_sum",
"event_name",
"event_time_registered",
"geo_city",
"geo_country",
"geo_lat",
"geo_lon",
"geo_region",
"identity_link",
"install_date_adjusted",
"install_date_utc",
"install_device_version",
"install_devices_adid",
"install_devices_android_id",
"install_devices_custom",
"install_devices_email_0",
"install_devices_email_1",
"install_devices_idfa",
"install_devices_ids",
"install_devices_ip", …