AWS Glue:爬网程序无法识别 CSV 格式的时间戳列

Wil*_*gar 7 aws-glue

运行 AWS Glue 爬网程序时,它无法识别时间戳列。

我在 CSV 文件中正确格式化了 ISO8601 时间戳。首先,我希望 Glue 自动将这些归类为时间戳,但事实并非如此。

我还尝试从此链接https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html使用自定义时间戳分类器

这是我的分类器的样子

神识分类器

这也没有正确分类我的时间戳。

例如,我已将我的数据放入 grok 调试器(https://grokdebug.herokuapp.com/

id,iso_8601_now,iso_8601_yesterday
0,2019-05-16T22:47:33.409056,2019-05-15T22:47:33.409056
1,2019-05-16T22:47:33.409056,2019-05-15T22:47:33.409056
Run Code Online (Sandbox Code Playgroud)

它在两者上都匹配

%{TIMESTAMP_ISO8601:时间戳}

%{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?

import csv
from datetime import datetime, timedelta

with open("timestamp_test.csv", 'w', newline='') as f:
    w = csv.writer(f, delimiter=',')

    w.writerow(["id", "iso_8601_now", "iso_8601_yesterday"])

    for i in range(1000):
        w.writerow([i, datetime.utcnow().isoformat(), (datetime.utcnow() - timedelta(days=1)).isoformat()])
Run Code Online (Sandbox Code Playgroud)

我希望 AWS 胶自动将 iso_8601 列分类为时间戳。即使在添加自定义 grok 分类器时,它仍然不会将任一列分类为时间戳。

两列都被归类为字符串。

分类器在爬虫上处于活动状态 分类器激活

爬虫输出timestamp_test表

{
    "StorageDescriptor": {
        "cols": {
            "FieldSchema": [
                {
                    "name": "id",
                    "type": "bigint",
                    "comment": ""
                },
                {
                    "name": "iso_8601_now",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "iso_8601_yesterday",
                    "type": "string",
                    "comment": ""
                }
            ]
        },
        "location": "s3://REDACTED/_csv_timestamp_test/",
        "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "compressed": "false",
        "numBuckets": "-1",
        "SerDeInfo": {
            "name": "",
            "serializationLib": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            "parameters": {
                "field.delim": ","
            }
        },
        "bucketCols": [],
        "sortCols": [],
        "parameters": {
            "skip.header.line.count": "1",
            "sizeKey": "58926",
            "objectCount": "1",
            "UPDATED_BY_CRAWLER": "REDACTED",
            "CrawlerSchemaSerializerVersion": "1.0",
            "recordCount": "1227",
            "averageRecordSize": "48",
            "CrawlerSchemaDeserializerVersion": "1.0",
            "compressionType": "none",
            "classification": "csv",
            "columnsOrdered": "true",
            "areColumnsQuoted": "false",
            "delimiter": ",",
            "typeOfData": "file"
        },
        "SkewedInfo": {},
        "storedAsSubDirectories": "false"
    },
    "parameters": {
        "skip.header.line.count": "1",
        "sizeKey": "58926",
        "objectCount": "1",
        "UPDATED_BY_CRAWLER": "REDACTED",
        "CrawlerSchemaSerializerVersion": "1.0",
        "recordCount": "1227",
        "averageRecordSize": "48",
        "CrawlerSchemaDeserializerVersion": "1.0",
        "compressionType": "none",
        "classification": "csv",
        "columnsOrdered": "true",
        "areColumnsQuoted": "false",
        "delimiter": ",",
        "typeOfData": "file"
    }
}
Run Code Online (Sandbox Code Playgroud)

ya2*_*410 3

根据CREATE TABLE doc,时间戳格式为yyyy-mm-dd hh:mm:ss[.f...]

如果必须使用ISO8601格式,请添加此Serde参数'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSSSS'

您可以从 Glue(1) 更改该表或从 Athena(2) 重新创建它:

  1. Glue 控制台 > 表格 > 编辑表格 > 将以上内容添加到 Serde 参数中。您还需要单击“编辑架构”并将数据类型从字符串更改为时间戳
  2. 从 Athena 删除表并运行:
CREATE EXTERNAL TABLE `table1`(
  `id` bigint, 
  `iso_8601_now` timestamp, 
  `iso_8601_yesterday` timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ( 
  'field.delim' = ',',
  'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSSSS') 
LOCATION
  's3://REDACTED/_csv_timestamp_test/'
Run Code Online (Sandbox Code Playgroud)

  • 感谢你的回答。我刚刚使用 yyyy-mm-dd hh:mm:ss 中的时间戳尝试了上述测试,Glue Crawler 仍然将我的数据分类为字符串。我需要确保胶水爬虫正确分类时间戳列,而不必手动编辑表定义! (3认同)