How do I mix a record with a map in Avro?

sou*_*ine 9 avro

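I am processing server logs in JSON format, and I want to store them on AWS S3 in Parquet format (and Parquet requires an Avro schema). First, all logs share a common set of fields. Second, all logs have many optional fields that are not in the common set.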

For example, here are three logs:

{ "ip": "172.18.80.109", "timestamp": "2015-09-17T23:00:18.313Z", "message":"blahblahblah"}
{ "ip": "172.18.80.112", "timestamp": "2015-09-17T23:00:08.297Z", "message":"blahblahblah", "microseconds": 223}
{ "ip": "172.18.80.113", "timestamp": "2015-09-17T23:00:08.299Z", "message":"blahblahblah", "thread":"http-apr-8080-exec-1147"}

All three logs have 3 shared fields: ip, timestamp, and message. Some logs have additional fields, such as microseconds and thread.

If I use the following schema, I lose all the additional fields:

{"namespace": "example.avro",
 "type": "record",
 "name": "Log",
 "fields": [
     {"name": "ip", "type": "string"},
     {"name": "timestamp",  "type": "string"},
     {"name": "message", "type": "string"}
 ]
}

The following schema works fine:

{"namespace": "example.avro",
 "type": "record",
 "name": "Log",
 "fields": [
     {"name": "ip", "type": "string"},
     {"name": "timestamp",  "type": "string"},
     {"name": "message", "type": "string"},
     {"name": "microseconds", "type": ["null", "long"]},
     {"name": "thread", "type": ["null", "string"]}
 ]
}

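The only problem is that I don't know the names of all the optional fields unless I scan all the logs. Besides, new additional fields will appear in the future.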

So I came up with the idea of combining record and map:

{"namespace": "example.avro",
 "type": "record",
 "name": "Log",
 "fields": [
     {"name": "ip", "type": "string"},
     {"name": "timestamp",  "type": "string"},
     {"name": "message", "type": "string"},
     {"type": "map", "values": "string"}  // error
 ]
}

Unfortunately, this does not compile:

java -jar avro-tools-1.7.7.jar compile schema example.avro .

It throws an error:

Exception in thread "main" org.apache.avro.SchemaParseException: No field name: {"type":"map","values":"long"}
    at org.apache.avro.Schema.getRequiredText(Schema.java:1305)
    at org.apache.avro.Schema.parse(Schema.java:1192)
    at org.apache.avro.Schema$Parser.parse(Schema.java:965)
    at org.apache.avro.Schema$Parser.parse(Schema.java:932)
    at org.apache.avro.tool.SpecificCompilerTool.run(SpecificCompilerTool.java:73)
    at org.apache.avro.tool.Main.run(Main.java:84)
    at org.apache.avro.tool.Main.main(Main.java:73)

Is there a way to store JSON strings in Avro format that is flexible enough to handle unknown optional fields?

Basically this is a schema evolution problem, which Spark can solve with Schema Merging. I am looking for a solution with Hadoop.

oak*_*kad 13

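The map type is a "complex" type in Avro terminology, so it has to be declared as the type of a named field rather than placed directly in the fields list. The following snippet works: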

{"namespace": "example.avro",
 "type": "record",
 "name": "Log",
 "fields": [
   {"name": "ip", "type": "string"},
   {"name": "timestamp",  "type": "string"},
   {"name": "message", "type": "string"},
   {"name": "additional", "type": {"type": "map", "values": "string"}}
  ]
}
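
To fill such a record, each incoming JSON log has to be split into the three common fields plus everything else under additional. Below is a minimal sketch of that step in plain Python, not tied to any Avro library. The names COMMON_FIELDS and to_log_record are hypothetical, and it assumes every log line carries all three common fields (a missing one raises KeyError). Because the map is declared with "values": "string", non-string values such as microseconds are converted to their JSON text.

```python
import json

# Field names shared by every log line, taken from the schema above.
COMMON_FIELDS = ("ip", "timestamp", "message")


def to_log_record(line):
    """Split one JSON log line into the common fields plus an `additional` map.

    Hypothetical helper: the returned dict mirrors the shape of the Log record.
    """
    raw = json.loads(line)
    record = {name: raw[name] for name in COMMON_FIELDS}
    # The schema declares map values as "string", so strings are kept as-is
    # and any other value is stored as its JSON text.
    record["additional"] = {
        key: value if isinstance(value, str) else json.dumps(value)
        for key, value in raw.items()
        if key not in COMMON_FIELDS
    }
    return record


line = ('{"ip": "172.18.80.112", "timestamp": "2015-09-17T23:00:08.297Z", '
        '"message": "blahblahblah", "microseconds": 223}')
print(to_log_record(line)["additional"])  # {'microseconds': '223'}
```

The trade-off of this layout is that the extra fields lose their original types: a reader gets "223" as a string and has to parse it back if it needs a number.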