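I'm fairly new to Avro and am working through the documentation on nested types. The example below works fine, but many different types in my model will have an address. Is it possible to define an address.avsc file and reference it as a nested type? If so, can you take it a step further and give the Customer a list of addresses? Thanks in advance.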
{"namespace": "com.company.model",
 "type": "record",
 "name": "Customer",
 "fields": [
     {"name": "firstname", "type": "string"},
     {"name": "lastname", "type": "string"},
     {"name": "email", "type": "string"},
     {"name": "phone", "type": "string"},
     {"name": "address", "type":
         {"type": "record",
          "name": "AddressRecord",
          "fields": [
              {"name": "streetaddress", "type": "string"},
              {"name": "city", "type": "string"},
              {"name": "state", "type": "string"},
              {"name": "zip", "type": "string"}
          ]}
     }
 ]
}
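For illustration, here is a rough sketch of what such a split might look like, written in Python with only the standard library. It relies on two things from the Avro specification: a named type can be referenced by its full name once it has been defined, and a list is expressed as `{"type": "array", "items": ...}`. The file split, the `addresses` field name, and the `inline_named_types` helper are all assumptions made for this example and are not part of any Avro library. Whether references across separate .avsc files are resolved for you depends on the tooling in use, so the helper simply merges everything into one self-contained schema.

```python
import copy
import json

# Hypothetical contents of address.avsc: the address record as its own named type.
ADDRESS_SCHEMA = {
    "namespace": "com.company.model",
    "type": "record",
    "name": "AddressRecord",
    "fields": [
        {"name": "streetaddress", "type": "string"},
        {"name": "city", "type": "string"},
        {"name": "state", "type": "string"},
        {"name": "zip", "type": "string"},
    ],
}

# Hypothetical customer.avsc: refers to the address type by its full name,
# wrapped in an Avro array so one customer can carry several addresses.
CUSTOMER_SCHEMA = {
    "namespace": "com.company.model",
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "firstname", "type": "string"},
        {"name": "lastname", "type": "string"},
        {"name": "email", "type": "string"},
        {"name": "phone", "type": "string"},
        {"name": "addresses",
         "type": {"type": "array",
                  "items": "com.company.model.AddressRecord"}},
    ],
}


def inline_named_types(schema, known, seen=None):
    """Return a copy of `schema` in which the first reference to each name
    in `known` is replaced by its full definition. Later references stay as
    names, because Avro allows a named type to be defined only once."""
    if seen is None:
        seen = set()
    if isinstance(schema, str):
        if schema in known and schema not in seen:
            seen.add(schema)
            return inline_named_types(copy.deepcopy(known[schema]), known, seen)
        return schema
    if isinstance(schema, list):  # union of types
        return [inline_named_types(s, known, seen) for s in schema]
    if isinstance(schema, dict):
        out = dict(schema)
        for key in ("type", "items", "values"):
            if key in out:
                out[key] = inline_named_types(out[key], known, seen)
        if "fields" in out:
            out["fields"] = [
                {**f, "type": inline_named_types(f["type"], known, seen)}
                for f in out["fields"]
            ]
        return out
    return schema


known_types = {"com.company.model.AddressRecord": ADDRESS_SCHEMA}
resolved = inline_named_types(CUSTOMER_SCHEMA, known_types)
print(json.dumps(resolved, indent=2))
```

The printed result is a single schema document that any Avro parser should be able to read without knowing about the second file; if your tooling can parse address.avsc first and then customer.avsc, the merging step may be unnecessary.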
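I am using Tesseract OCR to convert scanned PDFs to plain text. Overall it works very well, but I have a problem with the order in which the text is scanned. Documents with tabular data seem to be scanned column by column, whereas scanning row by row would seem the more natural way. A very small example: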
This is column A, row 1 This is column B, row 1 This is column C, row 1
This is column A, row 2 This is column B, row 2 This is column C, row 2
produces the following text:
This is column A, row 1
This is column A, row 2
This is column B, row 1
This is column B, row 2
This is column C, row 1
This is column C, row 2
I have started reading the documentation and doing some guess-and-test, as documented here …