Posts by der*_*rdc

Avro schema definition: nesting types

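I'm fairly new to Avro and am working through the documentation on nested types. The example below works well, but many different types in the model will have an address. Is it possible to define an address.avsc file and reference it as a nested type? If so, can you take it a step further and give the Customer a list of addresses? Thanks in advance.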

{"namespace": "com.company.model",
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "firstname", "type": "string"},
    {"name": "lastname", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "phone", "type": "string"},
    {"name": "address", "type":
      {"type": "record",
       "name": "AddressRecord",
       "fields": [
         {"name": "streetaddress", "type": "string"},
         {"name": "city", "type": "string"},
         {"name": "state", "type": "string"},
         {"name": "zip", "type": "string"}
       ]}
    }
  ]
}
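
One way to picture the reuse asked about above is the sketch below. It is not a definitive answer: how a separate address.avsc file gets resolved depends on the tooling, which the post does not specify, so the file's contents are represented here as a plain Python dict. The sketch relies on two features of the Avro schema language: a named record can be referred to by its full name once it has been defined, and a list is expressed as an array type with an "items" entry. The names `ADDRESS_SCHEMA`, `build_customer_schema`, and the `addresses` field are hypothetical and not from the original post.

```python
import json

# Hypothetical contents of a shared address.avsc file
# (field names taken from the question's AddressRecord).
ADDRESS_SCHEMA = {
    "namespace": "com.company.model",
    "type": "record",
    "name": "AddressRecord",
    "fields": [
        {"name": "streetaddress", "type": "string"},
        {"name": "city", "type": "string"},
        {"name": "state", "type": "string"},
        {"name": "zip", "type": "string"},
    ],
}


def build_customer_schema(address_schema):
    """Return a Customer schema that uses the address type twice."""
    return {
        "namespace": "com.company.model",
        "type": "record",
        "name": "Customer",
        "fields": [
            {"name": "firstname", "type": "string"},
            {"name": "lastname", "type": "string"},
            {"name": "email", "type": "string"},
            {"name": "phone", "type": "string"},
            # First use: the full record definition is embedded once.
            {"name": "address", "type": address_schema},
            # Later use: refer to the already-defined record by its full name,
            # wrapped in an array to model a list of addresses.
            {
                "name": "addresses",
                "type": {
                    "type": "array",
                    "items": "com.company.model.AddressRecord",
                },
            },
        ],
    }


customer_schema = build_customer_schema(ADDRESS_SCHEMA)
print(json.dumps(customer_schema, indent=2))
```

The printed JSON can be saved as a single .avsc file. Keeping the address definition in one place and composing it at build time is only one possible design; some Avro tools can also resolve types across several schema files, but that is tool-specific and not shown here.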

avro

8
Votes
1
Answers
10k
Views

Tesseract OCR text order for documents with tables or rows

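I use Tesseract OCR to convert scanned PDFs to plain text. Overall it works very well, but I have a problem with the order of the scanned text. Documents with tabular data seem to be read column by column, whereas row by row would seem the more natural order. A very small example: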

This is column A, row 1   This is column B, row 1    This is column C, row 1
This is column A, row 2   This is column B, row 2    This is column C, row 2

produces the following text:

This is column A, row 1
This is column A, row 2
This is column B, row 1
This is column B, row 2
This is column C, row 1
This is column C, row 2

I started reading the documentation and doing guess-and-test, documented here …
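
To make the desired row-by-row order concrete, here is a minimal post-processing sketch. It assumes the OCR step can report a bounding box for each recognized piece of text (for example via Tesseract's TSV output, which includes left, top, width, and height columns); it does not change how Tesseract itself segments the page. The function name `rows_from_boxes`, the `tolerance` parameter, and the sample coordinates are all made up for illustration.

```python
from typing import List, Tuple

# (left, top, height, text) for one recognized cell or word.
Box = Tuple[int, int, int, str]


def rows_from_boxes(boxes: List[Box], tolerance: float = 0.5) -> List[str]:
    """Group boxes into rows by vertical position, then read each row left to right."""
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[1]):  # top to bottom
        _, top, height, _ = box
        # Same row if the top edge lies within `tolerance` box-heights
        # of the first box already placed in the current row.
        if rows and abs(top - rows[-1][0][1]) <= tolerance * height:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Within each row, order the boxes by their left edge.
    return ["   ".join(b[3] for b in sorted(row, key=lambda b: b[0])) for row in rows]


# Made-up coordinates, listed in the column-by-column order from the question.
sample = [
    (10, 10, 12, "This is column A, row 1"),
    (10, 30, 12, "This is column A, row 2"),
    (200, 11, 12, "This is column B, row 1"),
    (200, 31, 12, "This is column B, row 2"),
    (400, 9, 12, "This is column C, row 1"),
    (400, 29, 12, "This is column C, row 2"),
]

for line in rows_from_boxes(sample):
    print(line)
```

With the sample data this prints two lines, each holding columns A, B, and C of one row. The fixed tolerance is a simplification; skewed scans or rows of very different heights would need a more careful grouping rule.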

ocr tesseract

6
Votes
1
Answers
10k
Views

Tag statistics

avro ×1

ocr ×1

tesseract ×1