Apache Beam BigQueryIO writes are slow

jim*_*mmy 9 google-bigquery apache-beam

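My Beam pipeline writes to an unpartitioned BigQuery target table. The PCollection consists of millions of TableRows. When I run it with the DirectRunner, BigQueryIO apparently first creates a temporary file for every single record in the BigQueryWriteTemp temp folder. This obviously performs poorly. Am I doing something wrong here? This is a regular batch job, not a streaming one. (The same job run with the DataflowRunner does not seem to do this.)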

myrows.apply("WriteToBigQuery",
        BigQueryIO.writeTableRows().to(BQ_TARGET_TABLE)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

This is the log we see. Each of these files contains a single TableRow. The same job on the DataflowRunner seems to create only about 3 files.

2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/59668b03-a1e8-4288-a049-3472e7cb6333.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/feeb454b-799e-4d77-bd12-dec313cdadc2.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/3c63db33-787f-4215-a425-3446d92157ed.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/87d55556-e012-4bef-8856-69efd4c5ab26.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/5e6bfe94-b1c9-49cb-b0c7-a768d78d85f3.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/b236948b-bdf0-4bfe-9d26-4e67c8904320.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/451abb93-e02a-4210-aa46-5afa0c82547d.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/60fd5ecc-8dbe-46e4-884d-3767694b009f.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/f3a5b4e0-e956-4a41-a78d-c7694950b6f1.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/a4e4c74f-d12c-495d-bf28-eb20ee25f086.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/eb3b29e1-cc0c-4a6d-82f4-8527d0c5a51e.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/916ac41b-4ece-42bb-bf24-c5ca17060d1d.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/5b76128f-3c66-4701-92ce-2d3ba2e91f65.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/3a0ae709-756e-452c-9b0f-6efa9c0864ca.

小智 1

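The DirectRunner is intended for testing and development, and it includes extra checks to make sure the pipeline will also run correctly on other runners. A side effect of these checks is reduced performance.

These are the extra checks: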

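  • Enforcing immutability of elements
  • Enforcing encodability of elements
  • Processing elements in an arbitrary order at all points
  • Serialization of user functions (DoFn, CombineFn, etc.)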
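
To get a feel for why per-element checks like these cost time, here is a rough conceptual sketch in plain Java. It is not the DirectRunner's actual implementation: the class and method names (EnforcementSketch, survivesRoundTrip, leavesInputUnchanged) are made up for illustration, and Java serialization stands in for a Beam Coder as an assumption. The point is only that each element gets encoded (and possibly decoded) extra times, which adds up over millions of rows.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.function.Consumer;

// Hypothetical illustration only; none of these names come from Beam.
class EnforcementSketch {

    // Assumption: Java serialization plays the role of a Beam Coder here.
    static byte[] encode(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Object decode(byte[] encoded) {
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(encoded))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // "Encodability" idea: the element must survive an encode/decode round trip.
    static boolean survivesRoundTrip(Serializable element) {
        try {
            return element.equals(decode(encode(element)));
        } catch (RuntimeException e) {
            return false;
        }
    }

    // "Immutability" idea: the encoded form of the input must be identical
    // before and after the user function has seen it.
    static <T extends Serializable> boolean leavesInputUnchanged(
            T element, Consumer<T> userFn) {
        byte[] before = encode(element);
        userFn.accept(element);
        byte[] after = encode(element);
        return Arrays.equals(before, after);
    }

    public static void main(String[] args) {
        ArrayList<String> row = new ArrayList<>(Arrays.asList("id=1", "name=a"));
        System.out.println(survivesRoundTrip(row));                          // true
        System.out.println(leavesInputUnchanged(row, r -> r.size()));        // true
        System.out.println(leavesInputUnchanged(row, r -> r.add("oops")));   // false
    }
}
```

In this sketch every element is serialized at least twice on top of the real work, which is cheap for a unit test with a handful of rows and expensive for a batch of millions.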