如何将 PCollection 转换为 python 数据流中的列表

Hac*_*ker 2 google-bigquery google-cloud-dataflow apache-beam

我有一个P1包含 ID 字段的PCollection 。我想从 PCollection 中获取完整的 ID 列作为列表,并将此值传递给 BigQuery 查询以过滤一个 BigQuery 表。

这样做的最快和最优化的方法是什么?

我是 Dataflow 和 BigData 的新手。任何人都可以对此提供一些提示吗?

谢谢!

Wil*_*uks 6

For what I understood from your question you want to build the SQL statement given the IDs you have in P1. This is one example of how you can achieve this:

sql = """select ID from `table` WHERE ID IN ({})"""
with beam.Pipeline(options=StandardOptions()) as p:
         (p | 'Create' >> beam.Create(['1', '2', '3']) 
            | 'Combine' >> beam.combiners.ToList()
            | 'Build SQL' >> beam.Map(lambda x: sql.format(','.join(map(lambda x: '"' + x + '"', x))))
            | 'Save' >> beam.io.WriteToText('results.csv'))
Run Code Online (Sandbox Code Playgroud)

Results:

select ID from `table` WHERE ID IN ("1","2","3")
Run Code Online (Sandbox Code Playgroud)

The operation beam.combiners.ToList() transforms your whole PCollection data into a single list (which I used later on to inject in the SQL placeholder).

You can now use the SQL in the file results.csv-00000-to-000001 to run this query against BQ.

I'm not sure if it's possible to run this query directly in the PCollection though (something like (p | all transformations | beam.io.Write(beam.io.BigQuerySink(result sql)) ). I suppose reading from the end result file and then issuing the query against BQ would be the best approach here.