Hac*_*ker 2 google-bigquery google-cloud-dataflow apache-beam
我有一个P1
包含 ID 字段的PCollection 。我想从 PCollection 中获取完整的 ID 列作为列表,并将此值传递给 BigQuery 查询以过滤一个 BigQuery 表。
这样做的最快和最优化的方法是什么?
我是 Dataflow 和 BigData 的新手。任何人都可以对此提供一些提示吗?
谢谢!
For what I understood from your question you want to build the SQL statement given the IDs you have in P1
. This is one example of how you can achieve this:
sql = """select ID from `table` WHERE ID IN ({})"""
with beam.Pipeline(options=StandardOptions()) as p:
(p | 'Create' >> beam.Create(['1', '2', '3'])
| 'Combine' >> beam.combiners.ToList()
| 'Build SQL' >> beam.Map(lambda x: sql.format(','.join(map(lambda x: '"' + x + '"', x))))
| 'Save' >> beam.io.WriteToText('results.csv'))
Run Code Online (Sandbox Code Playgroud)
Results:
select ID from `table` WHERE ID IN ("1","2","3")
Run Code Online (Sandbox Code Playgroud)
The operation beam.combiners.ToList()
transforms your whole PCollection data into a single list (which I used later on to inject in the SQL placeholder).
You can now use the SQL in the file results.csv-00000-to-000001
to run this query against BQ.
I'm not sure if it's possible to run this query directly in the PCollection though (something like (p | all transformations | beam.io.Write(beam.io.BigQuerySink(result sql))
). I suppose reading from the end result file and then issuing the query against BQ would be the best approach here.
归档时间: |
|
查看次数: |
4088 次 |
最近记录: |