I am writing a simple Beam job to copy data from a GCS bucket to BigQuery. The code looks like this:
import sys

import apache_beam as beam
from apache_beam.options.pipeline_options import GoogleCloudOptions

# PROJECT_ID, JOB_NAME and BUCKET are not defined in this snippet.
pipeline_options = GoogleCloudOptions(flags=sys.argv[1:])
pipeline_options.project = PROJECT_ID
pipeline_options.region = 'us-west1'
pipeline_options.job_name = JOB_NAME
pipeline_options.staging_location = BUCKET + '/binaries'
pipeline_options.temp_location = BUCKET + '/temp'

schema = 'id:INTEGER,region:STRING,population:INTEGER,sex:STRING,age:INTEGER,education:STRING,income:FLOAT,statusquo:FLOAT,vote:STRING'

# Read the CSV as lines of text and write them to the BigQuery table.
p = (beam.Pipeline(options=pipeline_options)
     | 'ReadFromGCS' >> beam.io.textio.ReadFromText('Chile.csv')
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('project:tmp.dummy', schema=schema))
Here we are writing to the table tmp.dummy in the project named project. This results in the following stack trace:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 151, in _run_module_as_main
    mod_name, loader, code, fname = _get_module_details(mod_name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line …
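
Separately from the trace above, one thing worth noting about the pipeline as written: ReadFromText emits each line as a plain string, while WriteToBigQuery expects each element to be a dictionary keyed by column name. Below is a minimal sketch of a conversion step that uses the schema string from the question. The helper names (parse_schema, csv_line_to_row), the CONVERTERS table, the sample line, and the treatment of empty cells and 'NA' as missing values are all assumptions made for illustration; they do not come from the original code or from Beam.

```python
import csv

# Schema string copied from the pipeline in the question.
SCHEMA = ('id:INTEGER,region:STRING,population:INTEGER,sex:STRING,'
          'age:INTEGER,education:STRING,income:FLOAT,statusquo:FLOAT,'
          'vote:STRING')

# Hypothetical mapping from BigQuery type names to Python converters.
CONVERTERS = {'INTEGER': int, 'FLOAT': float, 'STRING': str}


def parse_schema(schema):
    """Split 'name:TYPE,...' into a list of (name, TYPE) pairs."""
    return [tuple(field.split(':')) for field in schema.split(',')]


def csv_line_to_row(line, fields):
    """Convert one CSV line into a dict keyed by column name.

    Assumption: empty cells and the literal 'NA' mean a missing value.
    """
    values = next(csv.reader([line]))
    row = {}
    for (name, type_name), value in zip(fields, values):
        if value in ('', 'NA'):
            row[name] = None
        else:
            row[name] = CONVERTERS[type_name](value)
    return row


fields = parse_schema(SCHEMA)
# Made-up sample line, shaped like the schema above.
example = csv_line_to_row('1,N,175000,M,65,P,35000,1.0082,Y', fields)
```

In the pipeline, a step such as beam.Map(csv_line_to_row, fields) could sit between the read and the write. If Chile.csv starts with a header row, passing skip_header_lines=1 to ReadFromText would keep that row out of the conversion.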