How to write array of string values from Pandas to Google Big Query

obk*_*mrk 6 python arrays python-3.x pandas google-bigquery

I'm currently trying to write a Pandas DataFrame (Python 3.x) to Google BigQuery. The table has a column of dtype object that contains an array of string values.

[image: sample of pandas table]

I aim to create a BQ table that maintains a nested table structure as below:

[image: sample of Big Query table]

with the following schema:

[image: schema of Big Query table]

I'm using the google-cloud-bigquery library, as it converts the DataFrame to the Parquet format, which per the documentation supports nested array values:

code used:

from google.cloud import bigquery

client = bigquery.Client()
table_id = 'dataset.table'

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField('route_id', 'INTEGER'),
        bigquery.SchemaField('types', 'STRING', mode='REPEATED'),
    ],
    # LoadJobConfig properties use snake_case names.
    write_disposition="WRITE_APPEND",
)

# df is the DataFrame described above.
job = client.load_table_from_dataframe(
    df,
    table_id,
    job_config=job_config,
)

# Wait for the load job to complete.
job.result()

but unfortunately I'm getting the following error message:

BadRequest: 400 Error while reading data, error message: Provided schema is not compatible with the file 'prod-scotty-76a528bc-407d-4224-8951-c8ff0c71faa1'. Field 'types' is specified as REPEATED in provided schema which does not match NULLABLE as specified in the file.

What has been tried so far:

  1. used RECORD field type

but that caused the following error: https://github.com/googleapis/python-bigquery/issues/21

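  2. not passing any schema from Python at all (and letting Python/BQ sort it out themselves)

Surprisingly, this works for the first iteration (CREATE_IF_NEEDED), creating a table in BQ that maintains the nested structure with the following schema applied automatically:

[image: auto-applied schema of BQ table]

but if you try to append again, even with the exact same table, it fails with the same error as in attempt 1.

Any suggestions or tips?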

Alb*_*esa 0

It seems there is a mismatch that has not been resolved yet.

I have been able to correctly upload a dataframe with arrays using the open-source library pandas-gbq:

import pandas as pd
import pandas_gbq

d = {'nested_string': [['hi', 'keloke'], ['io', 'ready']], 'route_id': [83833, 4487]}
df = pd.DataFrame(data=d)

table_id = "dataset.table"
project_id = 'my_project'

pandas_gbq.to_gbq(
    df, table_id, project_id=project_id, if_exists='replace',
)

Other possible workarounds without third-party tools:

· Use Dataflow instead

· From a Python file, save the dataframe in CSV format in a Google Cloud Storage bucket and load it from BigQuery
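A note on the file-based workaround: BigQuery's CSV loader does not accept nested or repeated fields, so an array column would probably need a different file format. Below is a minimal sketch that uses newline-delimited JSON instead. The helper names `rows_to_ndjson` and `load_ndjson` are hypothetical, the schema is copied from the question, and the rows are assumed to look like the output of `df.to_dict(orient="records")` with plain Python lists in the array column. The BigQuery call is untested here and needs google-cloud-bigquery and valid credentials.

```python
import io
import json


def rows_to_ndjson(rows):
    """Serialize an iterable of dict rows to newline-delimited JSON text."""
    return "".join(json.dumps(row) + "\n" for row in rows)


def load_ndjson(client, table_id, ndjson_text):
    """Append NDJSON rows to a table (hypothetical helper, not run here)."""
    # Third-party import kept local so the serializer above works without it.
    from google.cloud import bigquery

    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("route_id", "INTEGER"),
            bigquery.SchemaField("types", "STRING", mode="REPEATED"),
        ],
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    data = io.BytesIO(ndjson_text.encode("utf-8"))
    job = client.load_table_from_file(data, table_id, job_config=job_config)
    return job.result()  # Wait for the load job to complete.


# Illustrative rows, shaped like df.to_dict(orient="records").
rows = [
    {"route_id": 83833, "types": ["hi", "keloke"]},
    {"route_id": 4487, "types": ["io", "ready"]},
]
ndjson_text = rows_to_ndjson(rows)
```

With a real client this would be called as `load_ndjson(bigquery.Client(), "dataset.table", ndjson_text)`. If the DataFrame cells hold NumPy arrays rather than lists, they would need converting (for example with `.tolist()`) before `json.dumps` can serialize them.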

Do you think any of these would work for you?