Can you use Python and load_table_from_json() to load JSON-formatted data into a BigQuery table?

Question

Fab*_*Dot 2 json python-3.x pandas google-bigquery

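Problem:

I am trying to load a JSON object into a BigQuery table using Python 3.7.

From reading the Google documentation, the google-cloud-bigquery module appears to have a method that does what I want: load_table_from_json(). However, when I try to use this method in a Python script, I get the following error in the Python shell: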

BadRequest: 400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.

When I look at the job history in BigQuery, I find some additional information:

Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 1; errors: 1; max bad: 0; error percent: 0

and

Error while reading data, error message: JSON parsing error in row starting at position 0: Value encountered without start of object.

Here is the Python script I am running:

import pandas as pd
import numpy as np
from google.cloud import bigquery
import os

### Converts schema dictionary to BigQuery's expected format for job_config.schema
def format_schema(schema):
    formatted_schema = []
    for row in schema:
        formatted_schema.append(bigquery.SchemaField(row['name'], row['type'], row['mode']))
    return formatted_schema

### Create dummy data to load
df = pd.DataFrame([[2, 'Jane', 'Doe']],
columns=['id', 'first_name', 'last_name'])

### Convert dataframe to JSON object
json_data = df.to_json(orient = 'records')

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"<My_Credentials_Path>\application_default_credentials.json"

### Define schema as on BigQuery table, i.e. the fields id, first_name and last_name   
table_schema = {
          'name': 'id',
          'type': 'INTEGER',
          'mode': 'REQUIRED'
          }, {
          'name': 'first_name',
          'type': 'STRING',
          'mode': 'NULLABLE'
          }, {
          'name': 'last_name',
          'type': 'STRING',
          'mode': 'NULLABLE'
          }

project_id = '<my_project>'
dataset_id = '<my_dataset>'
table_id = '<my_table>'

client  = bigquery.Client(project = project_id)
dataset  = client.dataset(dataset_id)
table = dataset.table(table_id)

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.schema = format_schema(table_schema)
job = client.load_table_from_json(json_data, table, job_config = job_config)

print(job.result())

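As far as I can tell from the documentation, this should work, but it doesn't.

I suspect the problem is with the JSON object json_data, and that BigQuery may not like something about its value: [{"id":2,"first_name":"Jane","last_name":"Doe"}]. Passing the lines parameter with a value of True makes no difference either; in that case the JSON object simply has no square brackets: {"id":2,"first_name":"Jane","last_name":"Doe"}

I also tried the alternative values for the orient parameter described in the to_json() documentation, but they all produce the same error as the records value above.

I also tried commenting out the line job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON, but got the same error.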
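
One way to narrow this down is to look at what json_data actually is. The sketch below uses only the standard library and the value reported above. How load_table_from_json() serializes its input internally is an assumption here: serialize_rows is a hypothetical stand-in that calls json.dumps() once per element and joins the results with newlines. Under that assumption, a str input is walked character by character, so the first "row" is a bare JSON string rather than an object, which would be consistent with "Value encountered without start of object".

```python
import json

# The value the question reports for json_data: a single str, not a list of rows.
json_data = '[{"id":2,"first_name":"Jane","last_name":"Doe"}]'


def serialize_rows(rows):
    """Hypothetical stand-in for a client: one json.dumps() call per element."""
    return "\n".join(json.dumps(row) for row in rows)


# Iterating over a str yields single characters, so every "row" is a bare string.
from_string = serialize_rows(json_data)
first_line = from_string.splitlines()[0]

print(type(json_data).__name__)  # str
print(first_line)                # "["
```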

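Workaround

Interestingly, the load_table_from_json() documentation contains a note stating:

If your data is already a newline-delimited JSON string, it is best to wrap it into a file-like object and pass it to load_table_from_file():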

import io
from google.cloud import bigquery

data = u'{"foo": "bar"}'
data_as_file = io.StringIO(data)

client = bigquery.Client()
client.load_table_from_file(data_as_file, ...)

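If I apply this to my script and try to load the data, everything works. This suggests that the connection to BigQuery is fine and that the problem really does lie in loading the JSON in its original form. There is no mention of load_table_from_json() being deprecated in favor of load_table_from_file(), so why doesn't it work?

Working code:

For reference, here is the version of the script that uses the load_table_from_file() method to load the data into BigQuery: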

import pandas as pd
import numpy as np
from google.cloud import bigquery
import os
import io

def format_schema(schema):
    formatted_schema = []
    for row in schema:
        formatted_schema.append(bigquery.SchemaField(row['name'], row['type'], row['mode']))   
    return formatted_schema

df = pd.DataFrame([[2, 'Jane', 'Doe']],
columns=['id', 'first_name', 'last_name'])

### Additional parameter used to convert to newline delimited format
json_data = df.to_json(orient = 'records', lines = True)
stringio_data = io.StringIO(json_data)

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"<My_Credentials_Path>\application_default_credentials.json"

table_schema = {
          'name': 'id',
          'type': 'INTEGER',
          'mode': 'REQUIRED'
          }, {
          'name': 'first_name',
          'type': 'STRING',
          'mode': 'NULLABLE'
          }, {
          'name': 'last_name',
          'type': 'STRING',
          'mode': 'NULLABLE'
          }

project_id = '<my_project>'
dataset_id = '<my_dataset>'
table_id = '<my_table>'

client = bigquery.Client(project = project_id)
dataset = client.dataset(dataset_id)
table = dataset.table(table_id)

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.schema = format_schema(table_schema)

job = client.load_table_from_file(stringio_data, table, job_config = job_config)
print(job.result())

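For readers unsure what "newline-delimited" means once there is more than one row, here is a minimal standard-library sketch. The second row is made up for illustration, and the string is built with json.dumps() rather than pandas, so spacing may differ slightly from what to_json(lines=True) emits. Each line is a complete JSON object, with no enclosing brackets and no commas between rows.

```python
import io
import json

rows = [
    {"id": 2, "first_name": "Jane", "last_name": "Doe"},
    {"id": 3, "first_name": "John", "last_name": "Doe"},  # hypothetical extra row
]

# One JSON object per line: no surrounding [ ] and no commas between rows.
ndjson = "\n".join(json.dumps(row) for row in rows)

# Wrap the text in a file-like object, as the documentation note suggests.
file_like = io.StringIO(ndjson)
lines = [line.rstrip("\n") for line in file_like]
file_like.seek(0)  # rewind so a later reader starts from the beginning

for line in lines:
    print(line)
```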

Answer 1

rme*_*ves 5

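The function client.load_table_from_json expects JSON-compatible Python objects (an iterable of dictionaries, one per row), not a string. To fix it, you can do the following: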

import json

After creating the JSON string from Pandas, you should do the following:

json_object = json.loads(json_data)

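As a small guard around that conversion, the sketch below normalizes either a JSON string or already-parsed data into a list of row dictionaries and rejects anything else. The helper name as_json_rows is hypothetical and not part of google-cloud-bigquery or pandas; it only illustrates the shape of data this answer passes on.

```python
import json


def as_json_rows(data):
    """Return a list of row dicts, parsing `data` first if it is a JSON string."""
    if isinstance(data, str):
        data = json.loads(data)
    if isinstance(data, dict):
        data = [data]  # a single object becomes a one-row list
    rows = list(data)
    if not all(isinstance(row, dict) for row in rows):
        raise TypeError("every row must be a dict (a JSON object)")
    return rows


json_data = '[{"id":2,"first_name":"Jane","last_name":"Doe"}]'
json_object = as_json_rows(json_data)
print(json_object)
```

The same helper also accepts the single-object string produced with lines=True for a one-row frame, since a lone dictionary is wrapped in a list.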

Finally, you should pass your JSON object:

job = client.load_table_from_json(json_object, table, job_config = job_config)

So your code would look like this:

import pandas as pd
import numpy as np
from google.cloud import bigquery
import os, json

### Converts schema dictionary to BigQuery's expected format for job_config.schema
def format_schema(schema):
    formatted_schema = []
    for row in schema:
        formatted_schema.append(bigquery.SchemaField(row['name'], row['type'], row['mode']))
    return formatted_schema

### Create dummy data to load
df = pd.DataFrame([[2, 'Jane', 'Doe']],
columns=['id', 'first_name', 'last_name'])

### Convert dataframe to JSON object
json_data = df.to_json(orient = 'records')
json_object = json.loads(json_data)

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"<My_Credentials_Path>\application_default_credentials.json"

### Define schema as on BigQuery table, i.e. the fields id, first_name and last_name   
table_schema = {
          'name': 'id',
          'type': 'INTEGER',
          'mode': 'REQUIRED'
          }, {
          'name': 'first_name',
          'type': 'STRING',
          'mode': 'NULLABLE'
          }, {
          'name': 'last_name',
          'type': 'STRING',
          'mode': 'NULLABLE'
          }

project_id = '<my_project>'
dataset_id = '<my_dataset>'
table_id = '<my_table>'

client  = bigquery.Client(project = project_id)
dataset  = client.dataset(dataset_id)
table = dataset.table(table_id)

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.schema = format_schema(table_schema)
job = client.load_table_from_json(json_object, table, job_config = job_config)

print(job.result())

Please let me know if this helps.
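
A possible variation, not part of the original answer: pandas can produce the list of row dictionaries directly with to_dict(orient="records"), which skips the round trip through a JSON string. This is only a sketch. Whether every value is JSON-serializable depends on the column dtypes and the pandas version (timestamps, for instance, are not converted the way to_json() converts them), and the result has not been checked against load_table_from_json() here.

```python
import json

import pandas as pd

df = pd.DataFrame([[2, 'Jane', 'Doe']],
                  columns=['id', 'first_name', 'last_name'])

# Route used in the answer: DataFrame -> JSON string -> list of dicts.
via_json = json.loads(df.to_json(orient='records'))

# Direct route: DataFrame -> list of dicts, with no intermediate string.
rows = df.to_dict(orient='records')

print(rows)
print(rows == via_json)
```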

Archived:	6 years ago
Views:	9082
Last activity:	6 years ago