Fab*_*Dot 2 json python-3.x pandas google-bigquery
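I'm trying to load a JSON object into a BigQuery table using Python 3.7.
From reading the Google documentation, the google-cloud-bigquery module looks like it has a method that does what I want: load_table_from_json(). However, when I try to use this method in a Python script, I get the following error in the Python shell: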
BadRequest: 400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
When I look at the job history in BigQuery, I get some additional information:
Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 1; errors: 1; max bad: 0; error percent: 0
and
Error while reading data, error message: JSON parsing error in row starting at position 0: Value encountered without start of object.
Here is the Python script I'm running:
import pandas as pd
import numpy as np
from google.cloud import bigquery
import os
### Converts schema dictionary to BigQuery's expected format for job_config.schema
def format_schema(schema):
    formatted_schema = []
    for row in schema:
        formatted_schema.append(bigquery.SchemaField(row['name'], row['type'], row['mode']))
    return formatted_schema
### Create dummy data to load
df = pd.DataFrame([[2, 'Jane', 'Doe']],
                  columns=['id', 'first_name', 'last_name'])
### Convert dataframe to JSON object
json_data = df.to_json(orient = 'records')
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"<My_Credentials_Path>\application_default_credentials.json"
### Define schema as on BigQuery table, i.e. the fields id, first_name and last_name
table_schema = {
    'name': 'id',
    'type': 'INTEGER',
    'mode': 'REQUIRED'
}, {
    'name': 'first_name',
    'type': 'STRING',
    'mode': 'NULLABLE'
}, {
    'name': 'last_name',
    'type': 'STRING',
    'mode': 'NULLABLE'
}
project_id = '<my_project>'
dataset_id = '<my_dataset>'
table_id = '<my_table>'
client = bigquery.Client(project = project_id)
dataset = client.dataset(dataset_id)
table = dataset.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.schema = format_schema(table_schema)
job = client.load_table_from_json(json_data, table, job_config = job_config)
print(job.result())
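As far as I can tell from the documentation, this should work, but it doesn't.
I suspect the problem is with the JSON object json_data, and that BigQuery doesn't like something about its value: [{"id":2,"first_name":"Jane","last_name":"Doe"}]. Passing the lines parameter with a value of True makes no difference, even though the resulting JSON then has no square brackets: {"id":2,"first_name":"Jane","last_name":"Doe"}
I also tried the alternative values for the orient parameter described in the to_json() documentation, but they all produce the same error as the records value.
I also tried commenting out the line job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON, but got the same error.
Interestingly, the load_table_from_json() documentation contains a note stating:
If your data is already a newline-delimited JSON string, it is best to wrap it into a file-like object and pass it to load_table_from_file():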
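To make the two shapes concrete, here is a minimal standard-library sketch (it touches neither pandas nor BigQuery) that builds both the JSON-array form and the newline-delimited form from the same record. The variable names are illustrative only, and json.dumps adds spaces after separators, unlike to_json(). Either way the result is a Python str, not a list of row objects.

```python
import json

# Same record as in the dummy DataFrame above.
records = [{"id": 2, "first_name": "Jane", "last_name": "Doe"}]

# JSON array: a single document holding all rows
# (the shape produced with orient='records').
array_form = json.dumps(records)

# Newline-delimited JSON: one object per line, no enclosing brackets
# (the shape produced with orient='records', lines=True).
ndjson_form = "\n".join(json.dumps(row) for row in records)

print(array_form)
print(ndjson_form)
# Both values are plain strings.
print(type(array_form).__name__, type(ndjson_form).__name__)
```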
import io
from google.cloud import bigquery
data = u'{"foo": "bar"}'
data_as_file = io.StringIO(data)
client = bigquery.Client()
client.load_table_from_file(data_as_file, ...)
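If I apply this to my script and try to load the data, everything works. This shows that the connection to BigQuery is fine and that the problem really lies in loading the JSON in its raw form. There is no mention of load_table_from_json() being deprecated in favor of load_table_from_file(), so why doesn't it work?
For reference, here is the version of the script that uses the load_table_from_file() method to load the data into BigQuery: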
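For what it's worth, the reason the file-based route is content with a string can be seen without BigQuery at all: a newline-delimited JSON string wrapped in io.StringIO reads back as one complete object per line. A small sketch using only the standard library (the names and the second row are made up for illustration):

```python
import io
import json

# Two newline-delimited rows; the second row is invented for this example.
ndjson = (
    '{"id": 2, "first_name": "Jane", "last_name": "Doe"}\n'
    '{"id": 3, "first_name": "John", "last_name": "Roe"}'
)

# Wrap the string in a file-like object, as the documentation note suggests.
data_as_file = io.StringIO(ndjson)

# Iterating a file-like object yields lines; each line parses to one row dict.
parsed_rows = [json.loads(line) for line in data_as_file if line.strip()]

print(parsed_rows)
```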
import pandas as pd
import numpy as np
from google.cloud import bigquery
import os
import io
def format_schema(schema):
    formatted_schema = []
    for row in schema:
        formatted_schema.append(bigquery.SchemaField(row['name'], row['type'], row['mode']))
    return formatted_schema
df = pd.DataFrame([[2, 'Jane', 'Doe']],
                  columns=['id', 'first_name', 'last_name'])
### Additional parameter used to convert to newline delimited format
json_data = df.to_json(orient = 'records', lines = True)
stringio_data = io.StringIO(json_data)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"<My_Credentials_Path>\application_default_credentials.json"
table_schema = {
    'name': 'id',
    'type': 'INTEGER',
    'mode': 'REQUIRED'
}, {
    'name': 'first_name',
    'type': 'STRING',
    'mode': 'NULLABLE'
}, {
    'name': 'last_name',
    'type': 'STRING',
    'mode': 'NULLABLE'
}
project_id = '<my_project>'
dataset_id = '<my_dataset>'
table_id = '<my_table>'
client = bigquery.Client(project = project_id)
dataset = client.dataset(dataset_id)
table = dataset.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.schema = format_schema(table_schema)
job = client.load_table_from_file(stringio_data, table, job_config = job_config)
print(job.result())
The function client.load_table_from_json expects JSON-compatible Python objects (an iterable of dictionaries, one per row) rather than a string.
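A rough illustration of why a string trips up load_table_from_json, assuming (this is my reading of the client library's behaviour, not something the documentation spells out) that each item of the iterable is serialized with json.dumps and the results are joined with newlines. The helper rows_to_ndjson below is hypothetical and only mimics that assumed behaviour. Iterating over a str yields single characters, so the first "row" becomes a bare JSON string rather than an object, which would be consistent with the "Value encountered without start of object" error at position 0.

```python
import json

def rows_to_ndjson(json_rows):
    """Hypothetical stand-in: serialize each item on its own line."""
    return "\n".join(json.dumps(item) for item in json_rows)

# The string that df.to_json(orient='records') returns for the dummy data.
json_data = '[{"id":2,"first_name":"Jane","last_name":"Doe"}]'

# Passing the string itself: iteration yields characters, not row dicts,
# so the first line is the quoted character "[" instead of an object.
from_string = rows_to_ndjson(json_data)
first_line_bad = from_string.split("\n")[0]

# Passing the parsed rows: each line is a proper JSON object.
from_rows = rows_to_ndjson(json.loads(json_data))

print(first_line_bad)
print(from_rows)
```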
To fix it, first import the json module:
import json
After creating the JSON string from Pandas, parse it into Python objects:
json_object = json.loads(json_data)
Finally, pass the parsed JSON object to the loader:
job = client.load_table_from_json(json_object, table, job_config = job_config)
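If the same code path may receive either the raw to_json() string or already-parsed rows, a small guard can normalize the input before the call. This is only a sketch; ensure_rows is a made-up helper name, not part of google-cloud-bigquery.

```python
import json

def ensure_rows(data):
    """Return a list of row dicts from a JSON string or an iterable of dicts.

    Hypothetical helper for illustration only.
    """
    # A str/bytes payload is parsed first so we never iterate over characters.
    if isinstance(data, (str, bytes)):
        data = json.loads(data)
    # A single object is treated as a one-row payload.
    if isinstance(data, dict):
        data = [data]
    rows = list(data)
    for row in rows:
        if not isinstance(row, dict):
            raise TypeError(f"expected a dict per row, got {type(row).__name__}")
    return rows

rows = ensure_rows('[{"id":2,"first_name":"Jane","last_name":"Doe"}]')
print(rows)
```

The returned list can then be passed where json_object is used above.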
So your code would look like this:
import pandas as pd
import numpy as np
from google.cloud import bigquery
import os, json
### Converts schema dictionary to BigQuery's expected format for job_config.schema
def format_schema(schema):
    formatted_schema = []
    for row in schema:
        formatted_schema.append(bigquery.SchemaField(row['name'], row['type'], row['mode']))
    return formatted_schema
### Create dummy data to load
df = pd.DataFrame([[2, 'Jane', 'Doe']],
                  columns=['id', 'first_name', 'last_name'])
### Convert dataframe to JSON object
json_data = df.to_json(orient = 'records')
json_object = json.loads(json_data)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"<My_Credentials_Path>\application_default_credentials.json"
### Define schema as on BigQuery table, i.e. the fields id, first_name and last_name
table_schema = {
    'name': 'id',
    'type': 'INTEGER',
    'mode': 'REQUIRED'
}, {
    'name': 'first_name',
    'type': 'STRING',
    'mode': 'NULLABLE'
}, {
    'name': 'last_name',
    'type': 'STRING',
    'mode': 'NULLABLE'
}
project_id = '<my_project>'
dataset_id = '<my_dataset>'
table_id = '<my_table>'
client = bigquery.Client(project = project_id)
dataset = client.dataset(dataset_id)
table = dataset.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.schema = format_schema(table_schema)
job = client.load_table_from_json(json_object, table, job_config = job_config)
print(job.result())
Please let me know if this helps.