And*_*kes 9 google-bigquery google-cloud-platform
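I still can't join on a repeated nested field while preserving the original row structure in BigQuery.
In my example, I'll call the two tables being joined A and B.
Records in table A look like this: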
{
"url":"some url",
"repeated_nested": [
{"key":"some key","property":"some property"}
]
}
Records in table B look like this:
{
"key":"some key",
"property2": "another property"
}
I'd like to find a way to join this data together to produce rows that look like this:
{
"url":"some url",
"repeated_nested": [
{
"key":"some key",
"property":"some property",
"property2":"another property"
}
]
}
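To make the goal concrete, the intended result can be sketched in plain JavaScript, outside BigQuery. The helper name `enrichRows` and the in-memory arrays are made up for illustration; only the field names come from the examples above.

```javascript
// Hypothetical helper: left-joins table B onto every element of A's
// repeated_nested field by key, keeping exactly one output row per input row.
function enrichRows(tableA, tableB) {
  // Index table B by key so each lookup is constant time.
  const byKey = new Map(tableB.map((b) => [b.key, b.property2]));
  return tableA.map((row) => ({
    url: row.url,
    repeated_nested: row.repeated_nested.map((item) => ({
      key: item.key,
      property: item.property,
      // Left-join semantics: null when table B has no matching key.
      property2: byKey.has(item.key) ? byKey.get(item.key) : null,
    })),
  }));
}

const tableA = [
  {
    url: "some url",
    repeated_nested: [{ key: "some key", property: "some property" }],
  },
];
const tableB = [{ key: "some key", property2: "another property" }];

const joined = enrichRows(tableA, tableB);
console.log(JSON.stringify(joined[0], null, 2));
```

The row count and nesting are unchanged; only the elements inside `repeated_nested` gain the extra property.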
The first query I tried was:
SELECT
url, repeated_nested.key, repeated_nested.property, repeated_nested.property2
FROM A
AS lefttable
LEFT OUTER JOIN B
AS righttable
ON lefttable.key=righttable.key
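This doesn't work because BQ can't join on a repeated nested field. There is no unique identifier for each row. If I do a FLATTEN on repeated_nested, I don't know how to put the original rows back together correctly.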
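The regrouping problem can be illustrated with a plain JavaScript sketch. Both `flatten` and `regroupByUrl` are hypothetical stand-ins, not BigQuery functions, and url is assumed to be the only column available as a grouping key.

```javascript
// Hypothetical stand-in for FLATTEN: one output row per nested element.
function flatten(rows) {
  return rows.flatMap((row) =>
    row.repeated_nested.map((item) => ({ url: row.url, ...item }))
  );
}

// Regroups flattened rows, using url as the only available grouping key.
function regroupByUrl(flatRows) {
  const groups = new Map();
  for (const { url, ...item } of flatRows) {
    if (!groups.has(url)) groups.set(url, []);
    groups.get(url).push(item);
  }
  return [...groups].map(([url, repeated_nested]) => ({ url, repeated_nested }));
}

// Two distinct original rows that happen to share a url.
const original = [
  { url: "some url", repeated_nested: [{ key: "k1", property: "p1" }] },
  { url: "some url", repeated_nested: [{ key: "k1", property: "p1" }] },
];

const regrouped = regroupByUrl(flatten(original));
// The two input rows collapse into a single row holding two nested elements.
console.log(original.length, regrouped.length);
```

Without a per-row identifier, nothing in the flattened data says where one original row ended and the next began.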
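The data is such that a given url will always have the same repeated_nested field. Because of that, I was able to build a workaround with a UDF: roll the repeated nested object up into a JSON string, then expand it again: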
SELECT url, repeated_nested.key, repeated_nested.property, repeated_nested.property2
FROM
JS(
  (
    SELECT basetable.url as url, companytable.repeated_nested_json as repeated_nested_json
    FROM A as basetable
    LEFT JOIN (
      SELECT url, CONCAT("[", GROUP_CONCAT_UNQUOTED(repeated_nested_json, ","), "]") as repeated_nested_json
      FROM
      (
        SELECT
          url,
          CONCAT(
            '{"key": "', repeated_nested.key, '",',
            ' "property": "', repeated_nested.property, '",',
            ' "property2": "', mapping_table.property2, '"',
            '}'
          ) as repeated_nested_json
        FROM (
          SELECT
            url, repeated_nested.key, repeated_nested.property
          FROM A
          GROUP BY url, repeated_nested.key, repeated_nested.property
        ) as urltable
        LEFT OUTER JOIN [SDF.alchemy_to_ric]
        AS mapping_table
        ON urltable.repeated_nested.key=mapping_table.key
      )
      GROUP BY url
    ) as companytable
    ON basetable.url = companytable.url
  ),
  // input columns:
  url, repeated_nested_json,
  // output schema:
  "[{'name': 'url', 'type': 'string'},
    {'name': 'repeated_nested', 'type': 'RECORD', 'mode':'REPEATED', 'fields':
      [ { 'name': 'key', 'type':'string' },
        { 'name': 'property', 'type':'string' },
        { 'name': 'property2', 'type':'string' }]
    }]",
  // UDF:
  "function(row, emit) {
    // Fall back to an empty array when the JSON is missing or malformed.
    var parsed_repeated_nested = [];
    try {
      if ( row.repeated_nested_json != null ) {
        parsed_repeated_nested = JSON.parse(row.repeated_nested_json);
      }
    } catch (ex) { }
    emit({
      url: row.url,
      repeated_nested: parsed_repeated_nested
    });
  }"
)
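This solution works for small tables. But the real-life tables I'm working with have many more columns than the example above. When there are fields other than url and repeated_nested_json, they all have to be passed through the UDF. When I work with tables in the 50 GB range, everything is fine. But when I apply the UDF and query to tables of 500-1000 GB, I get an Internal Server Error from BQ.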
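In the end, I just need all of the data in newline-delimited JSON format in GCS. As a last-ditch effort, I tried concatenating all of the fields into a single JSON string (so that I only have one column), hoping I could export it as CSV and get what I need. However, the export process escaped the double quotes and wrapped the JSON string in another pair of double quotes. According to the BQ jobs documentation (https://cloud.google.com/bigquery/docs/reference/v2/jobs), there is a property configuration.query.tableDefinitions.(key).csvOptions.quote that might help me. But I can't figure out how to make it work.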
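The quoting described above matches standard CSV escaping, where a field is wrapped in double quotes and every embedded double quote is doubled. A small JavaScript sketch of that rule and its inverse follows; both function names are hypothetical, and this is a post-processing idea for a single-column export, not a BigQuery feature.

```javascript
// Standard CSV quoting: wrap the field in double quotes and double any
// double quotes inside it.
function csvQuote(field) {
  return '"' + field.replace(/"/g, '""') + '"';
}

// Inverse for one single-column line: strip the outer quotes and collapse
// doubled quotes back into single ones.
function csvUnquote(line) {
  if (line.length >= 2 && line.startsWith('"') && line.endsWith('"')) {
    return line.slice(1, -1).replace(/""/g, '"');
  }
  return line; // The field was not quoted, so return it unchanged.
}

const jsonLine = JSON.stringify({ url: "some url", repeated_nested: [] });
const exported = csvQuote(jsonLine);
const restored = csvUnquote(exported);

console.log(exported);
console.log(restored === jsonLine);
```

Applied line by line to the exported file, `csvUnquote` would turn it back into newline-delimited JSON, assuming no JSON string contains a raw line break.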
Does anyone have advice on how they have dealt with this situation?