Rut*_*ste 4 regex google-bigquery gdelt
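I'm following this tutorial to explore Google BigQuery's features with the GDELT database, but the tutorial uses the legacy SQL dialect and I would like to use standard SQL.
In legacy SQL: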
SELECT
theme,
COUNT(*) AS count
FROM (
SELECT
REGEXP_REPLACE(SPLIT(V2Themes,';'), r',.*',"") theme
from [gdelt-bq:gdeltv2.gkg]
where DATE>20150302000000 and DATE < 20150304000000 and V2Persons like '%Netanyahu%'
)
group by theme
ORDER BY 2 DESC
LIMIT 300
When I try to translate it to standard SQL:
SELECT
theme,
COUNT(*) AS count
FROM (
SELECT
REGEXP_REPLACE(SPLIT(V2Themes,';') , r',.*', " ") AS theme
FROM
`gdelt-bq.gdeltv2.gkg`
WHERE
DATE>20150302000000
AND DATE < 20150304000000
AND V2Persons LIKE '%Netanyahu%' )
GROUP BY
theme
ORDER BY
2 DESC
LIMIT
300
It raises the following error:
No matching signature for function REGEXP_REPLACE for argument types: ARRAY<STRING>, STRING, STRING. Supported signatures: REGEXP_REPLACE(STRING, STRING, STRING); REGEXP_REPLACE(BYTES, BYTES, BYTES) at [6:5]
It seems I have to convert the result of SPLIT() to a string. How do I do that?
Update: I found a talk that explains the UNNEST operation:
SELECT
COUNT(*),
REGEXP_REPLACE(themes,",.*","") AS theme
FROM
`gdelt-bq.gdeltv2.gkg_partitioned`,
UNNEST( SPLIT(V2Themes,";") ) AS themes
WHERE
_PARTITIONTIME >= "2018-08-09 00:00:00"
AND _PARTITIONTIME < "2018-08-10 00:00:00"
AND V2Persons LIKE '%Netanyahu%'
GROUP BY
theme
ORDER BY
1 DESC
LIMIT
100
Flatten the array first:
SELECT
REGEXP_REPLACE(theme , r',.*', " ") AS theme,
COUNT(*) AS count
FROM
`gdelt-bq.gdeltv2.gkg`,
UNNEST(SPLIT(V2Themes,';')) AS theme
WHERE
DATE>20150302000000
AND DATE < 20150304000000
AND V2Persons LIKE '%Netanyahu%'
GROUP BY
theme
ORDER BY
2 DESC
LIMIT
300
The legacy SQL version in the question also flattens the array, although there it happens implicitly through the GROUP BY on theme.
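To see the flatten-then-replace pattern in isolation, here is a minimal sketch in standard SQL that runs against inline toy data instead of the GDELT table. The `sample` CTE, its two rows, and the names `theme_entry` and `theme_count` are made up for illustration; only the `V2Themes` column name and the `;` / `,` layout are taken from the queries above.

```sql
-- Toy rows shaped like V2Themes: "THEME,offset;THEME,offset;..."
-- (hypothetical values, not real GDELT data)
WITH sample AS (
  SELECT 'TAX_FNCACT,12;LEADER,40' AS V2Themes
  UNION ALL
  SELECT 'LEADER,7;ELECTION,95'
)
SELECT
  -- theme_entry is a single STRING here, so REGEXP_REPLACE accepts it
  REGEXP_REPLACE(theme_entry, r',.*', '') AS theme,
  COUNT(*) AS theme_count
FROM
  sample,
  -- SPLIT returns ARRAY<STRING>; UNNEST turns each element into its own row
  UNNEST(SPLIT(V2Themes, ';')) AS theme_entry
GROUP BY
  theme
ORDER BY
  theme_count DESC
```

Giving the unnested element a different name (`theme_entry`) from the output alias (`theme`) keeps it unambiguous which one GROUP BY refers to. With this toy input, LEADER appears in both rows and so is counted twice, while the other two themes are counted once each.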