How to use BigQuery correlation based on multiple columns?

ali*_*ali 5 google-bigquery

Given a dataset with 100k rows and 100 columns, how can I use BigQuery's CORR() to find the correlation between rows?

The schema is:

id:integer, feature1:float, feature2:float, ..., feature100:float

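Edit: This is not a rolling-window time-series correlation question. Each row is an observation of 100 features, and I want to use BigQuery to find the top N most similar observations for each row.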

Fel*_*ffa 8

Do you want to find the correlation between each column and every other column?

That would look something like this:

SELECT CORR(col1, col2), CORR(col1, col3), CORR(col1, col4),..., CORR(col99, col100)
FROM [mytable]

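This could take a long time to write (unless you automate it). As an alternative, consider a different schema, where everything lives in 3 columns. The transformation would run like this: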

SELECT colname, value, rowid FROM
(SELECT 'col1' AS colname, col1 AS value, rowid FROM [mytable]),
(SELECT 'col2' AS colname, col2 AS value, rowid FROM [mytable]),
(SELECT 'col3' AS colname, col3 AS value, rowid FROM [mytable]),
...
(SELECT 'col100' AS colname, col100 AS value, rowid FROM [mytable])
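Typing out 100 near-identical subqueries is error-prone, so the transformation query can be generated instead. The following is only a sketch in Python: the function name `build_unpivot_query` and its parameters are made up for illustration, and the table and column names (`mytable`, `col1` ... `col100`, `rowid`) are the placeholders used in this answer, not your real schema.

```python
def build_unpivot_query(table, columns, id_col="rowid"):
    """Build a legacy-SQL query reshaping wide columns into (colname, value, id)."""
    subqueries = [
        f"(SELECT '{c}' AS colname, {c} AS value, {id_col} FROM [{table}])"
        for c in columns
    ]
    # In BigQuery legacy SQL, a comma between subqueries in FROM means UNION ALL.
    return f"SELECT colname, value, {id_col} FROM\n" + ",\n".join(subqueries)


# Placeholder column names col1..col100, matching the example above.
columns = [f"col{i}" for i in range(1, 101)]
query = build_unpivot_query("mytable", columns)
print(query.splitlines()[1])  # first generated subquery
```

The resulting string can be pasted into the query editor or submitted through whichever client you already use.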

With this schema you can run all the combined column correlations with a simpler query:

SELECT CORR(a.value, b.value) corr, a.colname, b.colname
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.rowid=b.rowid
WHERE a.colname>b.colname
GROUP BY a.colname, b.colname

(That's what I did on the article linked by @Tjorriemorrie - http://googlecloudplatform.blogspot.mx/2013/09/introducing-corr-to-google-bigquery.html)

Note that the first query might be more complex to write than this last one, but I suspect it will take less time to run, as no shuffling will be required.
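If you prefer the first, wide query, it is the one worth automating. Here is a rough Python sketch that emits one CORR() expression per column pair. The helper name `build_corr_query` and the `corr_<a>_<b>` output aliases are my own additions for readability, and the column names are placeholders. The generated text is long, so check it against BigQuery's query-length limits before running it.

```python
from itertools import combinations


def build_corr_query(table, columns):
    """Build a legacy-SQL query with one CORR() per unordered column pair."""
    exprs = [
        f"CORR({a}, {b}) AS corr_{a}_{b}"  # alias is an illustrative choice
        for a, b in combinations(columns, 2)
    ]
    return "SELECT\n  " + ",\n  ".join(exprs) + f"\nFROM [{table}]"


# Placeholder column names col1..col100.
columns = [f"col{i}" for i in range(1, 101)]
query = build_corr_query("mytable", columns)
print(query.count("CORR("))  # 100 * 99 / 2 = 4950 pairs
```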

Since this question asks about rows, the initial transformation would be similar, but slightly different:

SELECT column, value, rowid FROM
  (SELECT 'c1' column, c1 AS value, rowid FROM [mytable]),
  (SELECT 'c2' column, c2 AS value, rowid FROM [mytable]),
  (SELECT 'c3' column, c3 AS value, rowid FROM [mytable]) 

Then the correlation between rows would be computed as in:

SELECT CORR(a.value, b.value), a.rowid, b.rowid
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.column=b.column
WHERE a.rowid < b.rowid
GROUP BY a.rowid, b.rowid
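To see what that row-pair query computes, here is a small local sketch in plain Python (not BigQuery) that does the same thing on a toy long-format table: it matches values by column, computes the Pearson correlation (which is what CORR() returns) for each pair of rows, and then ranks the pairs to get the top N for one row. The data and the helper names `pearson` and `top_n` are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy long-format data: (column, value, rowid). Values are made up.
long_rows = [
    ("c1", 1.0, 1), ("c2", 2.0, 1), ("c3", 3.0, 1),
    ("c1", 2.0, 2), ("c2", 4.0, 2), ("c3", 6.0, 2),
    ("c1", 3.0, 3), ("c2", 2.0, 3), ("c3", 1.0, 3),
]

# Group values by row, keyed by column (the equivalent of joining on column).
by_row = defaultdict(dict)
for column, value, rowid in long_rows:
    by_row[rowid][column] = value

# One correlation per row pair with a < b, like WHERE a.rowid < b.rowid.
row_corr = {}
for a, b in combinations(sorted(by_row), 2):
    shared = sorted(by_row[a].keys() & by_row[b].keys())
    row_corr[(a, b)] = pearson([by_row[a][c] for c in shared],
                               [by_row[b][c] for c in shared])


def top_n(rowid, n):
    """Return the n rows most correlated with rowid as (other_rowid, corr)."""
    scored = [(b if a == rowid else a, r)
              for (a, b), r in row_corr.items() if rowid in (a, b)]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:n]


print(top_n(1, 1))
```

Because each pair appears only once (a < b), the ranking step has to look at both sides of the pair. In BigQuery the same consideration applies when turning the pairwise result into a top-N list per row.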