cgn*_*utt 2 sql data-analysis cross-join aggregation google-bigquery
Suppose you have the following table in BigQuery:
A = user1 | 0 0 |
user2 | 0 3 |
user3 | 4 0 |
After a cross join, you have
dist = |user1 user2 0 0 , 0 3 | # the comma only separates the two users' values
|user1 user3 0 0 , 4 0 |
|user2 user3 0 3 , 4 0 |
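To make the pairing concrete, here is a minimal Python sketch that builds the same three rows from the sample table. The dictionary layout and the variable names are assumptions chosen for illustration; they say nothing about how BigQuery stores or joins the data.

```python
from itertools import combinations

# Sample table A from above: user -> row values.
A = {
    "user1": [0, 0],
    "user2": [0, 3],
    "user3": [4, 0],
}

# Every unordered pair of distinct users, with both rows side by side.
dist = [(u1, u2, A[u1], A[u2]) for u1, u2 in combinations(A, 2)]

for row in dist:
    print(row)
```

Note that a plain SQL cross join would also return each pair in the reverse order and each user paired with itself; the sketch keeps only the three pairs shown above.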
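How do you perform row-wise aggregation in BigQuery to compute pairwise aggregates across rows? A typical use case is computing the Euclidean distance between two users. I want to compute the following metric between two users: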
sum(min(user1_row[i], user2_row[i]) / abs(user1_row[i] - user2_row[i]))
summed over all i for each pair of users.
For example, in Python you would simply do:
# r1 and r2 hold the values for user1 and user2
# Note: raises ZeroDivisionError wherever r1[i] == r2[i]
total = sum(min(r1[i], r2[i]) / abs(r1[i] - r2[i]) for i in range(len(r1)))
dist.append([user1, user2, total])
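Starting with the ugly way: you can expand the math into the query. That is, translate for i in ... sum(min(...)/abs(...)) into SQL that operates on each field. Note that MIN and SUM are aggregate functions, which you don't want here. Instead, use + for SUM and IF(a < b, a, b) for MIN. ABS(a - b) would look like IF(a < b, b-a, a-b). If you were just computing the Euclidean distance, you could do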
SELECT left.user, right.user,
SQRT((left.x-right.x)*(left.x-right.x)
+ (left.y-right.y)*(left.y-right.y)
+ (left.z-right.z)*(left.z-right.z)) as dist
FROM (
SELECT *
FROM dataset.table1 AS left
CROSS JOIN dataset.table1 AS right)
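The same "expand the math by hand" idea can be sketched in Python. This is only an illustration of the technique: the function names are hypothetical, and the column names x, y and z are taken from the query above.

```python
import math


def euclidean_unrolled(left, right):
    # One squared-difference term per column, joined with "+",
    # mirroring the SQRT(...) expression in the query above.
    return math.sqrt(
        (left["x"] - right["x"]) * (left["x"] - right["x"])
        + (left["y"] - right["y"]) * (left["y"] - right["y"])
        + (left["z"] - right["z"]) * (left["z"] - right["z"])
    )


def scalar_min(a, b):
    # Counterpart of IF(a < b, a, b): a per-row minimum, not an aggregate.
    return a if a < b else b


def scalar_abs_diff(a, b):
    # Counterpart of IF(a < b, b-a, a-b): absolute difference without ABS.
    return b - a if a < b else a - b


print(euclidean_unrolled({"x": 0, "y": 0, "z": 0}, {"x": 3, "y": 4, "z": 0}))
```

The drawback is visible here as well as in the SQL: every extra column needs another hand-written term.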
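A better approach is to use user-defined functions (UDFs) and store the vectors as repeated values. You can then write a DISTANCE() function that performs the computation on the two arrays from the left and right sides of the cross join. If you are not in the UDF beta program and would like to join, contact Google Cloud support.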
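As a rough idea of what such a function would compute, here is a Python sketch of the metric from the question applied to two arrays. It is not BigQuery UDF code. The name distance and the handling of equal values are assumptions: the question does not say what should happen when the denominator is zero, so this sketch skips those positions.

```python
def distance(left, right):
    """Hypothetical DISTANCE(): sum of min(l, r) / abs(l - r) over positions."""
    if len(left) != len(right):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for l, r in zip(left, right):
        if l == r:
            # Assumption: skip equal values, where abs(l - r) would be zero.
            continue
        total += min(l, r) / abs(l - r)
    return total


print(distance([1, 5], [3, 2]))
```

Because the function takes whole arrays, adding more fields to a user changes the data only, not the code.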
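Finally, if you change your schema from {user:string, field1:float, field2:float, field3:float,...} to {user:string, fields:[field:float]}, you can flatten the repeated field together with its position and cross join on that. For example: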
SELECT
user,
field,
index,
FROM (FLATTEN((
SELECT
user,
fields.field as field,
POSITION(fields.field) as index,
from [dataset1.table1]
), fields))
If you save this as a view, call it "dataset1.flat_view".
Then you can join:
SELECT left.user as user1, right.user as user2,
left.field as l, right.field as r,
FROM dataset1.flat_view left
JOIN dataset1.flat_view right
ON left.index = right.index
WHERE left.user != right.user
This gives you one row for each pair of users and each matching field position. You can save this as the view "dataset1.joined_view".
Finally, you can do the aggregation. Since you want this:
sum(min(user1_row[i], user2_row[i]) / abs(user1_row[i] - user2_row[i]))
it would look like:
SELECT user1, user2,
SUM((IF(l < r, l, r)) / (IF(l > r, l-r, r-l)))
FROM [dataset1.joined_view]
GROUP EACH BY user1, user2
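The three steps (flatten with position, join on position, aggregate per pair) can be traced in plain Python. This is a sketch for checking the logic, not a replacement for the queries: the data values are invented for illustration, and skipping terms with a zero denominator is an assumption, since the original does not state how that case should be handled.

```python
from collections import defaultdict

# Illustrative data in the repeated-field schema {user, fields: [...]}.
table = [
    {"user": "user1", "fields": [1.0, 5.0]},
    {"user": "user2", "fields": [3.0, 2.0]},
    {"user": "user3", "fields": [4.0, 1.0]},
]

# Step 1 (flat_view): one row per (user, field, index), 1-based like POSITION.
flat_view = [
    (row["user"], field, index)
    for row in table
    for index, field in enumerate(row["fields"], start=1)
]

# Step 2 (joined_view): self-join on index, keeping only distinct users.
joined_view = [
    (u1, u2, l, r)
    for u1, l, i1 in flat_view
    for u2, r, i2 in flat_view
    if i1 == i2 and u1 != u2
]

# Step 3: group by (user1, user2) and sum the per-field terms.
totals = defaultdict(float)
for u1, u2, l, r in joined_view:
    if l == r:
        continue  # Assumption: skip terms whose denominator would be zero.
    totals[(u1, u2)] += (l if l < r else r) / (l - r if l > r else r - l)

for pair, value in sorted(totals.items()):
    print(pair, value)
```

As with the join condition left.user != right.user, each pair shows up in both orders, for example (user1, user2) and (user2, user1), with the same value.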