SQL: Jaccard similarity

Asp*_*per 5 sql google-bigquery

My table looks like this:

author | group 

daniel | group1,group2,group3,group4,group5,group8,group10
adam   | group2,group5,group11,group12
harry  | group1,group10,group15,group13,group15,group18
...
...

I would like my output to look like this:

author1 | author2 | intersection | union

daniel | adam  | 2 | 9
daniel | harry | 2 | 11
adam   | harry | 0 | 10

Thank you.

Mik*_*ant 5

Try the following (for BigQuery Legacy SQL):

SELECT
  a.author AS author1, 
  b.author AS author2, 
  SUM(a.item=b.item) AS intersection, 
  EXACT_COUNT_DISTINCT(a.item) + EXACT_COUNT_DISTINCT(b.item) - intersection AS [union]
FROM FLATTEN((
  SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS a
CROSS JOIN FLATTEN((
  SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS b
WHERE a.author < b.author 
GROUP BY 1,2

Added a solution for BigQuery Standard SQL:

WITH YourTable AS (
  SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
  SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
  SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
  SELECT author, SPLIT(grp) AS grp
  FROM YourTable
)
SELECT 
  a.author AS author1, 
  b.author  AS author2,
  -- Number of groups listed for each author
  (SELECT COUNT(1) FROM a.grp) AS count1,
  (SELECT COUNT(1) FROM b.grp) AS count2,
  -- Groups that appear in both arrays
  (SELECT COUNT(1) FROM UNNEST(a.grp) AS agrp JOIN UNNEST(b.grp) AS bgrp ON agrp = bgrp) AS intersection_count,
  -- Distinct groups across both arrays
  (SELECT COUNT(1) FROM (SELECT * FROM UNNEST(a.grp) UNION DISTINCT SELECT * FROM UNNEST(b.grp))) AS union_count
FROM tempTable a
JOIN tempTable b
ON a.author < b.author

Why I like this version:

  • Simpler, friendlier code
  • No need for CROSS JOIN and an extra GROUP BY

When/if you try it, make sure to uncheck the Use Legacy SQL checkbox under Show Options.
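
Neither query above returns the Jaccard ratio from the title, only the intersection and union counts. A minimal sketch of one way to extend the Standard SQL version follows. The `pairs` CTE, the `jaccard` column, and the `SELECT DISTINCT` step are my additions for illustration, not part of the original answer. It uses the sample row from the question, where harry lists group15 twice.

```sql
-- Sketch only: pairs, jaccard and the DISTINCT step are illustrative additions.
WITH YourTable AS (
  SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
  SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
  SELECT 'harry' AS author, 'group1,group10,group15,group13,group15,group18' AS grp
),
tempTable AS (
  -- Split the comma-separated string and drop repeated groups,
  -- so each author's array behaves like a set.
  SELECT author, ARRAY(SELECT DISTINCT g FROM UNNEST(SPLIT(grp)) AS g) AS grp
  FROM YourTable
),
pairs AS (
  SELECT
    a.author AS author1,
    b.author AS author2,
    -- Groups that appear in both arrays
    (SELECT COUNT(1) FROM UNNEST(a.grp) AS agrp JOIN UNNEST(b.grp) AS bgrp ON agrp = bgrp) AS intersection_count,
    -- Distinct groups across both arrays
    (SELECT COUNT(1) FROM (SELECT * FROM UNNEST(a.grp) UNION DISTINCT SELECT * FROM UNNEST(b.grp))) AS union_count
  FROM tempTable a
  JOIN tempTable b
  ON a.author < b.author
)
SELECT
  author1,
  author2,
  intersection_count,
  union_count,
  -- SAFE_DIVIDE returns NULL instead of raising an error when union_count is 0
  SAFE_DIVIDE(intersection_count, union_count) AS jaccard
FROM pairs
ORDER BY author1, author2
```

Because the groups are deduplicated first, harry contributes five groups rather than six, so the union counts for pairs involving harry come out one lower than in the question's expected table. Drop the `SELECT DISTINCT` step if repeated groups should count separately, but then the ratio is no longer a set-based Jaccard similarity.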