Shi*_*gan 0 sql hive presto apache-spark-sql trino
我有一个包含用户名(约 1 000 行)的表,称为“潜在用户”,另一个表称为“实际用户”(约 1000 万行)。所有记录都完全由 [az] 字符组成,没有空格。此外,我知道实际用户表中没有潜在用户。
我希望能够根据 Levenshtein 距离,计算 possible_users 中的每一行,actual_users 中最接近的记录是什么。例如:
| potential_users|
|----------------|
| user1 |
| kajd |
| bbbbb |
Run Code Online (Sandbox Code Playgroud)
和
| actual_users |
|--------------|
| kaj |
| bbbbbbb |
| user |
Run Code Online (Sandbox Code Playgroud)
将返回:
| potential_users | actual_users | levenshtein_distance |
|-----------------|--------------|----------------------|
| user1 | user | 1 |
| kajd | kaj | 1 |
| bbbbb | bbbbbbb | 2 |
Run Code Online (Sandbox Code Playgroud)
如果表很短,我可以创建一个交叉联接,计算潜在用户中的每条记录与实际用户中的编辑距离,然后返回具有最低值的记录。然而,在我的例子中,这将创建一个 1 000 x 10 000 000 行的中间表,这有点不切实际。
是否有更干净的方法通过创建交叉连接来执行此类操作?
不幸的是,如果没有交叉连接就无法做到这一点。归根结底,每个潜在用户都需要针对每个实际用户进行测试。
\n然而,Trino(以前称为Presto SQL)将跨多个线程和机器并行执行联接,因此只要有足够的硬件,它就可以非常快地执行。请注意,在 Trino 中,中间结果从一个运算符流式传输到另一个运算符,因此该查询不存在具有 10M x 1k 行的“中间表”。
\n对于像这样的查询
\nSELECT potential, min_by(actual, distance), min(distance)\nFROM (\n SELECT *, levenshtein_distance(potential, actual) distance\n FROM actual_users, potential_users\n)\nGROUP BY potential\nRun Code Online (Sandbox Code Playgroud)\n这是查询计划:
\n Query Plan \n----------------------------------------------------------------------------------------------------------------\n Fragment 0 [SINGLE] \n Output layout: [potential, min_by, min] \n Output partitioning: SINGLE [] \n Stage Execution Strategy: UNGROUPED_EXECUTION \n Output[potential, _col1, _col2] \n \xe2\x94\x82 Layout: [potential:varchar(5), min_by:varchar(7), min:bigint] \n \xe2\x94\x82 Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?} \n \xe2\x94\x82 _col1 := min_by \n \xe2\x94\x82 _col2 := min \n \xe2\x94\x94\xe2\x94\x80 RemoteSource[1] \n Layout: [potential:varchar(5), min_by:varchar(7), min:bigint] \n \n Fragment 1 [HASH] \n Output layout: [potential, min_by, min] \n Output partitioning: SINGLE [] \n Stage Execution Strategy: UNGROUPED_EXECUTION \n Aggregate(FINAL)[potential] \n \xe2\x94\x82 Layout: [potential:varchar(5), min:bigint, min_by:varchar(7)] \n \xe2\x94\x82 Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?} \n \xe2\x94\x82 min := min("min_1") \n \xe2\x94\x82 min_by := min_by("min_by_0") \n \xe2\x94\x94\xe2\x94\x80 LocalExchange[HASH] ("potential") \n \xe2\x94\x82 Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))] \n \xe2\x94\x82 Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?} \n \xe2\x94\x94\xe2\x94\x80 RemoteSource[2] \n Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))] \n \n Fragment 2 [SOURCE] \n Output layout: [potential, min_1, min_by_0] \n Output partitioning: HASH [potential] \n Stage Execution Strategy: UNGROUPED_EXECUTION \n Aggregate(PARTIAL)[potential] \n \xe2\x94\x82 Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))] \n \xe2\x94\x82 min_1 := min("levenshtein_distance") \n \xe2\x94\x82 min_by_0 := min_by("actual", "levenshtein_distance") \n \xe2\x94\x94\xe2\x94\x80 Project[] \n \xe2\x94\x82 Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint] \n \xe2\x94\x82 Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?} \n \xe2\x94\x82 levenshtein_distance := levenshtein_distance("potential", "actual") \n \xe2\x94\x94\xe2\x94\x80 CrossJoin \n \xe2\x94\x82 Layout: [actual:varchar(7), potential:varchar(5)] \n \xe2\x94\x82 Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?} \n \xe2\x94\x82 Distribution: REPLICATED \n \xe2\x94\x9c\xe2\x94\x80 TableScan[memory:9, grouped = false] \n \xe2\x94\x82 Layout: [actual:varchar(7)] \n \xe2\x94\x82 Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B} \n \xe2\x94\x82 actual := 0 \n \xe2\x94\x94\xe2\x94\x80 LocalExchange[SINGLE] () \n \xe2\x94\x82 Layout: [potential:varchar(5)] \n \xe2\x94\x82 Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: ?} \n \xe2\x94\x94\xe2\x94\x80 RemoteSource[3] \n Layout: [potential:varchar(5)] \n \n Fragment 3 [SOURCE] \n Output layout: [potential] \n Output partitioning: BROADCAST [] \n Stage Execution Strategy: UNGROUPED_EXECUTION \n TableScan[memory:8, grouped = false] \n Layout: [potential:varchar(5)] \n Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B} \n potential := 0 \n \n \n(1 row)\n\nRun Code Online (Sandbox Code Playgroud)\n特别是,对于本部分,一旦交叉连接生成一行,它就会被输入到投影运算符中,计算两个值之间的编辑距离,然后输入到聚合中,每个“潜在”仅存储一个组用户。因此,该查询所需的内存量应该较低。
\n Aggregate(PARTIAL)[potential] \n \xe2\x94\x82 Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))] \n \xe2\x94\x82 min_1 := min("levenshtein_distance") \n \xe2\x94\x82 min_by_0 := min_by("actual", "levenshtein_distance") \n \xe2\x94\x94\xe2\x94\x80 Project[] \n \xe2\x94\x82 Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint] \n \xe2\x94\x82 Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?} \n \xe2\x94\x82 levenshtein_distance := levenshtein_distance("potential", "actual") \n \xe2\x94\x94\xe2\x94\x80 CrossJoin \n \xe2\x94\x82 Layout: [actual:varchar(7), potential:varchar(5)] \n \xe2\x94\x82 Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?} \n \xe2\x94\x82 Distribution: REPLICATED \nRun Code Online (Sandbox Code Playgroud)\n