与 Levenshtein 距离的模糊连接

Shi*_*gan 0 sql hive presto apache-spark-sql trino

我有一个包含用户名(约 1 000 行)的表,称为“潜在用户”,另一个表称为“实际用户”(约 1000 万行)。所有记录都完全由 [az] 字符组成,没有空格。此外,我知道实际用户表中没有潜在用户。

我希望能够根据 Levenshtein 距离,计算 possible_users 中的每一行,actual_users 中最接近的记录是什么。例如:

| potential_users|
|----------------|
| user1          |
| kajd           |
| bbbbb          |
Run Code Online (Sandbox Code Playgroud)

| actual_users |
|--------------|
| kaj          |
| bbbbbbb      |
| user         |
Run Code Online (Sandbox Code Playgroud)

将返回:

| potential_users | actual_users | levenshtein_distance |
|-----------------|--------------|----------------------|
| user1           | user         | 1                    |
| kajd            | kaj          | 1                    |
| bbbbb           | bbbbbbb      | 2                    |
Run Code Online (Sandbox Code Playgroud)

如果表很短,我可以创建一个交叉联接,计算潜在用户中的每条记录与实际用户中的编辑距离,然后返回具有最低值的记录。然而,在我的例子中,这将创建一个 1 000 x 10 000 000 行的中间表,这有点不切实际。

是否有更干净的方法通过创建交叉连接来执行此类操作?

Mar*_*rso 5

不幸的是,如果没有交叉连接就无法做到这一点。归根结底,每个潜在用户都需要针对每个实际用户进行测试。

\n

然而,Trino(以前称为Presto SQL)将跨多个线程和机器并行执行联接,因此只要有足够的硬件,它就可以非常快地执行。请注意,在 Trino 中,中间结果从一个运算符流式传输到另一个运算符,因此该查询不存在具有 10M x 1k 行的“中间表”。

\n

对于像这样的查询

\n
SELECT potential, min_by(actual, distance), min(distance)\nFROM (\n    SELECT *, levenshtein_distance(potential, actual) distance\n    FROM actual_users, potential_users\n)\nGROUP BY potential\n
Run Code Online (Sandbox Code Playgroud)\n

这是查询计划:

\n
                                                   Query Plan                                                   \n----------------------------------------------------------------------------------------------------------------\n Fragment 0 [SINGLE]                                                                                            \n     Output layout: [potential, min_by, min]                                                                    \n     Output partitioning: SINGLE []                                                                             \n     Stage Execution Strategy: UNGROUPED_EXECUTION                                                              \n     Output[potential, _col1, _col2]                                                                            \n     \xe2\x94\x82   Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]                                          \n     \xe2\x94\x82   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                                \n     \xe2\x94\x82   _col1 := min_by                                                                                        \n     \xe2\x94\x82   _col2 := min                                                                                           \n     \xe2\x94\x94\xe2\x94\x80 RemoteSource[1]                                                                                         \n            Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]                                       \n                                                                                                                \n Fragment 1 [HASH]                                                                                              \n     Output layout: [potential, min_by, min]                                                                    \n     Output partitioning: SINGLE []                                                                             \n     Stage Execution Strategy: UNGROUPED_EXECUTION                                                              \n     Aggregate(FINAL)[potential]                                                                                \n     \xe2\x94\x82   Layout: [potential:varchar(5), min:bigint, min_by:varchar(7)]                                          \n     \xe2\x94\x82   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                                \n     \xe2\x94\x82   min := min("min_1")                                                                                    \n     \xe2\x94\x82   min_by := min_by("min_by_0")                                                                           \n     \xe2\x94\x94\xe2\x94\x80 LocalExchange[HASH] ("potential")                                                                       \n        \xe2\x94\x82   Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]    \n        \xe2\x94\x82   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                             \n        \xe2\x94\x94\xe2\x94\x80 RemoteSource[2]                                                                                      \n               Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))] \n                                                                                                                \n Fragment 2 [SOURCE]                                                                                            \n     Output layout: [potential, min_1, min_by_0]                                                                \n     Output partitioning: HASH [potential]                                                                      \n     Stage Execution Strategy: UNGROUPED_EXECUTION                                                              \n     Aggregate(PARTIAL)[potential]                                                                              \n     \xe2\x94\x82   Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]       \n     \xe2\x94\x82   min_1 := min("levenshtein_distance")                                                                   \n     \xe2\x94\x82   min_by_0 := min_by("actual", "levenshtein_distance")                                                   \n     \xe2\x94\x94\xe2\x94\x80 Project[]                                                                                               \n        \xe2\x94\x82   Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]                      \n        \xe2\x94\x82   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                             \n        \xe2\x94\x82   levenshtein_distance := levenshtein_distance("potential", "actual")                                 \n        \xe2\x94\x94\xe2\x94\x80 CrossJoin                                                                                            \n           \xe2\x94\x82   Layout: [actual:varchar(7), potential:varchar(5)]                                                \n           \xe2\x94\x82   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                          \n           \xe2\x94\x82   Distribution: REPLICATED                                                                         \n           \xe2\x94\x9c\xe2\x94\x80 TableScan[memory:9, grouped = false]                                                              \n           \xe2\x94\x82      Layout: [actual:varchar(7)]                                                                   \n           \xe2\x94\x82      Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}                                     \n           \xe2\x94\x82      actual := 0                                                                                   \n           \xe2\x94\x94\xe2\x94\x80 LocalExchange[SINGLE] ()                                                                          \n              \xe2\x94\x82   Layout: [potential:varchar(5)]                                                                \n              \xe2\x94\x82   Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: ?}                                      \n              \xe2\x94\x94\xe2\x94\x80 RemoteSource[3]                                                                                \n                     Layout: [potential:varchar(5)]                                                             \n                                                                                                                \n Fragment 3 [SOURCE]                                                                                            \n     Output layout: [potential]                                                                                 \n     Output partitioning: BROADCAST []                                                                          \n     Stage Execution Strategy: UNGROUPED_EXECUTION                                                              \n     TableScan[memory:8, grouped = false]                                                                       \n         Layout: [potential:varchar(5)]                                                                         \n         Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}                                              \n         potential := 0                                                                                         \n                                                                                                                \n                                                                                                                \n(1 row)\n\n
Run Code Online (Sandbox Code Playgroud)\n

特别是,对于本部分,一旦交叉连接生成一行,它就会被输入到投影运算符中,计算两个值之间的编辑距离,然后输入到聚合中,每个“潜在”仅存储一个组用户。因此,该查询所需的内存量应该较低。

\n
     Aggregate(PARTIAL)[potential]                                                                              \n     \xe2\x94\x82   Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]       \n     \xe2\x94\x82   min_1 := min("levenshtein_distance")                                                                   \n     \xe2\x94\x82   min_by_0 := min_by("actual", "levenshtein_distance")                                                   \n     \xe2\x94\x94\xe2\x94\x80 Project[]                                                                                               \n        \xe2\x94\x82   Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]                      \n        \xe2\x94\x82   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                             \n        \xe2\x94\x82   levenshtein_distance := levenshtein_distance("potential", "actual")                                 \n        \xe2\x94\x94\xe2\x94\x80 CrossJoin                                                                                            \n           \xe2\x94\x82   Layout: [actual:varchar(7), potential:varchar(5)]                                                \n           \xe2\x94\x82   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                          \n           \xe2\x94\x82   Distribution: REPLICATED \n
Run Code Online (Sandbox Code Playgroud)\n