所有执行器均已死亡 MinHash LSH PySpark approxSimilarityJoin EMR 集群上的自连接

thi*_*vdp 2 garbage-collection minhash amazon-emr apache-spark-sql pyspark

在 (name_id, name) 组合的数据帧上调用 Spark 的 MinHashLSH 的 approxSimilarityJoin 时,我遇到了问题。

我尝试解决的问题的摘要:

我有一个包含大约 3000 万个公司名称唯一(name_id、name)组合的数据框。其中一些名称指的是同一家公司,但 (i) 拼写错误,和/或 (ii) 包含其他名称。对每个组合执行模糊字符串匹配是不可能的。为了减少模糊字符串匹配组合的数量,我在 Spark 中使用 MinHashLSH。我的预期方法是使用具有相对较大 Jaccard 阈值的 approxSimilarityJoin(自连接),这样我就能够对匹配的组合运行模糊匹配算法,以进一步改善歧义消除。

我所采取的步骤摘要:

  1. 使用 CountVectorizer 为每个名称创建字符计数向量,
  2. 使用 MinHashLSH 及其 approxSimilarityJoin 并进行以下设置:
    • 哈希表数=100
    • 阈值=0.3(approxSimilarityJoin 的杰卡德阈值)
  3. 在 approxSimilarityJoin 之后,我删除重复的组合(其中认为存在匹配的组合 (i,j) 和 (j,i),然后删除 (j,i))
  4. 删除重复的组合后,我使用 FuzzyWuzzy 包运行模糊字符串匹配算法,以减少记录数量并提高名称的歧义性。
  5. 最终,我在剩余的边 (i,j) 上运行连接组件算法,以匹配哪些公司名称属于一起。

使用的部分代码:

    id_col = 'id'
    name_col = 'name'
    num_hastables = 100
    max_jaccard = 0.3
    fuzzy_threshold = 90
    fuzzy_method = fuzz.token_set_ratio

    # Calculate edges using minhash practices
    edges = MinHashLSH(inputCol='vectorized_char_lst', outputCol='hashes', numHashTables=num_hastables).\
        fit(data).\
        approxSimilarityJoin(data, data, max_jaccard).\
        select(col('datasetA.'+id_col).alias('src'),
               col('datasetA.clean').alias('src_name'),
               col('datasetB.'+id_col).alias('dst'),
               col('datasetB.clean').alias('dst_name')).\
        withColumn('comb', sort_array(array(*('src', 'dst')))).\
        dropDuplicates(['comb']).\
        rdd.\
        filter(lambda x: fuzzy_method(x['src_name'], x['dst_name']) >= fuzzy_threshold if x['src'] != x['dst'] else False).\
        toDF().\
        drop(*('src_name', 'dst_name', 'comb'))
Run Code Online (Sandbox Code Playgroud)

解释一下计划edges

== Physical Plan ==
*(5) HashAggregate(keys=[datasetA#232, datasetB#263], functions=[])
+- Exchange hashpartitioning(datasetA#232, datasetB#263, 200)
   +- *(4) HashAggregate(keys=[datasetA#232, datasetB#263], functions=[])
      +- *(4) Project [datasetA#232, datasetB#263]
         +- *(4) BroadcastHashJoin [entry#233, hashValue#234], [entry#264, hashValue#265], Inner, BuildRight, (UDF(datasetA#232.vectorized_char_lst, datasetB#263.vectorized_char_lst) < 0.3)
            :- *(4) Project [named_struct(id, id#10, name, name#11, clean, clean#90, char_lst, char_lst#95, vectorized_char_lst, vectorized_char_lst#107, hashes, hashes#225) AS datasetA#232, entry#233, hashValue#234]
            :  +- *(4) Filter isnotnull(hashValue#234)
            :     +- Generate posexplode(hashes#225), [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107, hashes#225], false, [entry#233, hashValue#234]
            :        +- *(1) Project [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107, UDF(vectorized_char_lst#107) AS hashes#225]
            :           +- InMemoryTableScan [char_lst#95, clean#90, id#10, name#11, vectorized_char_lst#107]
            :                 +- InMemoryRelation [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107], StorageLevel(disk, memory, deserialized, 1 replicas)
            :                       +- *(4) Project [id#10, name#11, pythonUDF0#114 AS clean#90, pythonUDF2#116 AS char_lst#95, UDF(pythonUDF2#116) AS vectorized_char_lst#107]
            :                          +- BatchEvalPython [<lambda>(name#11), <lambda>(<lambda>(name#11)), <lambda>(<lambda>(name#11))], [id#10, name#11, pythonUDF0#114, pythonUDF1#115, pythonUDF2#116]
            :                             +- SortAggregate(key=[name#11], functions=[first(id#10, false)])
            :                                +- *(3) Sort [name#11 ASC NULLS FIRST], false, 0
            :                                   +- Exchange hashpartitioning(name#11, 200)
            :                                      +- SortAggregate(key=[name#11], functions=[partial_first(id#10, false)])
            :                                         +- *(2) Sort [name#11 ASC NULLS FIRST], false, 0
            :                                            +- Exchange RoundRobinPartitioning(8)
            :                                               +- *(1) Filter AtLeastNNulls(n, id#10,name#11)
            :                                                  +- *(1) FileScan csv [id#10,name#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:<path>, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,name:string>
            +- BroadcastExchange HashedRelationBroadcastMode(List(input[1, int, false], input[2, vector, true]))
               +- *(3) Project [named_struct(id, id#10, name, name#11, clean, clean#90, char_lst, char_lst#95, vectorized_char_lst, vectorized_char_lst#107, hashes, hashes#256) AS datasetB#263, entry#264, hashValue#265]
                  +- *(3) Filter isnotnull(hashValue#265)
                     +- Generate posexplode(hashes#256), [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107, hashes#256], false, [entry#264, hashValue#265]
                        +- *(2) Project [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107, UDF(vectorized_char_lst#107) AS hashes#256]
                           +- InMemoryTableScan [char_lst#95, clean#90, id#10, name#11, vectorized_char_lst#107]
                                 +- InMemoryRelation [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107], StorageLevel(disk, memory, deserialized, 1 replicas)
                                       +- *(4) Project [id#10, name#11, pythonUDF0#114 AS clean#90, pythonUDF2#116 AS char_lst#95, UDF(pythonUDF2#116) AS vectorized_char_lst#107]
                                          +- BatchEvalPython [<lambda>(name#11), <lambda>(<lambda>(name#11)), <lambda>(<lambda>(name#11))], [id#10, name#11, pythonUDF0#114, pythonUDF1#115, pythonUDF2#116]
                                             +- SortAggregate(key=[name#11], functions=[first(id#10, false)])
                                                +- *(3) Sort [name#11 ASC NULLS FIRST], false, 0
                                                   +- Exchange hashpartitioning(name#11, 200)
                                                      +- SortAggregate(key=[name#11], functions=[partial_first(id#10, false)])
                                                         +- *(2) Sort [name#11 ASC NULLS FIRST], false, 0
                                                            +- Exchange RoundRobinPartitioning(8)
                                                               +- *(1) Filter AtLeastNNulls(n, id#10,name#11)
                                                                  +- *(1) FileScan csv [id#10,name#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:<path>, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,name:string>

Run Code Online (Sandbox Code Playgroud)

看起来怎么样data

+-------+--------------------+--------------------+--------------------+--------------------+
|     id|                name|               clean|            char_lst| vectorized_char_lst|
+-------+--------------------+--------------------+--------------------+--------------------+
|3633038|MURATA MACHINERY LTD|    MURATA MACHINERY|[M, U, R, A, T, A...|(33,[0,1,2,3,4,5,...|
|3632811|SOCIETE ANONYME D...|SOCIETE ANONYME D...|[S, O, C, I, E, T...|(33,[0,1,2,3,4,5,...|
|3632655|FUJIFILM CORPORATION|            FUJIFILM|[F, U, J, I, F, I...|(33,[3,10,12,13,2...|
|3633318|HEINE OPTOTECHNIK...|HEINE OPTOTECHNIK...|[H, E, I, N, E,  ...|(33,[0,1,2,3,4,5,...|
|3633523|SUNBEAM PRODUCTS INC|    SUNBEAM PRODUCTS|[S, U, N, B, E, A...|(33,[0,1,2,4,5,6,...|
|3633300|           HIVAL LTD|               HIVAL|     [H, I, V, A, L]|(33,[2,3,10,11,21...|
|3632657|             NSK LTD|                 NSK|           [N, S, K]|(33,[5,6,16],[1.0...|
|3633240|REHABILITATION IN...|REHABILITATION IN...|[R, E, H, A, B, I...|(33,[0,1,2,3,4,5,...|
|3632732|STUDIENGESELLSCHA...|STUDIENGESELLSCHA...|[S, T, U, D, I, E...|(33,[0,1,2,3,4,5,...|
|3632866|ENERGY CONVERSION...|ENERGY CONVERSION...|[E, N, E, R, G, Y...|(33,[0,1,3,5,6,7,...|
|3632895|ERGENICS POWER SY...|ERGENICS POWER SY...|[E, R, G, E, N, I...|(33,[0,1,3,4,5,6,...|
|3632897| MOLI ENERGY LIMITED|         MOLI ENERGY|[M, O, L, I,  , E...|(33,[0,1,3,5,7,8,...|
|3633275| NORDSON CORPORATION|             NORDSON|[N, O, R, D, S, O...|(33,[5,6,7,8,14],...|
|3633256|  PEROXIDCHEMIE GMBH|       PEROXIDCHEMIE|[P, E, R, O, X, I...|(33,[0,3,7,8,9,11...|
|3632695|      POWER CELL INC|          POWER CELL|[P, O, W, E, R,  ...|(33,[0,1,7,8,9,10...|
|3633037|        ERGENICS INC|            ERGENICS|[E, R, G, E, N, I...|(33,[0,3,5,6,8,9,...|
|3632878|  FORD MOTOR COMPANY|          FORD MOTOR|[F, O, R, D,  , M...|(33,[1,4,7,8,13,1...|
|3632573|    SAFT AMERICA INC|        SAFT AMERICA|[S, A, F, T,  , A...|(33,[0,1,2,3,4,6,...|
|3632852|ALCAN INTERNATION...| ALCAN INTERNATIONAL|[A, L, C, A, N,  ...|(33,[0,1,2,3,4,5,...|
|3632698|   KRUPPKOPPERS GMBH|        KRUPPKOPPERS|[K, R, U, P, P, K...|(33,[0,6,7,8,12,1...|
|3633150|ALCAN INTERNATION...| ALCAN INTERNATIONAL|[A, L, C, A, N,  ...|(33,[0,1,2,3,4,5,...|
|3632761|AMERICAN TELEPHON...|AMERICAN TELEPHON...|[A, M, E, R, I, C...|(33,[0,1,2,3,4,5,...|
|3632757|HITACHI KOKI COMP...|        HITACHI KOKI|[H, I, T, A, C, H...|(33,[1,2,3,4,7,9,...|
|3632836|HUGHES AIRCRAFT C...|     HUGHES AIRCRAFT|[H, U, G, H, E, S...|(33,[0,1,2,3,4,6,...|
|3633152|            SOSY INC|                SOSY|        [S, O, S, Y]|(33,[6,7,18],[2.0...|
|3633052|HAMAMATSU PHOTONI...|HAMAMATSU PHOTONI...|[H, A, M, A, M, A...|(33,[1,2,3,4,5,6,...|
|3633450|       AKZO NOBEL NV|          AKZO NOBEL|[A, K, Z, O,  , N...|(33,[0,1,2,5,7,10...|
|3632713| ELTRON RESEARCH INC|     ELTRON RESEARCH|[E, L, T, R, O, N...|(33,[0,1,2,4,5,6,...|
|3632533|NEC ELECTRONICS C...|     NEC ELECTRONICS|[N, E, C,  , E, L...|(33,[0,1,3,4,5,6,...|
|3632562| TARGETTI SANKEY SPA| TARGETTI SANKEY SPA|[T, A, R, G, E, T...|(33,[0,1,2,3,4,5,...|
+-------+--------------------+--------------------+--------------------+--------------------+
only showing top 30 rows
Run Code Online (Sandbox Code Playgroud)

使用的硬件:

  1. 主节点:m5.2xlarge 8 vCore,32 GiB 内存,仅 EBS 存储 EBS 存储:128 GiB
  2. 从节点 (10x):m5.4xlarge 16 vCore、64 GiB 内存、仅 EBS 存储 EBS 存储:500 GiB

使用的 Spark 提交设置:

spark-submit --master yarn --conf "spark.executor.instances=40" --conf "spark.default.parallelism=640" --conf "spark.shuffle.partitions=2000" --conf "spark.executor.cores=4" --conf "spark.executor.memory=14g" --conf "spark.driver.memory=14g" --conf "spark.driver.maxResultSize=14g" --conf "spark.dynamicAllocation.enabled=false" --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 run_disambiguation.py
Run Code Online (Sandbox Code Playgroud)

Web UI 中的任务错误

ExecutorLostFailure (executor 21 exited caused by one of the running tasks) Reason: Slave lost
Run Code Online (Sandbox Code Playgroud)
ExecutorLostFailure (executor 31 exited unrelated to the running tasks) Reason: Container marked as failed: container_1590592506722_0001_02_000002 on host: ip-172-31-47-180.eu-central-1.compute.internal. Exit status: -100. Diagnostics: Container released on a *lost* node.
Run Code Online (Sandbox Code Playgroud)

(部分)执行者日志:


20/05/27 16:29:09 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (25  times so far)
20/05/27 16:29:13 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (26  times so far)
20/05/27 16:29:15 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (28  times so far)
20/05/27 16:29:17 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (0  time so far)
20/05/27 16:29:28 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (27  times so far)
20/05/27 16:29:28 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (26  times so far)
20/05/27 16:29:33 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (29  times so far)
20/05/27 16:29:38 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (1  time so far)
20/05/27 16:29:42 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (27  times so far)
20/05/27 16:29:46 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (28  times so far)
20/05/27 16:29:53 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (30  times so far)
20/05/27 16:29:57 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (2  times so far)
20/05/27 16:30:00 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (28  times so far)
20/05/27 16:30:05 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (29  times so far)
20/05/27 16:30:10 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (31  times so far)
20/05/27 16:30:15 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (3  times so far)
20/05/27 16:30:19 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (29  times so far)
20/05/27 16:30:22 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (30  times so far)
20/05/27 16:30:29 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (32  times so far)
20/05/27 16:30:32 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (4  times so far)
20/05/27 16:30:39 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (31  times so far)
20/05/27 16:30:39 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (30  times so far)
20/05/27 16:30:46 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (33  times so far)
20/05/27 16:30:47 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (5  times so far)
20/05/27 16:30:55 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (32  times so far)
20/05/27 16:30:59 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (31  times so far)
20/05/27 16:31:03 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (34  times so far)
20/05/27 16:31:06 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (6  times so far)
20/05/27 16:31:13 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (33  times so far)
20/05/27 16:31:14 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (32  times so far)
20/05/27 16:31:22 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (35  times so far)
20/05/27 16:31:24 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (7  times so far)
20/05/27 16:31:30 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (34  times so far)
20/05/27 16:31:32 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (33  times so far)
20/05/27 16:31:41 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (36  times so far)
20/05/27 16:31:44 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (8  times so far)
20/05/27 16:31:47 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (35  times so far)
20/05/27 16:31:48 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (34  times so far)
20/05/27 16:32:02 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (37  times so far)
20/05/27 16:32:03 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (9  times so far)
20/05/27 16:32:04 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (36  times so far)
20/05/27 16:32:08 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (35  times so far)
20/05/27 16:32:19 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (38  times so far)
20/05/27 16:32:20 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (37  times so far)
20/05/27 16:32:21 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (10  times so far)
20/05/27 16:32:26 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (36  times so far)
20/05/27 16:32:37 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (39  times so far)
20/05/27 16:32:37 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (11  times so far)
20/05/27 16:32:38 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (38  times so far)
20/05/27 16:32:45 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (37  times so far)
20/05/27 16:32:51 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (40  times so far)
20/05/27 16:32:56 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (12  times so far)
20/05/27 16:32:58 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (39  times so far)
20/05/27 16:33:03 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (38  times so far)
20/05/27 16:33:08 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (41  times so far)
20/05/27 16:33:13 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (13  times so far)
20/05/27 16:33:15 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (40  times so far)
20/05/27 16:33:20 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (39  times so far)
20/05/27 16:33:26 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (42  times so far)
20/05/27 16:33:30 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (41  times so far)
20/05/27 16:33:31 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (14  times so far)
20/05/27 16:33:36 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (40  times so far)
20/05/27 16:33:46 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1992.0 MB to disk (43  times so far)
20/05/27 16:33:47 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (42  times so far)
20/05/27 16:33:51 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (15  times so far)
20/05/27 16:33:54 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (41  times so far)
20/05/27 16:34:03 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1992.0 MB to disk (43  times so far)
20/05/27 16:34:04 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1992.0 MB to disk (44  times so far)
20/05/27 16:34:08 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (16  times so far)
20/05/27 16:34:14 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (42  times so far)
20/05/27 16:34:16 INFO PythonUDFRunner: Times: total = 774701, boot = 3, init = 10, finish = 774688
20/05/27 16:34:21 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1992.0 MB to disk (44  times so far)
20/05/27 16:34:22 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (17  times so far)
20/05/27 16:34:30 INFO PythonUDFRunner: Times: total = 773372, boot = 2, init = 9, finish = 773361
20/05/27 16:34:32 INFO ShuffleExternalSorter:

thi*_*vdp 5

@lokk3r 的回答确实帮助我找到了正确的方向。然而,在我能够无错误地运行该程序之前,我还必须做一些其他事情。我将分享它们来帮助遇到类似问题的人:

  • 首先,我NGrams按照 @lokk3r 的建议使用而不是仅使用单个字符,以避免 MinHashLSH 算法内出现极端的数据偏差。当使用 4-grams 时,data看起来像:
+------------------------------+-------+------------------------------+------------------------------+------------------------------+
|                          name|     id|                         clean|                   ng_char_lst|           vectorized_char_lst|
+------------------------------+-------+------------------------------+------------------------------+------------------------------+
|     SOCIETE ANONYME DITE SAFT|3632811|     SOCIETE ANONYME DITE SAFT|[  S O C, S O C I, O C I E,...|(1332,[64,75,82,84,121,223,...|
|          MURATA MACHINERY LTD|3633038|              MURATA MACHINERY|[  M U R, M U R A, U R A T,...|(1332,[55,315,388,437,526,5...|
|HEINE OPTOTECHNIK GMBH AND ...|3633318|    HEINE OPTOTECHNIK GMBH AND|[  H E I, H E I N, E I N E,...|(1332,[23,72,216,221,229,34...|
|          FUJIFILM CORPORATION|3632655|                      FUJIFILM|[  F U J, F U J I, U J I F,...|(1332,[157,179,882,1028],[1...|
|          SUNBEAM PRODUCTS INC|3633523|              SUNBEAM PRODUCTS|[  S U N, S U N B, U N B E,...|(1332,[99,137,165,175,187,1...|
| STUDIENGESELLSCHAFT KOHLE MBH|3632732| STUDIENGESELLSCHAFT KOHLE MBH|[  S T U, S T U D, T U D I,...|(1332,[13,14,23,25,43,52,57...|
|REHABILITATION INSTITUTE OF...|3633240|REHABILITATION INSTITUTE OF...|[  R E H, R E H A, E H A B,...|(1332,[20,44,51,118,308,309...|
|           NORDSON CORPORATION|3633275|                       NORDSON|[  N O R, N O R D, O R D S,...|(1332,[45,88,582,1282],[1.0...|
|     ENERGY CONVERSION DEVICES|3632866|     ENERGY CONVERSION DEVICES|[  E N E, E N E R, N E R G,...|(1332,[54,76,81,147,202,224...|
|           MOLI ENERGY LIMITED|3632897|                   MOLI ENERGY|[  M O L, M O L I, O L I  ,...|(1332,[438,495,717,756,1057...|
|    ERGENICS POWER SYSTEMS INC|3632895|        ERGENICS POWER SYSTEMS|[  E R G, E R G E, R G E N,...|(1332,[6,10,18,21,24,35,375...|
|                POWER CELL INC|3632695|                    POWER CELL|[  P O W, P O W E, O W E R,...|(1332,[6,10,18,35,126,169,3...|
|            PEROXIDCHEMIE GMBH|3633256|                 PEROXIDCHEMIE|[  P E R, P E R O, E R O X,...|(1332,[326,450,532,889,1073...|
|            FORD MOTOR COMPANY|3632878|                    FORD MOTOR|[  F O R, F O R D, O R D  ,...|(1332,[156,158,186,200,314,...|
|                  ERGENICS INC|3633037|                      ERGENICS|[  E R G, E R G E, R G E N,...|(1332,[375,642,812,866,1269...|
|              SAFT AMERICA INC|3632573|                  SAFT AMERICA|[  S A F, S A F T, A F T  ,...|(1332,[498,552,1116],[1.0,1...|
|   ALCAN INTERNATIONAL LIMITED|3632598|           ALCAN INTERNATIONAL|[  A L C, A L C A, L C A N,...|(1332,[20,434,528,549,571,7...|
|             KRUPPKOPPERS GMBH|3632698|                  KRUPPKOPPERS|[  K R U, K R U P, R U P P,...|(1332,[664,795,798,1010,114...|
|       HUGHES AIRCRAFT COMPANY|3632752|               HUGHES AIRCRAFT|[  H U G, H U G H, U G H E,...|(1332,[605,632,705,758,807,...|
|AMERICAN TELEPHONE AND TELE...|3632761|AMERICAN TELEPHONE AND TELE...|[  A M E, A M E R, M E R I,...|(1332,[19,86,91,126,128,134...|
+------------------------------+-------+------------------------------+------------------------------+------------------------------+
Run Code Online (Sandbox Code Playgroud)

请注意,我在名称上添加了前导和尾随空格,以确保名称中的单词顺序对于NGrams: 'XX YY'has 3-grams'XX ', 'X Y', ' YY'和 while 'YY XX'has 3-grams来说并不重要'YY ', 'Y X', ' XX'。这意味着两者共享 6 个唯一值中的 0 个NGrams。如果我们使用前导和尾随空格:' XX YY 'has 3-grams ' XX', 'XX ', 'X Y', ' YY', 'YY ',而' YY XX 'has 3-grams ' YY', 'YY ', 'Y X', ' XX', 'XX '。这意味着两者共有 6 个中的 4 个是唯一的NGrams。这意味着在 MinHashLSH 期间两条记录在同一个存储桶中结束的可能性要大得多。

  • 我尝试了n- 输入参数的不同值NGrams。我发现 和n=2仍然n=3给出了如此多的数据偏差,以至于一些 Spark 作业花费了太长时间,而其他作业则在几秒钟内完成。所以你最终会在程序继续之前等待很久。我现在使用n=4,这仍然会产生很大的偏差,但它是可行的。

  • 为了进一步减少数据倾斜的影响,我NGramsCountVectorizerSpark 的方法中使用了一些过于(不)频繁发生的附加过滤。我已进行设置minDF=2,使其过滤掉NGrams仅出现在单个名称中的内容。我这样做是因为你无法根据NGram仅出现在一个名称中的 a 来匹配这些名称。此外,我还进行了设置maxDF=0.001,以过滤掉NGrams出现在超过 0.1% 的名称中的内容。这意味着大约 3000 万个名称NGrams中出现频率高于 30000 个名称的名称将被过滤掉。我认为出现得太频繁NGram不会提供关于哪些名称可以匹配的有用信息。

  • 我通过过滤掉非拉丁(扩展)名称,将唯一名称的数量(首先是 3000 万)减少到 1500 万。我注意到(例如阿拉伯语和中文)字符也导致数据出现很大偏差。由于我主要对消除这些公司名称的歧义不感兴趣,因此我从数据集中忽略了它们。我使用以下正则表达式匹配进行过滤:

re.fullmatch('[\u0020-\u007F\u00A0-\u00FF\u0100-\u017F\u0180-\u024F]+'.encode(), string_to_filter.encode())
Run Code Online (Sandbox Code Playgroud)
  • 这是一个有点直接的建议,但我因为没有看到它而遇到了一些问题。确保在将数据集提供给算法之前对数据集运行过滤器,以过滤掉由于设置或仅仅因为名称较小而MinHashLSH没有剩余的记录。显然这不适用于该算法。NGramsminDFmaxDFMinHashLSH

  • 最后,关于命令的设置spark-submit和EMR集群的硬件设置,我发现我不需要像论坛上的一些答案所建议的那样更大的集群。所有上述更改使程序在集群上完美运行,并使用我原始帖子中提供的设置。减少spark.shuffle.partitionsspark.driver.memoryspark.driver.maxResultSize大大提高了程序的运行时间。我提交的内容spark-submit是:

spark-submit --master yarn --conf "spark.executor.instances=40" --conf "spark.default.parallelism=640" --conf "spark.executor.cores=4" --conf "spark.executor.memory=12g" --conf "spark.driver.memory=8g" --conf "spark.driver.maxResultSize=8g" --conf "spark.dynamicAllocation.enabled=false" --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 run_disambiguation.py
Run Code Online (Sandbox Code Playgroud)