thi*_*vdp 2 garbage-collection minhash amazon-emr apache-spark-sql pyspark
在 (name_id, name) 组合的数据帧上调用 Spark 的 MinHashLSH 的 approxSimilarityJoin 时,我遇到了问题。
我尝试解决的问题的摘要:
我有一个包含大约 3000 万个公司名称唯一(name_id、name)组合的数据框。其中一些名称指的是同一家公司,但 (i) 拼写错误,和/或 (ii) 包含其他名称。对每个组合执行模糊字符串匹配是不可能的。为了减少模糊字符串匹配组合的数量,我在 Spark 中使用 MinHashLSH。我的预期方法是使用具有相对较大 Jaccard 阈值的 approxSimilarityJoin(自连接),这样我就能够对匹配的组合运行模糊匹配算法,以进一步改善歧义消除。
我所采取的步骤摘要:
使用的部分代码:
id_col = 'id'
name_col = 'name'
num_hastables = 100
max_jaccard = 0.3
fuzzy_threshold = 90
fuzzy_method = fuzz.token_set_ratio
# Calculate edges using minhash practices
edges = MinHashLSH(inputCol='vectorized_char_lst', outputCol='hashes', numHashTables=num_hastables).\
fit(data).\
approxSimilarityJoin(data, data, max_jaccard).\
select(col('datasetA.'+id_col).alias('src'),
col('datasetA.clean').alias('src_name'),
col('datasetB.'+id_col).alias('dst'),
col('datasetB.clean').alias('dst_name')).\
withColumn('comb', sort_array(array(*('src', 'dst')))).\
dropDuplicates(['comb']).\
rdd.\
filter(lambda x: fuzzy_method(x['src_name'], x['dst_name']) >= fuzzy_threshold if x['src'] != x['dst'] else False).\
toDF().\
drop(*('src_name', 'dst_name', 'comb'))
Run Code Online (Sandbox Code Playgroud)
解释一下计划edges
== Physical Plan ==
*(5) HashAggregate(keys=[datasetA#232, datasetB#263], functions=[])
+- Exchange hashpartitioning(datasetA#232, datasetB#263, 200)
+- *(4) HashAggregate(keys=[datasetA#232, datasetB#263], functions=[])
+- *(4) Project [datasetA#232, datasetB#263]
+- *(4) BroadcastHashJoin [entry#233, hashValue#234], [entry#264, hashValue#265], Inner, BuildRight, (UDF(datasetA#232.vectorized_char_lst, datasetB#263.vectorized_char_lst) < 0.3)
:- *(4) Project [named_struct(id, id#10, name, name#11, clean, clean#90, char_lst, char_lst#95, vectorized_char_lst, vectorized_char_lst#107, hashes, hashes#225) AS datasetA#232, entry#233, hashValue#234]
: +- *(4) Filter isnotnull(hashValue#234)
: +- Generate posexplode(hashes#225), [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107, hashes#225], false, [entry#233, hashValue#234]
: +- *(1) Project [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107, UDF(vectorized_char_lst#107) AS hashes#225]
: +- InMemoryTableScan [char_lst#95, clean#90, id#10, name#11, vectorized_char_lst#107]
: +- InMemoryRelation [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107], StorageLevel(disk, memory, deserialized, 1 replicas)
: +- *(4) Project [id#10, name#11, pythonUDF0#114 AS clean#90, pythonUDF2#116 AS char_lst#95, UDF(pythonUDF2#116) AS vectorized_char_lst#107]
: +- BatchEvalPython [<lambda>(name#11), <lambda>(<lambda>(name#11)), <lambda>(<lambda>(name#11))], [id#10, name#11, pythonUDF0#114, pythonUDF1#115, pythonUDF2#116]
: +- SortAggregate(key=[name#11], functions=[first(id#10, false)])
: +- *(3) Sort [name#11 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(name#11, 200)
: +- SortAggregate(key=[name#11], functions=[partial_first(id#10, false)])
: +- *(2) Sort [name#11 ASC NULLS FIRST], false, 0
: +- Exchange RoundRobinPartitioning(8)
: +- *(1) Filter AtLeastNNulls(n, id#10,name#11)
: +- *(1) FileScan csv [id#10,name#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:<path>, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,name:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, int, false], input[2, vector, true]))
+- *(3) Project [named_struct(id, id#10, name, name#11, clean, clean#90, char_lst, char_lst#95, vectorized_char_lst, vectorized_char_lst#107, hashes, hashes#256) AS datasetB#263, entry#264, hashValue#265]
+- *(3) Filter isnotnull(hashValue#265)
+- Generate posexplode(hashes#256), [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107, hashes#256], false, [entry#264, hashValue#265]
+- *(2) Project [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107, UDF(vectorized_char_lst#107) AS hashes#256]
+- InMemoryTableScan [char_lst#95, clean#90, id#10, name#11, vectorized_char_lst#107]
+- InMemoryRelation [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(4) Project [id#10, name#11, pythonUDF0#114 AS clean#90, pythonUDF2#116 AS char_lst#95, UDF(pythonUDF2#116) AS vectorized_char_lst#107]
+- BatchEvalPython [<lambda>(name#11), <lambda>(<lambda>(name#11)), <lambda>(<lambda>(name#11))], [id#10, name#11, pythonUDF0#114, pythonUDF1#115, pythonUDF2#116]
+- SortAggregate(key=[name#11], functions=[first(id#10, false)])
+- *(3) Sort [name#11 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(name#11, 200)
+- SortAggregate(key=[name#11], functions=[partial_first(id#10, false)])
+- *(2) Sort [name#11 ASC NULLS FIRST], false, 0
+- Exchange RoundRobinPartitioning(8)
+- *(1) Filter AtLeastNNulls(n, id#10,name#11)
+- *(1) FileScan csv [id#10,name#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:<path>, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,name:string>
Run Code Online (Sandbox Code Playgroud)
看起来怎么样data:
+-------+--------------------+--------------------+--------------------+--------------------+
| id| name| clean| char_lst| vectorized_char_lst|
+-------+--------------------+--------------------+--------------------+--------------------+
|3633038|MURATA MACHINERY LTD| MURATA MACHINERY|[M, U, R, A, T, A...|(33,[0,1,2,3,4,5,...|
|3632811|SOCIETE ANONYME D...|SOCIETE ANONYME D...|[S, O, C, I, E, T...|(33,[0,1,2,3,4,5,...|
|3632655|FUJIFILM CORPORATION| FUJIFILM|[F, U, J, I, F, I...|(33,[3,10,12,13,2...|
|3633318|HEINE OPTOTECHNIK...|HEINE OPTOTECHNIK...|[H, E, I, N, E, ...|(33,[0,1,2,3,4,5,...|
|3633523|SUNBEAM PRODUCTS INC| SUNBEAM PRODUCTS|[S, U, N, B, E, A...|(33,[0,1,2,4,5,6,...|
|3633300| HIVAL LTD| HIVAL| [H, I, V, A, L]|(33,[2,3,10,11,21...|
|3632657| NSK LTD| NSK| [N, S, K]|(33,[5,6,16],[1.0...|
|3633240|REHABILITATION IN...|REHABILITATION IN...|[R, E, H, A, B, I...|(33,[0,1,2,3,4,5,...|
|3632732|STUDIENGESELLSCHA...|STUDIENGESELLSCHA...|[S, T, U, D, I, E...|(33,[0,1,2,3,4,5,...|
|3632866|ENERGY CONVERSION...|ENERGY CONVERSION...|[E, N, E, R, G, Y...|(33,[0,1,3,5,6,7,...|
|3632895|ERGENICS POWER SY...|ERGENICS POWER SY...|[E, R, G, E, N, I...|(33,[0,1,3,4,5,6,...|
|3632897| MOLI ENERGY LIMITED| MOLI ENERGY|[M, O, L, I, , E...|(33,[0,1,3,5,7,8,...|
|3633275| NORDSON CORPORATION| NORDSON|[N, O, R, D, S, O...|(33,[5,6,7,8,14],...|
|3633256| PEROXIDCHEMIE GMBH| PEROXIDCHEMIE|[P, E, R, O, X, I...|(33,[0,3,7,8,9,11...|
|3632695| POWER CELL INC| POWER CELL|[P, O, W, E, R, ...|(33,[0,1,7,8,9,10...|
|3633037| ERGENICS INC| ERGENICS|[E, R, G, E, N, I...|(33,[0,3,5,6,8,9,...|
|3632878| FORD MOTOR COMPANY| FORD MOTOR|[F, O, R, D, , M...|(33,[1,4,7,8,13,1...|
|3632573| SAFT AMERICA INC| SAFT AMERICA|[S, A, F, T, , A...|(33,[0,1,2,3,4,6,...|
|3632852|ALCAN INTERNATION...| ALCAN INTERNATIONAL|[A, L, C, A, N, ...|(33,[0,1,2,3,4,5,...|
|3632698| KRUPPKOPPERS GMBH| KRUPPKOPPERS|[K, R, U, P, P, K...|(33,[0,6,7,8,12,1...|
|3633150|ALCAN INTERNATION...| ALCAN INTERNATIONAL|[A, L, C, A, N, ...|(33,[0,1,2,3,4,5,...|
|3632761|AMERICAN TELEPHON...|AMERICAN TELEPHON...|[A, M, E, R, I, C...|(33,[0,1,2,3,4,5,...|
|3632757|HITACHI KOKI COMP...| HITACHI KOKI|[H, I, T, A, C, H...|(33,[1,2,3,4,7,9,...|
|3632836|HUGHES AIRCRAFT C...| HUGHES AIRCRAFT|[H, U, G, H, E, S...|(33,[0,1,2,3,4,6,...|
|3633152| SOSY INC| SOSY| [S, O, S, Y]|(33,[6,7,18],[2.0...|
|3633052|HAMAMATSU PHOTONI...|HAMAMATSU PHOTONI...|[H, A, M, A, M, A...|(33,[1,2,3,4,5,6,...|
|3633450| AKZO NOBEL NV| AKZO NOBEL|[A, K, Z, O, , N...|(33,[0,1,2,5,7,10...|
|3632713| ELTRON RESEARCH INC| ELTRON RESEARCH|[E, L, T, R, O, N...|(33,[0,1,2,4,5,6,...|
|3632533|NEC ELECTRONICS C...| NEC ELECTRONICS|[N, E, C, , E, L...|(33,[0,1,3,4,5,6,...|
|3632562| TARGETTI SANKEY SPA| TARGETTI SANKEY SPA|[T, A, R, G, E, T...|(33,[0,1,2,3,4,5,...|
+-------+--------------------+--------------------+--------------------+--------------------+
only showing top 30 rows
Run Code Online (Sandbox Code Playgroud)
使用的硬件:
使用的 Spark 提交设置:
spark-submit --master yarn --conf "spark.executor.instances=40" --conf "spark.default.parallelism=640" --conf "spark.shuffle.partitions=2000" --conf "spark.executor.cores=4" --conf "spark.executor.memory=14g" --conf "spark.driver.memory=14g" --conf "spark.driver.maxResultSize=14g" --conf "spark.dynamicAllocation.enabled=false" --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 run_disambiguation.py
Run Code Online (Sandbox Code Playgroud)
Web UI 中的任务错误
ExecutorLostFailure (executor 21 exited caused by one of the running tasks) Reason: Slave lost
Run Code Online (Sandbox Code Playgroud)
ExecutorLostFailure (executor 31 exited unrelated to the running tasks) Reason: Container marked as failed: container_1590592506722_0001_02_000002 on host: ip-172-31-47-180.eu-central-1.compute.internal. Exit status: -100. Diagnostics: Container released on a *lost* node.
Run Code Online (Sandbox Code Playgroud)
(部分)执行者日志:
20/05/27 16:29:09 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (25 times so far)
20/05/27 16:29:13 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (26 times so far)
20/05/27 16:29:15 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (28 times so far)
20/05/27 16:29:17 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (0 time so far)
20/05/27 16:29:28 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (27 times so far)
20/05/27 16:29:28 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (26 times so far)
20/05/27 16:29:33 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (29 times so far)
20/05/27 16:29:38 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (1 time so far)
20/05/27 16:29:42 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (27 times so far)
20/05/27 16:29:46 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (28 times so far)
20/05/27 16:29:53 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (30 times so far)
20/05/27 16:29:57 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (2 times so far)
20/05/27 16:30:00 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (28 times so far)
20/05/27 16:30:05 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (29 times so far)
20/05/27 16:30:10 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (31 times so far)
20/05/27 16:30:15 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (3 times so far)
20/05/27 16:30:19 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (29 times so far)
20/05/27 16:30:22 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (30 times so far)
20/05/27 16:30:29 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (32 times so far)
20/05/27 16:30:32 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (4 times so far)
20/05/27 16:30:39 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (31 times so far)
20/05/27 16:30:39 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (30 times so far)
20/05/27 16:30:46 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (33 times so far)
20/05/27 16:30:47 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (5 times so far)
20/05/27 16:30:55 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (32 times so far)
20/05/27 16:30:59 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (31 times so far)
20/05/27 16:31:03 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (34 times so far)
20/05/27 16:31:06 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (6 times so far)
20/05/27 16:31:13 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (33 times so far)
20/05/27 16:31:14 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (32 times so far)
20/05/27 16:31:22 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (35 times so far)
20/05/27 16:31:24 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (7 times so far)
20/05/27 16:31:30 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (34 times so far)
20/05/27 16:31:32 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (33 times so far)
20/05/27 16:31:41 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (36 times so far)
20/05/27 16:31:44 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (8 times so far)
20/05/27 16:31:47 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (35 times so far)
20/05/27 16:31:48 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (34 times so far)
20/05/27 16:32:02 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (37 times so far)
20/05/27 16:32:03 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (9 times so far)
20/05/27 16:32:04 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (36 times so far)
20/05/27 16:32:08 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (35 times so far)
20/05/27 16:32:19 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (38 times so far)
20/05/27 16:32:20 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (37 times so far)
20/05/27 16:32:21 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (10 times so far)
20/05/27 16:32:26 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (36 times so far)
20/05/27 16:32:37 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (39 times so far)
20/05/27 16:32:37 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (11 times so far)
20/05/27 16:32:38 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (38 times so far)
20/05/27 16:32:45 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (37 times so far)
20/05/27 16:32:51 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (40 times so far)
20/05/27 16:32:56 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (12 times so far)
20/05/27 16:32:58 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (39 times so far)
20/05/27 16:33:03 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (38 times so far)
20/05/27 16:33:08 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (41 times so far)
20/05/27 16:33:13 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (13 times so far)
20/05/27 16:33:15 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (40 times so far)
20/05/27 16:33:20 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (39 times so far)
20/05/27 16:33:26 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (42 times so far)
20/05/27 16:33:30 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (41 times so far)
20/05/27 16:33:31 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (14 times so far)
20/05/27 16:33:36 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (40 times so far)
20/05/27 16:33:46 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1992.0 MB to disk (43 times so far)
20/05/27 16:33:47 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (42 times so far)
20/05/27 16:33:51 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (15 times so far)
20/05/27 16:33:54 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (41 times so far)
20/05/27 16:34:03 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1992.0 MB to disk (43 times so far)
20/05/27 16:34:04 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1992.0 MB to disk (44 times so far)
20/05/27 16:34:08 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (16 times so far)
20/05/27 16:34:14 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (42 times so far)
20/05/27 16:34:16 INFO PythonUDFRunner: Times: total = 774701, boot = 3, init = 10, finish = 774688
20/05/27 16:34:21 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1992.0 MB to disk (44 times so far)
20/05/27 16:34:22 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (17 times so far)
20/05/27 16:34:30 INFO PythonUDFRunner: Times: total = 773372, boot = 2, init = 9, finish = 773361
20/05/27 16:34:32 INFO ShuffleExternalSorter:
@lokk3r 的回答确实帮助我找到了正确的方向。然而,在我能够无错误地运行该程序之前,我还必须做一些其他事情。我将分享它们来帮助遇到类似问题的人:
NGrams按照 @lokk3r 的建议使用而不是仅使用单个字符,以避免 MinHashLSH 算法内出现极端的数据偏差。当使用 4-grams 时,data看起来像:+------------------------------+-------+------------------------------+------------------------------+------------------------------+
| name| id| clean| ng_char_lst| vectorized_char_lst|
+------------------------------+-------+------------------------------+------------------------------+------------------------------+
| SOCIETE ANONYME DITE SAFT|3632811| SOCIETE ANONYME DITE SAFT|[ S O C, S O C I, O C I E,...|(1332,[64,75,82,84,121,223,...|
| MURATA MACHINERY LTD|3633038| MURATA MACHINERY|[ M U R, M U R A, U R A T,...|(1332,[55,315,388,437,526,5...|
|HEINE OPTOTECHNIK GMBH AND ...|3633318| HEINE OPTOTECHNIK GMBH AND|[ H E I, H E I N, E I N E,...|(1332,[23,72,216,221,229,34...|
| FUJIFILM CORPORATION|3632655| FUJIFILM|[ F U J, F U J I, U J I F,...|(1332,[157,179,882,1028],[1...|
| SUNBEAM PRODUCTS INC|3633523| SUNBEAM PRODUCTS|[ S U N, S U N B, U N B E,...|(1332,[99,137,165,175,187,1...|
| STUDIENGESELLSCHAFT KOHLE MBH|3632732| STUDIENGESELLSCHAFT KOHLE MBH|[ S T U, S T U D, T U D I,...|(1332,[13,14,23,25,43,52,57...|
|REHABILITATION INSTITUTE OF...|3633240|REHABILITATION INSTITUTE OF...|[ R E H, R E H A, E H A B,...|(1332,[20,44,51,118,308,309...|
| NORDSON CORPORATION|3633275| NORDSON|[ N O R, N O R D, O R D S,...|(1332,[45,88,582,1282],[1.0...|
| ENERGY CONVERSION DEVICES|3632866| ENERGY CONVERSION DEVICES|[ E N E, E N E R, N E R G,...|(1332,[54,76,81,147,202,224...|
| MOLI ENERGY LIMITED|3632897| MOLI ENERGY|[ M O L, M O L I, O L I ,...|(1332,[438,495,717,756,1057...|
| ERGENICS POWER SYSTEMS INC|3632895| ERGENICS POWER SYSTEMS|[ E R G, E R G E, R G E N,...|(1332,[6,10,18,21,24,35,375...|
| POWER CELL INC|3632695| POWER CELL|[ P O W, P O W E, O W E R,...|(1332,[6,10,18,35,126,169,3...|
| PEROXIDCHEMIE GMBH|3633256| PEROXIDCHEMIE|[ P E R, P E R O, E R O X,...|(1332,[326,450,532,889,1073...|
| FORD MOTOR COMPANY|3632878| FORD MOTOR|[ F O R, F O R D, O R D ,...|(1332,[156,158,186,200,314,...|
| ERGENICS INC|3633037| ERGENICS|[ E R G, E R G E, R G E N,...|(1332,[375,642,812,866,1269...|
| SAFT AMERICA INC|3632573| SAFT AMERICA|[ S A F, S A F T, A F T ,...|(1332,[498,552,1116],[1.0,1...|
| ALCAN INTERNATIONAL LIMITED|3632598| ALCAN INTERNATIONAL|[ A L C, A L C A, L C A N,...|(1332,[20,434,528,549,571,7...|
| KRUPPKOPPERS GMBH|3632698| KRUPPKOPPERS|[ K R U, K R U P, R U P P,...|(1332,[664,795,798,1010,114...|
| HUGHES AIRCRAFT COMPANY|3632752| HUGHES AIRCRAFT|[ H U G, H U G H, U G H E,...|(1332,[605,632,705,758,807,...|
|AMERICAN TELEPHONE AND TELE...|3632761|AMERICAN TELEPHONE AND TELE...|[ A M E, A M E R, M E R I,...|(1332,[19,86,91,126,128,134...|
+------------------------------+-------+------------------------------+------------------------------+------------------------------+
Run Code Online (Sandbox Code Playgroud)
请注意,我在名称上添加了前导和尾随空格,以确保名称中的单词顺序对于NGrams: 'XX YY'has 3-grams'XX ', 'X Y', ' YY'和 while 'YY XX'has 3-grams来说并不重要'YY ', 'Y X', ' XX'。这意味着两者共享 6 个唯一值中的 0 个NGrams。如果我们使用前导和尾随空格:' XX YY 'has 3-grams ' XX', 'XX ', 'X Y', ' YY', 'YY ',而' YY XX 'has 3-grams ' YY', 'YY ', 'Y X', ' XX', 'XX '。这意味着两者共有 6 个中的 4 个是唯一的NGrams。这意味着在 MinHashLSH 期间两条记录在同一个存储桶中结束的可能性要大得多。
我尝试了n- 输入参数的不同值NGrams。我发现 和n=2仍然n=3给出了如此多的数据偏差,以至于一些 Spark 作业花费了太长时间,而其他作业则在几秒钟内完成。所以你最终会在程序继续之前等待很久。我现在使用n=4,这仍然会产生很大的偏差,但它是可行的。
为了进一步减少数据倾斜的影响,我NGrams在CountVectorizerSpark 的方法中使用了一些过于(不)频繁发生的附加过滤。我已进行设置minDF=2,使其过滤掉NGrams仅出现在单个名称中的内容。我这样做是因为你无法根据NGram仅出现在一个名称中的 a 来匹配这些名称。此外,我还进行了设置maxDF=0.001,以过滤掉NGrams出现在超过 0.1% 的名称中的内容。这意味着大约 3000 万个名称NGrams中出现频率高于 30000 个名称的名称将被过滤掉。我认为出现得太频繁NGram不会提供关于哪些名称可以匹配的有用信息。
我通过过滤掉非拉丁(扩展)名称,将唯一名称的数量(首先是 3000 万)减少到 1500 万。我注意到(例如阿拉伯语和中文)字符也导致数据出现很大偏差。由于我主要对消除这些公司名称的歧义不感兴趣,因此我从数据集中忽略了它们。我使用以下正则表达式匹配进行过滤:
re.fullmatch('[\u0020-\u007F\u00A0-\u00FF\u0100-\u017F\u0180-\u024F]+'.encode(), string_to_filter.encode())
Run Code Online (Sandbox Code Playgroud)
这是一个有点直接的建议,但我因为没有看到它而遇到了一些问题。确保在将数据集提供给算法之前对数据集运行过滤器,以过滤掉由于设置或仅仅因为名称较小而MinHashLSH没有剩余的记录。显然这不适用于该算法。NGramsminDFmaxDFMinHashLSH
最后,关于命令的设置spark-submit和EMR集群的硬件设置,我发现我不需要像论坛上的一些答案所建议的那样更大的集群。所有上述更改使程序在集群上完美运行,并使用我原始帖子中提供的设置。减少spark.shuffle.partitions、spark.driver.memory和spark.driver.maxResultSize大大提高了程序的运行时间。我提交的内容spark-submit是:
spark-submit --master yarn --conf "spark.executor.instances=40" --conf "spark.default.parallelism=640" --conf "spark.executor.cores=4" --conf "spark.executor.memory=12g" --conf "spark.driver.memory=8g" --conf "spark.driver.maxResultSize=8g" --conf "spark.dynamicAllocation.enabled=false" --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 run_disambiguation.py
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
2096 次 |
| 最近记录: |