python mysql record-linkage python-dedupe
I'm trying to use the Dedupe package to merge a small, messy dataset into a canonical table. Since the canonical table is very large (122 million rows), I can't load it all into memory.
The approach I'm currently using, based on this, takes an entire day on test data: a 300k-row table of messy data stored in a dict, and a 600k-row table of canonical data stored in MySQL. If I do everything in memory (reading the canonical table in as a dict), it only takes half an hour.
Is there a way to make this more efficient?
import itertools

blocked_pairs = block_data(messy_data, canonical_db_cursor, gazetteer)
clustered_dupes = gazetteer.matchBlocks(blocked_pairs, 0)

def block_data(messy_data, c, gazetteer):
    # Group the blocker's output by messy record id, so each messy record
    # arrives together with all of its block keys.
    block_groups = itertools.groupby(gazetteer.blocker(messy_data.viewitems()),
                                     lambda x: x[1])

    for (record_id, block_keys) in block_groups:
        a = [(record_id, messy_data[record_id], set())]

        # Pull every canonical record that shares at least one block key
        # with this messy record.
        c.execute("""SELECT *
                     FROM canonical_table
                     WHERE record_id IN
                         (SELECT DISTINCT record_id
                          FROM blocking_map
                          WHERE block_key IN %s)""",
                  (tuple(block_key for block_key, _ in block_keys),))

        b = [(row['record_id'], row, set()) for row in c]

        if b:
            yield (a, b)
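For context, the blocking_map table queried above has to be populated once from the canonical data. A minimal sketch of that loading step, assuming the table and column names from the example, a dict-style MySQL cursor, and that the same gazetteer blocker is run over the canonical rows (read_cur, write_cur, and the streaming approach are assumptions, not shown in the question):

    # Hypothetical one-time population of blocking_map, streaming canonical
    # rows from MySQL so the 122M-row table never has to fit in memory.
    read_cur.execute("SELECT * FROM canonical_table")
    rows = ((row['record_id'], row) for row in read_cur)  # (id, record) pairs
    for block_key, record_id in gazetteer.blocker(rows):
        write_cur.execute(
            "INSERT INTO blocking_map (block_key, record_id) VALUES (%s, %s)",
            (block_key, record_id))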
Splitting the query into two separate queries sped things up significantly. I'm using MySQL, and all the columns used in the example are indexed...
def block_data(messy_data, c, gazetteer):
    block_groups = itertools.groupby(gazetteer.blocker(messy_data.viewitems()),
                                     lambda x: x[1])

    for (record_id, block_keys) in block_groups:
        a = [(record_id, messy_data[record_id], set())]

        # First query: resolve the block keys to canonical record ids.
        c.execute("""SELECT DISTINCT record_id
                     FROM blocking_map
                     WHERE block_key IN %s""",
                  (tuple(block_key for block_key, _ in block_keys),))

        values = tuple(row['record_id'] for row in c)

        if values:
            # Second query: fetch only those canonical records.
            c.execute("""SELECT *
                         FROM canonical_table
                         WHERE record_id IN %s""",
                      (values,))

            b = [(row['record_id'], row, set())
                 for row in c]

            if b:
                yield (a, b)
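The answer relies on the relevant columns being indexed; the DDL isn't shown, but something along these lines (index names here are illustrative assumptions) is what keeps both lookups cheap:

    # Assumed indexes (names are made up): without an index on block_key the
    # first query scans all of blocking_map; without one on record_id the
    # second query scans the 122M-row canonical table. If record_id is already
    # the primary key, the second index is unnecessary.
    c.execute("CREATE INDEX idx_block_key ON blocking_map (block_key)")
    c.execute("CREATE INDEX idx_record_id ON canonical_table (record_id)")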