ner*_*erd 1 python python-3.x fuzzywuzzy rapidfuzz
I want to run this rapidfuzz code, mentioned in this post, on a list containing 200,000 elements. What is the best way to optimize it so that it runs faster on a GPU?
import pandas as pd
from rapidfuzz import fuzz

elements = ['vikash', 'vikas', 'Vinod', 'Vikky', 'Akash', 'Vinodh', 'Sachin', 'Salman', 'Ajay', 'Suchin', 'Akash', 'vikahs']

# One row per name: [name, list of duplicates, duplicate count]
results = [[name, [], 0] for name in elements]

for (i, element) in enumerate(elements):
    # Compare each name only with the names after it, so every pair is scored once
    for (j, choice) in enumerate(elements[i+1:]):
        # fuzz.ratio returns 0 when the score is below score_cutoff
        if fuzz.ratio(element, choice, score_cutoff=90):
            results[i][2] += 1
            results[i][1].append(choice)
            results[j+i+1][2] += 1
            results[j+i+1][1].append(element)

data = pd.DataFrame(results, columns=['name', 'duplicates', 'duplicate_count'])
Expected output:
      name        duplicates  duplicate_count
0   vikash           [vikas]                1
1    vikas  [vikash, vikahs]                2
2    Vinod          [Vinodh]                1
3    Vikky                []                0
4    Akash           [Akash]                1
5   Vinodh           [Vinod]                1
6   Sachin                []                0
7   Salman                []                0
8     Ajay                []                0
9   Suchin                []                0
10   Akash           [Akash]                1
11  vikahs           [vikas]                1
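The rapidfuzz library has built-in acceleration that uses the CPU's parallel processing capability to speed things up.
The workers parameter enables parallel processing. With workers=-1, all available cores are used.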
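import numpy as np
import pandas as pd
from rapidfuzz.process import cdist

# `elements` is the list of names defined in the question

# Score every name against every other name; scores below 90 are reported as 0
sa = cdist(elements, elements, score_cutoff=90, workers=-1)

duplicates_list = []
for distances in sa:
    # Get indices of duplicates: drop 0 (below the cutoff) and 100 (the self-match).
    # Note: dropping 100 also drops exact duplicates, such as the two 'Akash' entries.
    indices = np.argwhere(~np.isin(distances, [100, 0])).flatten()
    # Get names from indices
    names = list(map(elements.__getitem__, indices))
    duplicates_list.append(names)

# Create dataframe using the data
df = pd.DataFrame({'name': elements, 'duplicates': duplicates_list})
df['duplicate_count'] = df.duplicates.str.len()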
Output:
      name        duplicates  duplicate_count
0   vikash           [vikas]                1
1    vikas  [vikash, vikahs]                2
2    Vinod          [Vinodh]                1
3    Vikky                []                0
4    Akash                []                0
5   Vinodh           [Vinod]                1
6   Sachin                []                0
7   Salman                []                0
8     Ajay                []                0
9   Suchin                []                0
10   Akash                []                0
11  vikahs           [vikas]                1