Вла*_*мир (tags: python, pandas, dask, vaex)
I have 2 text files (*.txt) containing unique strings in the following format:
udtvbacfbbxfdffzpwsqzxyznecbqxgebuudzgzn:refmfxaawuuilznjrxuogrjqhlmhslkmprdxbascpoxda
ltswbjfsnejkaxyzwyjyfggjynndwkivegqdarjg:qyktyzugbgclpovyvmgtkihxqisuawesmcvsjzukcbrzi
The first file contains 50 million lines (4.3 GB), the second 1 million lines (112 MB). Each line consists of 40 characters, the delimiter :, and another 45 characters.

Task: find the values unique to the second file, i.e. produce a csv or txt file with the lines that are in the second file but not in the first.

I am trying to do this with vaex (Vaex):
import vaex

base_files = ['file1.txt']
for i, txt_file in enumerate(base_files, 1):
    for j, dv in enumerate(vaex.from_csv(txt_file, chunk_size=5_000_000, names=['data']), 1):
        dv.export_hdf5(f'hdf5_base/base_{i:02}_{j:02}.hdf5')

check_files = ['file2.txt']
for i, txt_file in enumerate(check_files, 1):
    for j, dv in enumerate(vaex.from_csv(txt_file, chunk_size=5_000_000, names=['data']), 1):
        dv.export_hdf5(f'hdf5_check/check_{i:02}_{j:02}.hdf5')

dv_base = vaex.open('hdf5_base/*.hdf5')
dv_check = vaex.open('hdf5_check/*.hdf5')

dv_result = dv_check.join(dv_base, on='data', how='inner', inplace=True)
dv_result.export(path='result.csv')
As a result, I get a result.csv file with the unique row values. But the check takes a very long time, and it uses all available RAM and all CPU resources. How can I speed this up? What am I doing wrong? What could be done better? Would other libraries (pandas, dask) be faster for this check?
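If pandas is an option, the same set difference can be computed without loading the 4.3 GB file at once by streaming it with chunksize. A minimal sketch with toy files (the file names and contents here are purely illustrative):

```python
import pandas as pd
from pathlib import Path

# Tiny illustrative stand-ins for the real files
Path("check.txt").write_text("aaa:111\nbbb:222\nccc:333\n")
Path("base.txt").write_text("bbb:222\nddd:444\n")

# Load the small file entirely into a set, then stream the big one in chunks
check = set(pd.read_csv("check.txt", header=None, names=["data"])["data"])
for chunk in pd.read_csv("base.txt", header=None, names=["data"], chunksize=1_000_000):
    check -= set(chunk["data"])

print(sorted(check))  # → ['aaa:111', 'ccc:333']
```

Only the 1-million-line file has to fit in memory as a set; the 50-million-line file is only ever held one chunk at a time.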
UPD 10.11.2020: So far I have not found anything faster than the following variant:
from io import StringIO

def read_lines(text):
    # iterate over the lines of an in-memory string
    handle = StringIO(text)
    for line in handle:
        yield line.rstrip('\n')

def read_in_chunks(file_obj, chunk_size=10485760):
    # yield ~10 MB chunks without ever splitting a line: the trailing
    # partial line of each read is carried over into the next chunk
    leftover = ''
    while True:
        data = file_obj.read(chunk_size)
        if not data:
            if leftover:
                yield leftover
            break
        data = leftover + data
        data, _, leftover = data.rpartition('\n')
        yield data + '\n'

file_check = open('check.txt', 'r', errors='ignore').read()
check_set = set(read_lines(file_check))

with open(file='base.txt', mode='r', errors='ignore') as file_base:
    for idx, chunk in enumerate(read_in_chunks(file_base), 1):
        print(f'Checked [{idx}0 Mb]')
        for elem in read_lines(chunk):
            check_set.discard(elem)

print(f'Unique rows: [{len(check_set)}]')
UPD 11.11.2020: Thanks to @m9_psy for the performance tip. It really is faster! The fastest variant so far:
from io import BytesIO

check_set = {elem for elem in BytesIO(open('check.txt', 'rb').read())}

with open('base.txt', 'rb') as file_base:
    for line in file_base:
        if line in check_set:
            check_set.remove(line)

print(f'Unique rows: [{len(check_set)}]')
Is there a way to speed this up even further?
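For completeness, here is a self-contained toy run of the byte-level variant above that also writes the surviving lines to disk (the toy file contents and the result.txt output name are assumptions):

```python
from pathlib import Path

# Tiny stand-ins for the real 112 MB / 4.3 GB files
Path("check.txt").write_bytes(b"aaa:111\nbbb:222\nccc:333\n")
Path("base.txt").write_bytes(b"bbb:222\nddd:444\n")

# Read the small file's lines (as bytes, trailing newline included) into a set
with open("check.txt", "rb") as f:
    check_set = set(f)

# Stream the big file line by line, dropping every line also seen in check
with open("base.txt", "rb") as file_base:
    for line in file_base:
        check_set.discard(line)  # discard() skips the separate membership test

# Whatever survives is unique to check.txt
with open("result.txt", "wb") as out:
    out.writelines(sorted(check_set))

print(len(check_set))  # → 2
```

Comparing raw bytes avoids decoding 4.3 GB of text, which is where much of the speedup of this variant comes from.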
I suspect that the join operation requires n * m comparisons, where n and m are the lengths of the two dataframes.
Also, there is an inconsistency between your description and your code: you want the rows that are in dv_check but not in dv_base, yet you call dv_check.join(dv_base, on='data', how='inner', inplace=True)? An inner join keeps the rows that are in both dv_check and dv_base. In any case, one idea is to use a set, because checking membership in a set is O(1), while checking membership in a list is O(n). If you are familiar with the SQL world, this is equivalent to moving from a LOOP JOIN strategy to a HASH JOIN strategy:
# This will take care of removing the duplicates
base_set = set(dv_base['data'])
check_set = set(dv_check['data'])
# In `dv_check` but not `dv_base`
keys = check_set - base_set
# In both `dv_check` and `dv_base`
keys = check_set & base_set
This only gives you the keys that satisfy your condition. You still need to filter the two dataframes with them to get the other attributes.
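For instance, with pandas the key set can be pushed back through isin to recover the remaining columns (toy data; the extra column is a hypothetical second attribute):

```python
import pandas as pd

# Toy stand-in for dv_check with one extra attribute column
dv_check = pd.DataFrame({"data": ["a", "b", "c"], "extra": [1, 2, 3]})
base_set = {"b"}

# Keys in check but not in base, via set difference
keys = set(dv_check["data"]) - base_set

# Filter the frame back down to those keys to recover the other columns
result = dv_check[dv_check["data"].isin(keys)]
print(result["extra"].tolist())  # → [1, 3]
```

The same pattern applies on the base side if you also need its attributes for the intersection keys.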
Done in 1 min 14 s on a 2014 iMac with 16 GB of RAM.

Let's generate a dataset that mimics your example:
import vaex
import numpy as np
N = 50_000_000 # 50 million rows for base
N2 = 1_000_000 # 1 million for check
M = 40+1+45 # chars for each string
N_dup = 10_000 # number of duplicate rows in the checks
s1 = np.random.randint(ord('a'), ord('z'), (N, M), np.uint32).view(f'U{M}').reshape(N)
s2 = np.random.randint(ord('a'), ord('z'), (N2, M), np.uint32).view(f'U{M}').reshape(N2)
# make sure s2 has rows that match s1
dups = np.random.choice(N2, N_dup, replace=False)
s2[dups] = s1[np.random.choice(N, N_dup, replace=False)]
# save the data to disk
vaex.from_arrays(s=s1).export('/data/tmp/base.hdf5')
vaex.from_arrays(s=s2).export('/data/tmp/check.hdf5')
Now, to find the rows of check that are not in base, we left-join them; a row of check with no match in base ends up with a missing s_other:

import vaex
base = vaex.open('/data/tmp/base.hdf5')
check = vaex.open('/data/tmp/check.hdf5')
# left join: rows of check without a match in base get a missing s_other
joined = check.join(base, on='s', how='left', rsuffix='_other')
# the rows unique to check are those where s_other is missing
unique = joined[joined['s_other'].ismissing()]
# sanity check: the duplicated rows are those where s_other is present
dups = joined.dropmissing(['s_other'])
dups
# s s_other
0 'hvxursyijiehidlmtqwpfawtuwlmflvwwdokmuvxqyujfh... 'hvxursyijiehidlmtqwpfawtuwlmflvwwdokmuvxqyujfhb...
1 'nslxohrqydxyugngxhvtjwptjtsyuwaljdnprwfjnssikh... 'nslxohrqydxyugngxhvtjwptjtsyuwaljdnprwfjnssikhh...
2 'poevcdxjirulnktmvifdbdaonjwiellqrgnxhbolnjhact... 'poevcdxjirulnktmvifdbdaonjwiellqrgnxhbolnjhactn...
3 'xghcphcvwswlsywgcrrwxglnhwtlpbhlnqhjgsmpivghjk... 'xghcphcvwswlsywgcrrwxglnhwtlpbhlnqhjgsmpivghjku...
4 'gwmkxxqkrfjobkpciqpdahdeuqfenrorqrwajuqdgluwvb... 'gwmkxxqkrfjobkpciqpdahdeuqfenrorqrwajuqdgluwvbs...
... ... ...
9,995 'uukjkyaxbjqvmwscnhewxpdgwrhosipoelbhsdnbpjxiwn... 'uukjkyaxbjqvmwscnhewxpdgwrhosipoelbhsdnbpjxiwno...
9,996 'figbmhruheicxkmuqbbnuavgabdlvxxjfudavspdncogms... 'figbmhruheicxkmuqbbnuavgabdlvxxjfudavspdncogmsb...
9,997 'wwgykvwckqqttxslahcojcplnxrjsijupswcyekxooknji... 'wwgykvwckqqttxslahcojcplnxrjsijupswcyekxooknjii...
9,998 'yfopgcfpedonpgbeatweqgweibdesqkgrxwwsikilvvvmv... 'yfopgcfpedonpgbeatweqgweibdesqkgrxwwsikilvvvmvo...
9,999 'qkavooownqwtpbeqketbvpcvxlliptitespfqkcecidfeb... 'qkavooownqwtpbeqketbvpcvxlliptitespfqkcecidfebi...