Rob*_*ler 6 python dictionary pandas
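The following code works but needs to run faster. The dictionary has ~25K keys and the DataFrame has ~3M rows. Is there a way to produce the same result with Python code that runs faster? (Without multiprocessing it is 8 times slower.)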
import datetime
import multiprocessing

import numpy as np
import pandas as pd

miscdict = {" isn't ": ' is not ', " aren't ": ' are not ', " wasn't ": ' was not ', " snevada ": ' Sierra Nevada '}
df = pd.DataFrame({"q1": ["beer is ok", "beer isn't ok", "beer wasn't available", " snevada is good"]})

def parse_text(data):
    # Apply every replacement to the whole column, one key at a time.
    for key, replacement in miscdict.items():
        data['q1'] = data['q1'].str.replace(key, replacement)
    return data

if __name__ == '__main__':
    t1_1 = datetime.datetime.now()
    p = multiprocessing.Pool(processes=8)
    # Split the frame into 8 chunks and process them in parallel.
    split_dfs = np.array_split(df, 8)
    pool_results = p.map(parse_text, split_dfs)
    p.close()
    p.join()
    parts = pd.concat(pool_results, axis=0)
    df = pd.concat([parts], axis=1)
    t2_1 = datetime.datetime.now()
    print("done" + str(t2_1 - t1_1))
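I tested some of these. @A-Za-z's suggestion is a major improvement, but it can be done faster still.
Edit: I re-ran the tests with the replacement dictionary and the DataFrame precomputed (and the regular expressions precompiled). The new timings are:
Original results, where data generation and regex compilation were included in the timing:
"Testing your code I got 15 seconds, @A-Za-z's code gave 8-9 seconds, and my own solution brought it down to 6 seconds. It uses precompiled regular expressions. See the end of this answer."
Imports: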
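The gap between the two sets of timings comes down to what sits inside the timed code. As a minimal sketch (not the benchmark used in this answer; the `texts` list and the shortened dictionary are made up for illustration), `timeit.timeit` accepts a `setup` string that runs once and is excluded from the measurement, so data generation and regex compilation can be kept out of the reported time:

```python
import timeit

# Everything in `setup` runs once and is excluded from the measurement.
setup = """
import re
miscdict = {" isn't ": " is not ", " wasn't ": " was not "}
compiled = {re.compile(k): v for k, v in miscdict.items()}
texts = ["beer isn't ok", "beer wasn't available"] * 100
"""

# Only this statement is timed.
stmt = """
for text in texts:
    for pattern, replacement in compiled.items():
        text = pattern.sub(replacement, text)
"""

elapsed = timeit.timeit(stmt, setup=setup, number=100)
print(f"{elapsed:.4f} s for 100 runs")
```

Moving the `re.compile` line from `setup` into `stmt` would put the compilation cost back into the measured time.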
import pandas as pd
import re
import timeit
Your original code:
miscdict = {" isn't ": ' is not ', " aren't ": ' are not ', " wasn't ": ' was not ', " snevada ": ' Sierra Nevada '}
data = pd.DataFrame({"q1": ["beer is ok", "beer isn't ok", "beer wasn't available", " snevada is good"]})

def org(printout=False):
    def parse_text(data):
        for key, replacement in miscdict.items():
            data['q1'] = data['q1'].str.replace(key, replacement)
        return data
    data2 = parse_text(data)
    if printout:
        print(data2)

org(printout=True)
print(timeit.timeit(org, number=10000))
This took 11.7 seconds:
q1
0 beer is ok
1 beer is not ok
2 beer was not available
3 Sierra Nevada is good
11.71043858179268
User @A-Za-z's code:
miscdict = {" isn't ": ' is not ', " aren't ": ' are not ', " wasn't ": ' was not ', " snevada ": ' Sierra Nevada '}
data = pd.DataFrame({"q1": ["beer is ok", "beer isn't ok", "beer wasn't available", " snevada is good"]})

def alt1(printout=False):
    data['q1'].replace(miscdict, regex = True, inplace = True)
    if printout:
        print(data)

alt1(printout=True)
print(timeit.timeit(alt1, number=10000))
This took 4.7 seconds:
q1
0 beer is ok
1 beer is not ok
2 beer was not available
3 Sierra Nevada is good
4.721581550644499
User @piRSquared's code:
miscdict = {" isn't ": ' is not ', " aren't ": ' are not ', " wasn't ": ' was not ', " snevada ": ' Sierra Nevada '}
data = pd.DataFrame({"q1": ["beer is ok", "beer isn't ok", "beer wasn't available", " snevada is good"]})

def alt2(printout=False):
    # Rebind the module-level frame; without this declaration the assignment
    # below would make `data` local and raise UnboundLocalError.
    global data
    # regex = True is added later because it doesn't work without it.
    data = data.replace(miscdict, regex = True)
    if printout:
        print(data)

alt2(printout=True)
print(timeit.timeit(alt2, number=10000))
This took 5.0 seconds:
q1
0 beer is ok
1 beer is not ok
2 beer was not available
3 Sierra Nevada is good
4.951810616074919
My code (precompiled regular expressions):
miscdict = {" isn't ": ' is not ', " aren't ": ' are not ', " wasn't ": ' was not ', " snevada ": ' Sierra Nevada '}
# Compile each pattern once, outside the timed function.
miscdict_comp = {re.compile(k): v for k, v in miscdict.items()}
data = pd.DataFrame({"q1": ["beer is ok", "beer isn't ok", "beer wasn't available", " snevada is good"]})

def alt3(printout=False):
    def parse_text(text):
        for pattern, replacement in miscdict_comp.items():
            text = pattern.sub(replacement, text)
        return text
    data["q1"] = data["q1"].apply(parse_text)
    if printout:
        print(data)

alt3(printout=True)
print(timeit.timeit(alt3, number=10000))
This took 2.8 seconds:
q1
0 beer is ok
1 beer is not ok
2 beer was not available
3 Sierra Nevada is good
2.810334940701157
The idea is to precompile the patterns you want to replace.
I got the idea from here: https://jerel.co/blog/2011/12/using-python-for-super-fast-regex-search-and-replace
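One variation on the precompiling idea, which I have not timed and which is not part of the benchmark above, is to compile a single alternation of all keys and look each match up in the dictionary, so every string is scanned once instead of once per key. The sketch below assumes the keys should match literally (hence `re.escape`); the name `replace_all` is made up for illustration. Be aware that a single pass is not always equivalent to the loop: when two keys overlap in the text (for example, they share the space between two adjacent words), or when a replacement produces text that matches another key, the results can differ.

```python
import re

# Same sample dictionary as in the question.
miscdict = {" isn't ": " is not ", " aren't ": " are not ",
            " wasn't ": " was not ", " snevada ": " Sierra Nevada "}

# Build one alternation of all keys, escaped so they match literally.
# Longer keys go first so a key that is a prefix of another cannot shadow it.
keys = sorted(miscdict, key=len, reverse=True)
combined = re.compile("|".join(re.escape(k) for k in keys))

def replace_all(text):
    # Replace every matched key with its dictionary value in a single pass.
    return combined.sub(lambda m: miscdict[m.group(0)], text)

samples = ["beer is ok", "beer isn't ok", "beer wasn't available", " snevada is good"]
results = [replace_all(s) for s in samples]
print(results)
```

On the DataFrame from this answer it would presumably be used as `data["q1"] = data["q1"].apply(replace_all)`.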
You don't need a loop here; df.replace with regex = True does the job, and it cuts the time by more than half.
df['q1'].replace(miscdict, regex = True, inplace = True)
1000 loops, best of 3: 1.08 ms per loop
This gives you
q1
0 beer is ok
1 beer is not ok
2 beer was not available
3 Sierra Nevada is good
Compare that with the current solution:
for key, replacement in miscdict.items():
    df['q1'] = df['q1'].str.replace(key, replacement)
100 loops, best of 3: 2.35 ms per loop
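One thing to watch with regex = True: the dictionary keys are then interpreted as regular expressions, not plain strings. The sample keys contain no regex metacharacters, but with ~25K keys some probably will. A minimal sketch of escaping them first with `re.escape` (the " (ipa) " key is a made-up example, not from the question):

```python
import re

import pandas as pd

# Hypothetical key with regex metacharacters (parentheses).
miscdict = {" (ipa) ": " India Pale Ale "}
s = pd.Series(["beer (ipa) is ok"])

# Unescaped: "(ipa)" is read as a capture group, so the pattern looks for
# " ipa " and never matches the literal parentheses in the text.
naive_result = s.replace(miscdict, regex=True)

# Escaped: every key is matched literally.
escaped = {re.escape(k): v for k, v in miscdict.items()}
escaped_result = s.replace(escaped, regex=True)

print(naive_result.tolist())
print(escaped_result.tolist())
```

Escaping only makes sense if none of the keys are meant to be patterns; if some are, those would need to be kept separate.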