Poe*_*dit 0 python performance time for-loop pandas
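I have a Python script that loops over a list of JSON documents and appends one row per document to a pandas DataFrame (see the code below).
The problem is that the processing time increases with every iteration. Specifically: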
0-1000 documents -> 5 seconds
1000-2000 documents -> 6 seconds
2000-3000 documents -> 7 seconds
...
10000-11000 documents -> 18 seconds
11000-12000 documents -> 19 seconds
...
22000-23000 documents -> 39 seconds
23000-24000 documents -> 42 seconds
...
34000-35000 documents -> 69 seconds
35000-36000 documents -> 72 seconds
Why is this happening?
My code looks like this:
import pandas as pd

# 'documents' is the list of JSON documents
columns = ['column_1', 'column_2', ..., 'column_19', 'column_20']
df_documents = pd.DataFrame(columns=columns)
for index, document in enumerate(documents):
    dict_document = dict.fromkeys(columns)
    # ...
    # (parse the JSON, retrieve the values of the keys, and assign them to the dictionary)
    # ...
    df_documents = df_documents.append(dict_document, ignore_index=True)
P.S.
After applying @eumiro's suggestion, the times are as follows:
0-1000 documents -> 0.06 seconds
1000-2000 documents -> 0.05 seconds
2000-3000 documents -> 0.05 seconds
...
10000-11000 documents -> 0.05 seconds
11000-12000 documents -> 0.05 seconds
...
22000-23000 documents -> 0.05 seconds
23000-24000 documents -> 0.05 seconds
...
34000-35000 documents -> 0.05 seconds
35000-36000 documents -> 0.05 seconds
After applying @DariuszKrynicki's suggestion, the times are as follows:
0-1000 documents -> 0.56 seconds
1000-2000 documents -> 0.54 seconds
2000-3000 documents -> 0.53 seconds
...
10000-11000 documents -> 0.51 seconds
11000-12000 documents -> 0.51 seconds
...
22000-23000 documents -> 0.51 seconds
23000-24000 documents -> 0.51 seconds
...
34000-35000 documents -> 0.51 seconds
35000-36000 documents -> 0.51 seconds
...
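Yes, calling append on a DataFrame after every row gets slower and slower, because each call has to copy the entire (growing) contents all over again.
Build a plain list, append to that list, and then create the DataFrame in a single step: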
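As a rough illustration of why the per-batch times in the question climb steadily, here is a back-of-the-envelope cost model, not a measurement. It assumes every append copies all rows already in the DataFrame; the function name rows_copied_in_batch is made up for this sketch.

```python
def rows_copied_in_batch(start, stop):
    """Rows copied while appending rows start..stop-1 one at a time.

    Assumption: each append copies every row that already exists,
    so appending row i copies i existing rows.
    """
    return sum(range(start, stop))


# Compare three of the 1000-document batches from the question.
for start in (0, 10_000, 35_000):
    copied = rows_copied_in_batch(start, start + 1000)
    print(f"{start}-{start + 1000} documents -> {copied} rows copied")
```

Under this assumption the work per batch grows linearly with the number of rows already stored, so the total work grows quadratically, which matches the shape of the timings reported in the question.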
records = []
for index, document in enumerate(documents):
    …
    records.append(dict_document)
df_documents = pd.DataFrame.from_records(records)
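For completeness, here is a minimal self-contained sketch of the same pattern. The sample documents, the two column names, and the assumption that each document is a JSON string with top-level keys are all hypothetical; the parsing step elided in the question will differ.

```python
import json

import pandas as pd

# Hypothetical sample input: the question only says 'documents' is a list of JSONs.
documents = [
    '{"column_1": 1, "column_2": "a"}',
    '{"column_1": 2}',
]
columns = ["column_1", "column_2"]

records = []
for raw in documents:
    parsed = json.loads(raw)
    # Start with every key set to None, as dict.fromkeys does in the question.
    dict_document = dict.fromkeys(columns)
    for key in columns:
        if key in parsed:
            dict_document[key] = parsed[key]
    # Appending to a Python list does not copy the rows collected so far.
    records.append(dict_document)

# Build the DataFrame once, after the loop.
df_documents = pd.DataFrame.from_records(records, columns=columns)
print(df_documents)
```

Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the loop from the question fails outright. Collecting rows in a list and constructing the DataFrame once works on both old and new versions.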