I have two JSON files, data_large (150.1 MB) and data_small (7.5 KB). Each file contains data of the form [{"score": 68},{"score": 78}]. I need to get the list of unique scores from each file.
For data_small, I did the following and it finished in 0.1 secs:
import json

with open('data_small') as f:
    content = json.load(f)
print(content)  # I'll be applying the logic to find the unique values later.
But when I did the same with data_large, my system hung and became very slow, and I had to force-quit the process to get it back to normal speed. Printing its contents took about 2 mins:
import json

with open('data_large') as f:
    content = json.load(f)
print(content)  # I'll be applying the logic to find the unique values later.
How can I make my program more efficient when working with large datasets?
Since your JSON file is not that large and you can load it into RAM in one go, you can collect all the unique values like this:
import json

with open('data_large') as f:
    content = json.load(f)

# Do not print content: writing 150 MB to stdout is very slow.

# Collect the unique values.
values = set()
for item in content:
    values.add(item['score'])

# The loop above uses less memory than the one-liner below,
# which first has to build a list of all the scores and only
# then filter it down to the unique values:
# values = set([i['score'] for i in content])

# It's faster to save the results to a file than to print them.
with open('results.json', 'w') as fid:
    # json can't serialize sets, hence the conversion to a list
    json.dump(list(values), fid)
If you need to handle even larger files, look into libraries (such as ijson) that can parse the JSON file iteratively instead of loading it all at once.
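For reference, iterative parsing can even be sketched with just the standard library: `json.JSONDecoder.raw_decode` can pull one array element at a time out of a buffer that is refilled in chunks, so the full 150 MB list never has to live in memory as Python objects at once. This is a minimal sketch, not a robust parser; it assumes the file is a top-level JSON array of flat objects, and the function name `iter_scores` and the chunk size are illustrative choices (a library like ijson does the same job with proper event-based parsing):

```python
import json

def iter_scores(path, chunk_size=65536):
    """Yield the 'score' of each object in a top-level JSON array,
    reading the file in chunks instead of loading it whole."""
    decoder = json.JSONDecoder()
    buf = ''
    with open(path) as f:
        # Drop everything up to and including the opening '['.
        while '[' not in buf:
            more = f.read(chunk_size)
            if not more:
                return  # empty or malformed file
            buf += more
        buf = buf[buf.index('[') + 1:]
        while True:
            # Skip whitespace and the commas between elements.
            buf = buf.lstrip(' \t\r\n,')
            if buf.startswith(']'):
                return  # end of the array
            try:
                # Decode exactly one element from the front of the buffer.
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                # Element is split across a chunk boundary: read more.
                more = f.read(chunk_size)
                if not more:
                    return
                buf += more
                continue
            yield obj['score']
            buf = buf[end:]  # discard the consumed element
```

With a generator like this, the unique values can be built incrementally, e.g. `values = set(iter_scores('data_large'))`, without ever holding the whole parsed list.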