I have two JSON files, data_large (150.1 MB) and data_small (7.5 KB). Each file contains data of the form [{"score": 68},{"score": 78}]. I need to get the list of unique scores from each file.
For data_small, I did the following and it finished in 0.1 secs:
import json

with open('data_small') as f:
    content = json.load(f)
print(content)  # I'll be applying the logic to find the unique values later.
But when I did the same with data_large, my system hung and became very slow, and I had to force-quit the process to get it back to normal speed. Printing its contents took about 2 mins:
import json

with open('data_large') as f:
    content = json.load(f)
print(content)  # I'll be applying the logic to find the unique values later.
How can I make my program more efficient when working with large datasets?
Since your JSON file is not that large and you can load it into RAM in one go, you can collect all the unique values like this:
import json

with open('data_large') as f:
    content = json.load(f)

# Do not print content: writing 150 MB to stdout is very slow.

# Collect the unique values.
values = set()
for item in content:
    values.add(item['score'])

# The loop above uses less memory than the one-liner below,
# which first has to build a list of all the scores and only
# then filter it down to the unique values:
# values = set([i['score'] for i in content])

# It's faster to save the results to a file than to print them.
with open('results.json', 'w') as fid:
    # json can't serialize sets, hence the conversion to a list
    json.dump(list(values), fid)
If you need to handle even larger files, look into libraries (such as ijson) that can parse the JSON file iteratively instead of loading it all at once.
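For reference, iterative parsing can even be sketched with just the standard library: `json.JSONDecoder.raw_decode` can pull one array element at a time out of a buffer that is refilled in chunks, so the full 150 MB list never has to live in memory as Python objects at once. This is a minimal sketch, not a robust parser; it assumes the file is a top-level JSON array of flat objects, and the function name `iter_scores` and the chunk size are illustrative choices (a library like ijson does the same job with proper event-based parsing):

```python
import json

def iter_scores(path, chunk_size=65536):
    """Yield the 'score' of each object in a top-level JSON array,
    reading the file in chunks instead of loading it whole."""
    decoder = json.JSONDecoder()
    buf = ''
    with open(path) as f:
        # Drop everything up to and including the opening '['.
        while '[' not in buf:
            more = f.read(chunk_size)
            if not more:
                return  # empty or malformed file
            buf += more
        buf = buf[buf.index('[') + 1:]
        while True:
            # Skip whitespace and the commas between elements.
            buf = buf.lstrip(' \t\r\n,')
            if buf.startswith(']'):
                return  # end of the array
            try:
                # Decode exactly one element from the front of the buffer.
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                # Element is split across a chunk boundary: read more.
                more = f.read(chunk_size)
                if not more:
                    return
                buf += more
                continue
            yield obj['score']
            buf = buf[end:]  # discard the consumed element
```

With a generator like this, the unique values can be built incrementally, e.g. `values = set(iter_scores('data_large'))`, without ever holding the whole parsed list.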