Fastest way to read thousands of JSON files in Python

Question


mea*_*ngs 7 python json ipython python-3.x

I have a number of JSON files that I need to analyze. I am using iPython (Python 3.5.2 | IPython 5.0.0) to read each file into a dictionary and append each dictionary to a list.

My main bottleneck is reading the files. Some files are small and read quickly, but the larger ones slow me down.

Here is some example code (sorry, I can't provide the actual data files):

import json
import glob

def read_json_files(path_to_file):
    with open(path_to_file) as p:
        data = json.load(p)
        p.close()
    return data

def giant_list(json_files):
    data_list = []
    for f in json_files:
        data_list.append(read_json_files(f))
    return data_list

support_files = glob.glob('/Users/path/to/support_tickets_*.json')
small_file_test = giant_list(support_files)

event_files = glob.glob('/Users/path/to/google_analytics_data_*.json')
large_file_test = giant_list(event_files)


The support tickets are very small; the largest I've seen is 6KB. So this code runs quite fast:

In [3]: len(support_files)
Out[3]: 5278

In [5]: %timeit giant_list(support_files)
1 loop, best of 3: 557 ms per loop


But the larger files definitely slow me down... these event files can reach ~2.5MB each:

In [7]: len(event_files) # there will be a lot more of these soon :-/
Out[7]: 397

In [8]: %timeit giant_list(event_files)
1 loop, best of 3: 14.2 s per loop


I looked into how to speed this up and came across this post; however, with UltraJSON the timing was slightly worse:

In [3]: %timeit giant_list(traffic_files)
1 loop, best of 3: 16.3 s per loop


SimpleJSON didn't do any better:

In [4]: %timeit giant_list(traffic_files)
1 loop, best of 3: 16.3 s per loop


Any tips on how to optimize this code and read a large number of JSON files into Python more efficiently would be much appreciated.

Finally, this post is the closest I've found to my question, but it deals with one giant JSON file rather than many smaller ones.

Answer 1

Łuk*_*ski 7

Use a list comprehension instead of appending in a loop; it avoids the per-iteration append lookup and call.

def giant_list(json_files):
    return [read_json_file(path) for path in json_files]


You are closing the file object twice; once is enough (the file is closed automatically when the with block exits).

def read_json_file(path_to_file):
    with open(path_to_file) as p:
        return json.load(p)

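If you want to convince yourself that the explicit close() is unnecessary, a minimal sketch like the one below checks the file object's closed attribute inside and after the with block. It uses a throwaway temp file with made-up contents (sample_path and the ticket_id key are placeholders, not your data).

```python
import json
import os
import tempfile

# Write a small throwaway JSON file (hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"ticket_id": 1}, tmp)
    sample_path = tmp.name

with open(sample_path) as p:
    data = json.load(p)
    closed_inside = p.closed  # the file is still open inside the block

closed_after = p.closed  # the with block has closed it on exit

os.remove(sample_path)
print(closed_inside, closed_after)  # prints: False True
```

Calling close() on an already closed file is harmless, so the original code was not broken, only redundant.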

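Ultimately, your problem is I/O bound, but these changes will help a bit. Also, I have to ask: do you really need to have all of these dictionaries in memory at the same time?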
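If the answer to that last question is no, one option is a generator that parses one file at a time, so only the current document sits in memory. This is a rough sketch under that assumption, not a drop-in replacement: iter_json_files and count_records are hypothetical names, and the len() call is only a stand-in for whatever analysis you actually run on each document.

```python
import json


def iter_json_files(json_files):
    """Yield one parsed document at a time instead of building a list."""
    for path in json_files:
        with open(path) as p:
            yield json.load(p)


def count_records(json_files):
    """Hypothetical aggregate: total number of top-level items across files."""
    # Each document becomes garbage-collectable as soon as its length is taken.
    return sum(len(doc) for doc in iter_json_files(json_files))
```

Used with the globbed paths from the question, that would look like count_records(event_files). This does not make parsing any faster; it only lowers peak memory, which matters more as the number of 2.5MB files grows.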

Archived:	8 years, 11 months ago
Views:	10011
Last activity:	8 years, 11 months ago