Efficient way to parse a large file

Question


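I have to parse a very large file, modify its contents, and write the result to another file. The file I have right now is not that big compared with what it could be, but it is still large.

The file is 1.3 GB and contains about 7 million lines in this format: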

8823192\t/home/pcastr/...


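where \t is a tab character. The number at the start is the apparent size of the path that follows.

I want an output file whose lines look like this (in CSV format):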

True,8823192,/home/pcastr/...


The first value indicates whether the path is a directory.

Currently, my code looks like this:

import os

# Python 2 code; streamed_file is the already-open output file (not shown in the question).
with open(filepath, "r") as open_file:
    while True:
        line = open_file.readline()
        if line == "":  # Checks for the end of the file
            break
        size = line.split("\t")[0]
        path = line.strip().split("\t")[1]
        is_dir = os.path.isdir(path)

        streamed_file.write(unicode("{isdir},{size},{path}\n".format(isdir=is_dir, size=size, path=path)))


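Note that files like this can get extremely large, so I need a solution that is not only fast but also memory-efficient. I know there is usually a trade-off between these two qualities.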

Answer 1

che*_*ner 7

The biggest gain will probably come from calling split only once per line:

size, path = line.strip().split("\t")
# or ...split("\t", 3)[0:2] if there are extra fields to ignore
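
As a variation on the single-split idea (not part of the original answer), here is a minimal Python 3 sketch that uses str.partition instead of split. The helper name parse_line is hypothetical. It strips only the line ending, so a path that starts or ends with a space, or that itself contains a tab, would be kept intact:

```python
def parse_line(line):
    """Split one 'size<TAB>path' line into (size, path) in a single pass."""
    # Remove only the line ending so whitespace inside the path is preserved.
    # partition splits at the first tab only; later tabs stay in the path.
    size, _, path = line.rstrip("\r\n").partition("\t")
    return size, path


if __name__ == "__main__":
    print(parse_line("8823192\t/home/pcastr/...\n"))
```

Whether this is measurably faster than a single split call would need to be timed on the real data.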


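You can at least simplify the code by treating the input file as an iterator and using the csv module. This may also give you a speed-up, since it avoids the explicit call to split: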

import csv
import os

with open(filepath, "r") as open_file:
    reader = csv.reader(open_file, delimiter="\t")
    writer = csv.writer(streamed_file)
    # Each row is read lazily, so only one line is held in memory at a time.
    for size, path in reader:
        is_dir = os.path.isdir(path)
        writer.writerow([is_dir, size, path])
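
For completeness, the csv-based approach above could be wrapped into a self-contained function along these lines. This is a sketch under assumptions rather than the answerer's code: it targets Python 3, the names convert, in_path and out_path are made up for illustration, every input line is assumed to have exactly two tab-separated fields, and the is_dir parameter exists only so the check can be swapped out when testing.

```python
import csv
import os


def convert(in_path, out_path, is_dir=os.path.isdir):
    """Stream a 'size<TAB>path' file into an 'is_dir,size,path' CSV file."""
    # newline="" lets the csv module handle line endings itself.
    with open(in_path, "r", newline="") as src, \
         open(out_path, "w", newline="") as dst:
        # QUOTE_NONE: treat a double quote in a path as an ordinary character.
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst)
        # Rows are produced one at a time, so the whole input file is
        # never loaded into memory.
        for size, path in reader:
            writer.writerow([is_dir(path), size, path])
```

It would be called as convert("sizes.tsv", "sizes.csv"), where both file names are placeholders. With this many lines, the os.path.isdir call (one filesystem lookup per line) may well dominate the run time, so it is worth profiling before tuning the parsing any further.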

