Splitting a large text file into smaller text files by line number using Python

Question

I have a text file, say really_big_file.txt, that contains:

line 1
line 2
line 3
line 4
...
line 99999
line 100000

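I want to write a Python script that divides really_big_file.txt into smaller files with 300 lines each. For example, small_file_300.txt would hold lines 1-300, small_file_600 would hold lines 301-600, and so on, until there are enough small files to contain all the lines from the big file.

I would appreciate any suggestions on the easiest way to accomplish this using Python.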

Answer 1

Mat*_*son 28

lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()

Nice, short code, and it works like a charm (2 upvotes)

Answer 2

jam*_*lak 21

Use the itertools grouper recipe:

try:
    from itertools import zip_longest  # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

n = 300

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}'.format(i * n), 'w') as fout:
            fout.writelines(g)

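Compared with storing every line in a list, the advantage of this method is that it works with iterables, line by line, so the whole big file never has to be held in memory at once; only one group of n lines is held at a time.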
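
To see what the grouper recipe produces, here is a small sketch that runs it on a toy list instead of a file. The sample data is made up for illustration, and the Python 3 name zip_longest is assumed.

```python
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    """Collect data into fixed-length chunks, padding the last one."""
    args = [iter(iterable)] * n  # n references to the SAME iterator
    return zip_longest(*args, fillvalue=fillvalue)

# Hypothetical sample data standing in for the lines of a file.
sample_lines = ['line {}\n'.format(i) for i in range(1, 8)]

groups = list(grouper(3, sample_lines, fillvalue=''))
for group in groups:
    print(group)
```

The last group is padded with empty strings, which is why writelines() can write it without adding any extra text.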

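Note that in this case the last file will be small_file_100200, but it will only go up to line 100000. This happens because of fillvalue='': when the group size does not divide evenly and there are no lines left, the padding writes nothing to the file, yet the name is still computed from i * n. You can fix this by writing to a temporary file and renaming it afterwards, instead of naming it first as I did. Here is how that can be done.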

import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
        # dir='.' keeps the temp file on the same filesystem, so os.rename cannot fail across devices
        with tempfile.NamedTemporaryFile('w', delete=False, dir='.') as fout:
            for j, line in enumerate(g, 1): # count number of lines in group
                if line is None:
                    j -= 1 # don't count this line
                    break
                fout.write(line)
        # name the file after the last line it actually contains
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))

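This time fillvalue=None, and I check each line for None. When it occurs, I know the last group has run out of real lines, so I subtract 1 from j so as not to count the filler, and then the file is renamed using the real line count.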
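
A different way to avoid fill values altogether is to read the file in slices with itertools.islice. This is a swapped-in technique rather than the grouper recipe above; it is only a sketch, and the names split_file and prefix are hypothetical.

```python
from itertools import islice

def split_file(path, lines_per_file=300, prefix='small_file_'):
    """Split `path` into pieces; return the list of file names written."""
    written = []
    last_line = 0  # number of the last line written so far
    with open(path) as bigfile:
        while True:
            # Take up to lines_per_file lines; the file keeps its position.
            chunk = list(islice(bigfile, lines_per_file))
            if not chunk:
                break  # nothing left to read
            last_line += len(chunk)
            name = '{}{}.txt'.format(prefix, last_line)
            with open(name, 'w') as smallfile:
                smallfile.writelines(chunk)
            written.append(name)
    return written

# Example call (assumes the file exists in the current directory):
# split_file('really_big_file.txt')
```

Because each file is named only after its chunk has been read, the last file is named after the real last line, and no temporary file or rename is needed.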

If you are using the first script in Python 3.x, replace `izip_longest` with the new `zip_longest`: https://docs.python.org/3/library/itertools.html#itertools.zip_longest (3 upvotes)

Answer 3

Rya*_*axe 6

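I do this in a more understandable way and with fewer shortcuts, to give you a better understanding of how and why it works. The previous answers work, but if you are not familiar with certain built-in functions, you will not understand what the code is doing.

Because you posted no code, I decided to do it this way, since you may be unfamiliar with anything beyond basic Python syntax; the way you phrased the question made it seem as though you had not tried anything and had no idea how to approach the problem.

Here are the steps to do this in basic Python:

First, you should read your file into a list for safekeeping: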

my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)

Second, you need to set up a way of creating the new files by name. I would suggest a loop along with a couple of counters:

outer_count = 1
line_count = 0
sorting = True
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"

Third, inside that loop you need some nested loops that save the correct rows into a list:

    hold_new_lines = []
    if left <= 300:  # last chunk; <= avoids an empty extra file when the total is a multiple of 300
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1

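Last thing: still inside the first loop, you need to write the new file and add your last counter increment, so that the loop goes around again and writes the next file: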

    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

Note: if the number of lines is not divisible by 300, the name of the last file will not correspond to its last line.
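
One way to make the last name match the last line is to cap the number at the total line count. The following is only a sketch of the arithmetic; chunk_names is a hypothetical helper and is not part of the script in this answer.

```python
def chunk_names(total_lines, size=300, prefix='small_file_'):
    """Return (first_line, last_line, file_name) for each chunk, 1-based."""
    chunks = []
    for first in range(1, total_lines + 1, size):
        last = min(first + size - 1, total_lines)  # cap at the real last line
        chunks.append((first, last, '{}{}.txt'.format(prefix, last)))
    return chunks

# A 700-line file gives two full chunks and one 100-line chunk.
for chunk in chunk_names(700):
    print(chunk)
```

For the 100000-line file in the question this gives 334 chunks, the last one covering lines 99901 to 100000.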

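It is important to understand why these loops work. You have set things up so that on each pass through the loop, the name of the file you write changes, because the name depends on a changing variable. This is a very useful scripting technique for accessing, opening, writing and organizing files.

In case you could not follow what belongs in which loop, here is the entire script: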

my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    if left <= 300:  # last chunk; <= avoids an empty extra file when the total is a multiple of 300
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

Archived:	12 years, 6 months ago
Views:	38571
Last activity:	6 years, 11 months ago