Related questions and solutions (0)

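Reading only specific lines

I am using a for loop to read a file, but I only want to read specific lines, say line 26 and line 30. Is there any built-in feature to achieve this?

Thanks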
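
A minimal sketch of one way to do this with a plain loop, assuming 1-based line numbers and a UTF-8 text file. The helper name `read_lines` is hypothetical, not something from the question. It stops reading as soon as the last wanted line has been seen.

```python
def read_lines(path, wanted):
    """Return {line_number: text} for the 1-based line numbers in `wanted`."""
    wanted = set(wanted)
    found = {}
    if not wanted:
        return found
    last = max(wanted)
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if number in wanted:
                found[number] = line.rstrip("\n")
            if number >= last:
                break  # every later line is unwanted, so stop early
    return found
```

Calling `read_lines("data.txt", [26, 30])` would return a dict keyed by 26 and 30 (the file name here is only a placeholder). For small files, the standard library's `linecache.getline(path, 26)` is a built-in alternative, though it loads the whole file into memory.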

python file line

3zz*_*zzy

2018 12-15

197
votes

11
answers

490k
views

Efficient way to shuffle data from different large files

For example, I have df1 and df2, which come from different domains:

import pandas as pd

df1 = pd.DataFrame({"question":["q1","q2"], "answer":["a1","a2"], "domain":"tech"})
df2 = pd.DataFrame({"question":["q3","q4"], "answer":["a3","a4"], "domain":"history"})

print(df1)
  question answer domain
0       q1     a1   tech
1       q2     a2   tech

print(df2)
  question answer   domain
0       q3     a3  history
1       q4     a4  history


What I want is the shuffled data:

print(shuffled1)
  question answer   domain
0       q3     a3  history
1       q1     a1     tech
print(shuffled2)
  question answer   domain
0       q2     a2     tech
1       q4     a4  history


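In the real world, I have more than 60 CSV files with the same structure from different domains. Each file has 50k records. They cannot all be read into memory at once.

What I want to do is feed these files into a BERT model for training, but if the model learns from "history" data for 10k steps and then from "tech" data for another 10k steps, it will perform poorly. So I want to shuffle the data across the files so that data from multiple domains is evenly distributed in each file.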
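
One possible approach, sketched under assumptions: a two-pass "scatter then shuffle" using only the standard library. Pass 1 streams every row into a randomly chosen output file, so only one row is in memory at a time. Pass 2 shuffles each output file on its own, which assumes a single output file does fit in memory. The function name `scatter_shuffle` and the `shuffled_{i}.csv` naming are hypothetical, and output files end up only approximately equal in size.

```python
import csv
import random
from pathlib import Path


def scatter_shuffle(input_paths, output_dir, n_outputs, seed=0):
    """Mix rows from many same-schema CSV files into n_outputs shuffled files."""
    rng = random.Random(seed)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    out_paths = [output_dir / f"shuffled_{i}.csv" for i in range(n_outputs)]

    # Pass 1: stream each input row into a randomly chosen output file.
    header = None
    handles = [open(p, "w", newline="", encoding="utf-8") for p in out_paths]
    try:
        writers = [csv.writer(h) for h in handles]
        for path in input_paths:
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # empty input file
                if header is None:
                    # Assumes all inputs share this header.
                    header = file_header
                    for w in writers:
                        w.writerow(header)
                for row in reader:
                    rng.choice(writers).writerow(row)
    finally:
        for h in handles:
            h.close()

    # Pass 2: shuffle each output file on its own (one file in memory at a time).
    for p in out_paths:
        with open(p, newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f))
        if not rows:
            continue
        body = rows[1:]
        rng.shuffle(body)
        with open(p, "w", newline="", encoding="utf-8") as f:
            w = csv.writer(f)
            w.writerow(rows[0])
            w.writerows(body)
    return out_paths
```

Because each row picks its destination independently, every output file receives roughly the same share of each domain. Without pass 2, rows inside an output file would still appear in input-file order, which is the clustering the question wants to avoid.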

python dataframe pandas

Daw*_*wei

2020 01-23

5
votes

1
answers

465
views

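Randomly sampling N lines from a large file in Python (no duplicate lines)

I need to get N lines from large txt files using Python. The files are basically tab-delimited tables. My task has the following constraints:

The files may contain headers (some files contain multi-line headers).
The headers need to appear in the output in the same order.
Each line can be used only once.
The largest file at the moment is about 150GB (about 6,000,000 lines).
Lines within a file are roughly the same length, but may differ between files.
I usually sample 5000 random lines (I may need up to 1 000 000 lines).

So far I have written the following code: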

inputSize=os.path.getsize(options.input)
usedPositions=[] #Start positions of the lines already in output

with open(options.input) as input:
    with open(options.output, 'w') as output:

        #Handling of header lines
        for i in range(int(options.header)):
            output.write(input.readline())
            usedPositions.append(input.tell())

        # Find and write all random lines, except last
        for j in range(int(args[0])):
            input.seek(random.randrange(inputSize)) # Seek to random position in file (probably middle of line)
            input.readline() # Read the line (probably incomplete). Next input.readline() results in a complete line.
            while input.tell() …
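
As a point of comparison with the seek-based code above, here is a sketch of a different technique, reservoir sampling (Algorithm R). It is not the asker's method. It reads the file once from start to finish, keeps at most `k` lines in memory, copies the header lines first, and can never pick the same line twice. The function name `sample_lines` and its parameters are hypothetical. The trade-off is that it has to scan the whole file, which is slow for a 150GB input, whereas random seeking touches only a small part of it.

```python
import random


def sample_lines(in_path, out_path, k, header_lines=0, seed=None):
    """Write the header plus k uniformly sampled data lines; return lines sampled."""
    rng = random.Random(seed)
    reservoir = []  # holds (data_line_index, line) pairs, at most k of them
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        # Copy the header lines unchanged and in their original order.
        for _ in range(header_lines):
            dst.write(src.readline())
        for i, line in enumerate(src):
            if i < k:
                reservoir.append((i, line))  # fill the reservoir first
            else:
                j = rng.randint(0, i)  # inclusive on both ends
                if j < k:
                    reservoir[j] = (i, line)  # replace a random slot
        reservoir.sort()  # restore original file order (optional)
        for _, line in reservoir:
            dst.write(line)
    return len(reservoir)
```

If the file has fewer than `k` data lines, all of them are written and the return value reports how many that was.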


python random line readline large-files

Fab*_*aze

2012 09-05

3
votes

1
answers

3365
views