Posts by Vin*_*hah

Code runs slowly - performance issue in Python

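I have a file with 4 columns of comma-separated values. I only need the first column, so I read the file, split each line on the delimiter, and store the first field in a list variable named first_file_list.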

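I have another file with 6 columns of comma-separated values. My requirement is to read the first column of each row in that file and check whether that string exists in the list named first_file_list. If it does, the row is copied to a new file.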

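My first file has approx. 6 million records and the second file has approx. 4.5 million records. Just to check the performance of my code, I put only 100k records in the second file instead of 4.5 million, and processing those 100k records takes approx. 2.5 hours.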

Following is my logic:

first_file_list = []

# Raw strings stop "\f" in the path from being read as a form-feed escape
with open(r"c:\first_file.csv") as first_f:
    next(first_f)  # Ignoring first row as it is header and I don't need that
    temp = first_f.readlines()
    for x in temp:
        first_file_list.append(x.split(',')[0])

with open(r"c:\second_file.csv") as second_f:
    next(second_f)
    second_file_co = second_f.readlines()

# "with" closes each file on exit, so no explicit close() calls are needed
with open(r"c:\output_file.csv", "a") as out_file:
    for x in second_file_co:
        if x.split(',')[0] in first_file_list:
            out_file.write(x)
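
A rough way to see where the time probably goes: the membership test `in first_file_list` on a Python list scans the elements one at a time, so a lookup that finds nothing has to look at every element. The sketch below is only an illustration of that cost model, not a measurement of the code above. `linear_scan_steps` is a hypothetical helper written for this example, and the row counts are the approximate figures quoted in the question.

```python
def linear_scan_steps(items, key):
    """Count how many elements a left-to-right scan inspects before it stops."""
    steps = 0
    for item in items:
        steps += 1
        if item == key:
            break  # a hit ends the scan early
    return steps


sample = ["a", "b", "c", "d"]
steps_for_hit = linear_scan_steps(sample, "b")   # stops once "b" is found
steps_for_miss = linear_scan_steps(sample, "z")  # has to inspect every element

# Upper bound on comparisons for the sizes quoted in the question
# (approximate figures: ~6 million keys, 100k lookups).
first_file_rows = 6_000_000
second_file_rows = 100_000
worst_case_comparisons = first_file_rows * second_file_rows

print(steps_for_hit, steps_for_miss, worst_case_comparisons)
```

With these sizes the upper bound is on the order of 6 × 10^11 string comparisons, which is consistent with a run time measured in hours.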


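Could you please help me understand what I am doing wrong here, so that my code takes this much time to compare 100k records? Or can you suggest a better way to do this in Python?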
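
One common approach to this kind of lookup, sketched here under assumptions rather than taken from the thread, is to keep the first-file keys in a `set` instead of a list, since set membership is a hash lookup that takes roughly constant time on average. The function name `filter_matching_rows`, the helper `first_column`, and the tiny temporary files are all hypothetical and exist only to make the example runnable. The output file is opened in append mode to mirror the original code.

```python
import os
import tempfile


def first_column(line):
    """Return the text before the first comma of a line."""
    return line.split(",", 1)[0]


def filter_matching_rows(first_path, second_path, out_path):
    """Copy rows of second_path whose first column also appears in first_path."""
    with open(first_path) as first_f:
        next(first_f, None)  # skip the header row
        keys = {first_column(line) for line in first_f}  # set, not list

    written = 0
    with open(second_path) as second_f, open(out_path, "a") as out_f:
        next(second_f, None)  # skip the header row
        for line in second_f:  # stream the file instead of readlines()
            if first_column(line) in keys:
                out_f.write(line)
                written += 1
    return written


# Small self-contained demo with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first_file.csv")
    second = os.path.join(tmp, "second_file.csv")
    out = os.path.join(tmp, "output_file.csv")
    with open(first, "w") as f:
        f.write("id,a,b,c\nk1,1,2,3\nk2,4,5,6\n")
    with open(second, "w") as f:
        f.write("id,a,b,c,d,e\nk2,1,2,3,4,5\nk9,1,2,3,4,5\nk1,6,7,8,9,0\n")
    copied = filter_matching_rows(first, second, out)
    with open(out) as f:
        result_lines = f.read().splitlines()

print(copied, result_lines)
```

The set costs extra memory for about 6 million strings, but each of the second file's rows is then checked with a single hash lookup instead of a scan over the whole list.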

python list file-comparison filecompare python-3.x

Vin*_*hah

2021 09-08

2
votes

1
answer

190
views