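I'm a Python beginner and have written a few basic scripts. My latest challenge is to split a very large CSV file (10 GB+) into multiple smaller files based on the value of a particular variable in each row.
For example, the file might look like this: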
Category,Title,Sales
"Books","Harry Potter",1441556
"Books","Lord of the Rings",14251154
"Series", "Breaking Bad",6246234
"Books","The Alchemist",12562166
"Movie","Inception",1573437
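I want to split the file into separate files: Books.csv, Series.csv, Movie.csv.
In reality there will be hundreds of categories, and they will not be sorted. In this case they are in the first column, but in the future they may not be.
I found a few solutions online, but none in Python. There is a very simple AWK command that does this in one line, but I don't have access to AWK at work.
I wrote the following code, which works, but I think it is probably very inefficient. Can anyone suggest how to speed it up?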
import csv

# Creates an empty set - this will be used to store the values that have already been used
filelist = set()

# Opens the large csv file in "read" mode
with open('//directory/largefile', 'r') as csvfile:
    # Read the first row of the large file and store the whole row as a string (headerstring)
    read_rows = csv.reader(csvfile)
    headerrow = next(read_rows)
    headerstring = ','.join(headerrow)
    for row in read_rows:
        # Store the whole row as a string (rowstring)
        rowstring = ','.join(row)
        # Defines filename as the first entry in the row - this could be made dynamic so that the user inputs a column name to use
        filename = row[0]
        # This basically makes sure it is not looking at the header row.
        if filename != "Category":
            # If the filename is not in the filelist set, add it to the set and create a new csv file with the header row.
            if filename not in filelist:
                filelist.add(filename)
                with open('//directory/subfiles/' + str(filename) + '.csv', 'a') as f:
                    f.write(headerstring)
                    f.write("\n")
            # Append the current row to the csv file for its category (this reopens the file for every row).
            with open('//directory/subfiles/' + str(filename) + '.csv', 'a') as f:
                f.write(rowstring)
                f.write("\n")
Thanks!
Answer (小智, 6 votes):
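I faced the same problem, which is what led me to this question, and I was able to solve it with Pandas.
The logic is in the code that follows; please check whether it works for your case: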
import pandas as pd

# Path to the large CSV file (the path used in the question)
filename = '//directory/largefile'
data = pd.read_csv(filename)
# Collect the distinct values of the Category column
data_category_range = data['Category'].unique().tolist()
for value in data_category_range:
    # Write the rows of each category to their own file
    data[data['Category'] == value].to_csv('Category_' + str(value) + '.csv', index=False, na_rep='N/A')
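A memory-efficient approach that also avoids repeatedly reopening files for appending (as long as you are not going to end up with a huge number of open file handles) is to use a dict that maps each category to a file object. If the file for a category has not been opened yet, create it and write the header; then always write the row to the corresponding file, for example: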
import csv

# newline='' is what the csv module recommends for files it reads or writes
with open('somefile.csv', newline='') as fin:
    csvin = csv.DictReader(fin)
    # Category -> open file lookup
    outputs = {}
    for row in csvin:
        cat = row['Category']
        # Open a new file and write the header
        if cat not in outputs:
            fout = open('{}.csv'.format(cat), 'w', newline='')
            dw = csv.DictWriter(fout, fieldnames=csvin.fieldnames)
            dw.writeheader()
            outputs[cat] = fout, dw
        # Always write the row
        outputs[cat][1].writerow(row)
    # Close all the files
    for fout, _ in outputs.values():
        fout.close()
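If hundreds of categories push you toward the open-file-handle limit mentioned above, one possible variation is to cap the number of simultaneously open files and reopen evicted ones in append mode. The following is only a sketch under assumptions: the function name `split_csv` and the parameters `column`, `out_dir` and `max_open` are hypothetical names chosen for illustration, and it assumes every category value is safe to use as a file name. Passing the column name also covers the case where the category is not in the first column.

```python
import csv
import os
from collections import OrderedDict


def split_csv(path, column="Category", out_dir=".", max_open=100):
    """Write each row of `path` to <out_dir>/<value of column>.csv.

    At most `max_open` output files are open at once; the least recently
    used one is closed and later reopened in append mode if needed again.
    """
    os.makedirs(out_dir, exist_ok=True)
    open_files = OrderedDict()  # category -> (file object, DictWriter)
    started = set()             # categories whose header is already written
    with open(path, newline="") as fin:
        reader = csv.DictReader(fin)
        try:
            for row in reader:
                cat = row[column]
                if cat in open_files:
                    # Mark this category as most recently used
                    open_files.move_to_end(cat)
                else:
                    if len(open_files) >= max_open:
                        # Evict and close the least recently used handle
                        _, (old_file, _) = open_files.popitem(last=False)
                        old_file.close()
                    is_new = cat not in started
                    # First sight: create the file; later: append to it
                    fout = open(os.path.join(out_dir, cat + ".csv"),
                                "w" if is_new else "a", newline="")
                    writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
                    if is_new:
                        writer.writeheader()
                        started.add(cat)
                    open_files[cat] = (fout, writer)
                # Always write the row to its category's file
                open_files[cat][1].writerow(row)
        finally:
            # Close whatever is still open, even if an error occurred
            for fout, _ in open_files.values():
                fout.close()
    return started
```

For example, `split_csv('somefile.csv', column='Category', out_dir='subfiles')` would write one file per category into `subfiles`. A smaller `max_open` trades speed for fewer open handles, since evicted files have to be reopened.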