I have a function set up with Pandas that runs through a large number of rows in input.csv and puts the results into a Series. It then writes the Series to output.csv.
However, if the process is interrupted (for example, by an unexpected event), the program terminates and all the data that would have gone into the csv is lost.
Is there a way to write the data to the csv continuously, regardless of whether the function finishes for all rows?
Ideally, each time the program starts, a blank output.csv would be created and appended to while the function runs.
import pandas as pd

df = pd.read_csv("read.csv")

def crawl(a):
    #Create x, y
    return pd.Series([x, y])

df[["Column X", "Column Y"]] = df["Column A"].apply(crawl)
df.to_csv("write.csv", index=False)
Tom*_*tel (15 votes):
Here is one possible solution: it appends the data to a new file as it reads the csv in chunks. If the process is interrupted, the new file will contain all the information read before the interruption.
import pandas as pd

#csv file to be read in
in_csv = '/path/to/read/file.csv'

#csv to write data to
out_csv = 'path/to/write/file.csv'

#get the number of lines of the csv file to be read
number_lines = sum(1 for row in open(in_csv))

#size of chunks of data to write to the csv
chunksize = 10

#start looping through data, writing it to a new file for each chunk
for i in range(1, number_lines, chunksize):
    df = pd.read_csv(in_csv,
                     header=None,
                     nrows=chunksize,  #number of rows to read at each loop
                     skiprows=i)       #skip rows that have already been read
    df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a',                #append data to csv file
              chunksize=chunksize)     #size of data to append for each loop
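A tidier variant of the same chunked-append idea uses the `chunksize` parameter of `pd.read_csv` itself, which returns an iterator of DataFrames and handles the row bookkeeping. This is a minimal self-contained sketch (it builds its own sample input, since the real file paths are placeholders):

```python
import pandas as pd

# Build a small sample input so the sketch is self-contained.
pd.DataFrame({"Column A": range(25)}).to_csv("read.csv", index=False)

# read_csv with chunksize yields one DataFrame per chunk; each chunk is
# written to the output as soon as it is read, so an interruption loses
# at most one chunk of work.
for i, chunk in enumerate(pd.read_csv("read.csv", chunksize=10)):
    chunk.to_csv("write.csv",
                 index=False,
                 mode="w" if i == 0 else "a",  # overwrite first, append after
                 header=(i == 0))              # write the header only once
```

Because each `to_csv` call opens and closes the file, a crash mid-run still leaves a valid, partially filled write.csv.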
In the end, this is what I came up with. Thanks for the help!
import pandas as pd

df1 = pd.read_csv("read.csv")
run = 0

def crawl(a):
    global run
    run = run + 1
    #Create x, y
    df2 = pd.DataFrame([[x, y]], columns=["X", "Y"])
    if run == 1:
        df2.to_csv("output.csv")
    if run != 1:
        df2.to_csv("output.csv", header=None, mode="a")

df1["Column A"].apply(crawl)
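The global `run` counter above only distinguishes the first write (with header) from later appends. A sketch of the same row-at-a-time append that checks the output file instead of a counter, so resumed runs also stay well-formed; the loop body is a hypothetical stand-in, since the original elides how x and y are computed:

```python
import os
import pandas as pd

def append_row(x, y, path="output.csv"):
    # Write the header only if the file does not exist yet, so the csv
    # stays valid whether this is a fresh run or a resumed one.
    row = pd.DataFrame([[x, y]], columns=["X", "Y"])
    row.to_csv(path, mode="a", index=False, header=not os.path.exists(path))

# Hypothetical stand-in for the crawl step: derive x and y from each value.
for a in [1, 2, 3]:
    append_row(a, a * a)
```

Each call opens, appends, and closes the file, so at most the row being written when an interruption hits is lost.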