Slow Python file I/O; Ruby runs better than this; did I pick the wrong language?

ros*_*ser 0 ruby python regex text

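Please advise - I will treat this as a learning point. I'm a beginner.

I'm splitting a 25 MB file into several smaller files.

A kind guru here gave me a Ruby script. It runs very fast. So, to learn, I mimicked it with a Python script. That one runs like a three-legged cat (slow). Can anyone tell me why?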

My Python script

## split a file into smaller files
###########################################
def splitlines (file) :
    fileNo=0001
    outFile=open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\%s.txt" % fileNo, 'a') ## open file to append
    fh = open(file, "r") ## open the file for reading
    mylines = fh.readlines() ### read in lines
    for line in mylines: ## for each line
        if re.search("Copyright ", line): # if the line matches the regex
            outFile.close()  ## close the file
            fileNo +=1  # and add one to the filename, starting to read lines in again
        else: # otherwise
            outFile=open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\%s.txt" % fileNo, 'a') ## open file to append
            outFile.write(line)  ## then append it to the open outFile
    fh.close()

The guru's Ruby 1.9 script

g=0001
f=File.open(g.to_s + ".txt","w")
open("corpus1.txt").each do |line|
  if line[/\d+ of \d+ DOCUMENTS/]
    f.close
    f=File.open(g.to_s + ".txt","w")
    g+=1
  end
  f.print line
end

Sve*_*ach 6

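There are several reasons why your script is slow. The main one is that you reopen the output file for almost every line. Because the old file object is implicitly closed when a new one is opened (due to Python garbage collection), the write buffer is flushed for every single line you write, which is quite expensive.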
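To see the effect in isolation, here is a rough sketch that writes the same lines twice: once reopening the file in append mode for every line, and once opening it a single time. The function names (`write_reopening`, `write_once`, `timed`) and the 2000-line sample are made up for illustration, and the actual timings will depend on your disk and OS, so treat the numbers as indicative only.

```python
import os
import tempfile
import time


def write_reopening(path, lines):
    """Reopen the output file in append mode for every line (the slow pattern)."""
    for line in lines:
        with open(path, "a") as out:
            out.write(line)  # buffer is flushed when the file closes, once per line


def write_once(path, lines):
    """Open the output file once and let Python buffer the writes."""
    with open(path, "a") as out:
        for line in lines:
            out.write(line)


def timed(func, lines):
    """Run func against a fresh temporary file; return (seconds, file contents)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        start = time.perf_counter()
        func(path, lines)
        elapsed = time.perf_counter() - start
        with open(path) as fh:
            return elapsed, fh.read()
    finally:
        os.remove(path)


# Hypothetical sample data, just large enough to make the difference visible.
sample = ["line %d\n" % i for i in range(2000)]
slow_time, slow_text = timed(write_reopening, sample)
fast_time, fast_text = timed(write_once, sample)
print("reopen per line: %.4fs, open once: %.4fs" % (slow_time, fast_time))
```

Both variants produce identical output; only the number of open/flush/close cycles differs.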

A cleaned-up and corrected version of your script would be

def file_generator():
    file_no = 1
    while True:
        f = open(r"C:\Users\dunner7\Desktop\Textomics\Media"
                 r"\LexisNexus\ele\newdocs\%s.txt" % file_no, 'a')
        yield f
        f.close()
        file_no += 1

def splitlines(filename):
    files = file_generator()
    out_file = next(files)
    with open(filename) as in_file:
        for line in in_file:
            if "Copyright " in line:
                out_file = next(files)
            out_file.write(line)
        out_file.close()
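If you want to try the idea without touching your real data, the following is a self-contained sketch of the same approach without the generator. The helper name `split_on_marker`, its parameters, the tiny corpus, and the temporary directory are all assumptions for the sake of the example. It also opens the outputs with `'w'` rather than `'a'`, which is my own choice here: with `'a'`, running the script twice would append to the files left over from the first run.

```python
import os
import tempfile


def split_on_marker(in_path, out_dir, marker="Copyright "):
    """Split in_path into numbered files in out_dir; a new file starts at each marker line.

    Hypothetical helper: the name, parameters, and 'w' mode are assumptions.
    Returns the number of files written.
    """
    file_no = 1
    out_file = open(os.path.join(out_dir, "%d.txt" % file_no), "w")
    try:
        with open(in_path) as in_file:
            for line in in_file:  # iterate lazily instead of readlines()
                if marker in line:
                    # Close the finished chunk before opening the next one,
                    # so only one output handle is open at a time.
                    out_file.close()
                    file_no += 1
                    out_file = open(os.path.join(out_dir, "%d.txt" % file_no), "w")
                out_file.write(line)
    finally:
        out_file.close()
    return file_no


# Usage on a tiny made-up corpus in a temporary directory.
work_dir = tempfile.mkdtemp()
corpus_path = os.path.join(work_dir, "corpus.txt")
with open(corpus_path, "w") as fh:
    fh.write("first doc\nCopyright 2010\nsecond doc\nCopyright 2011\nthird doc\n")

out_dir = os.path.join(work_dir, "newdocs")
os.mkdir(out_dir)
count = split_on_marker(corpus_path, out_dir)
print(count, sorted(os.listdir(out_dir)))
```

As in the version above, the marker line itself is written to the new file, and a plain substring test replaces `re.search` because the pattern contains no regex syntax.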