ros*_*ser 0 ruby python regex text
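Please critique this; I will treat it as a learning point. I am a beginner.
I am splitting a 25 MB file into several smaller files.
A kind guru here gave me a Ruby script, and it runs very fast. To learn, I imitated it with a Python script, but mine runs like a three-legged cat (slow). Can anyone tell me why?
My Python script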
##split a file into smaller files
###########################################
def splitlines (file) :
    fileNo=0001
    outFile=open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\%s.txt" % fileNo, 'a') ## open file to append
    fh = open(file, "r") ## open the file for reading
    mylines = fh.readlines() ### read in lines
    for line in mylines: ## for each line
        if re.search("Copyright ", line): # if the line is equal to the regex
            outFile.close() ## close the file
            fileNo +=1 #and add one to the filename, starting to read lines in again
        else: # otherwise
            outFile=open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\%s.txt" % fileNo, 'a') ## open file to append
            outFile.write(line) ## then append it to the open outFile
    fh.close()
The guru's Ruby 1.9 script
g=0001
f=File.open(g.to_s + ".txt","w")
open("corpus1.txt").each do |line|
  if line[/\d+ of \d+ DOCUMENTS/]
    f.close
    f=File.open(g.to_s + ".txt","w")
    g+=1
  end
  f.print line
end
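There are several reasons your script is slow. The main one is that you reopen the output file for almost every line. Because the old file object is implicitly closed when a new one is opened (a result of Python's garbage collection), the write buffer is flushed for every line you write, which is very expensive.
A cleaned-up and corrected version of your script would be: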
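If you want to see the cost for yourself, the following is a rough sketch, not a rigorous benchmark. It assumes Python 3, and the helper names (`write_reopening`, `write_once`, `timed`) and the temporary files are made up for illustration. Unlike your script, it closes the file explicitly after each line so the behaviour does not depend on garbage collection.

```python
import os
import tempfile
import time


def write_reopening(path, lines):
    # Reopen the file in append mode for every single line,
    # which forces a flush and a close per line.
    for line in lines:
        out = open(path, "a")
        out.write(line)
        out.close()


def write_once(path, lines):
    # Open the file once and let the write buffer batch the output.
    with open(path, "w") as out:
        for line in lines:
            out.write(line)


def timed(func, path, lines):
    # Return the wall-clock time taken by func(path, lines).
    start = time.perf_counter()
    func(path, lines)
    return time.perf_counter() - start


lines = ["line %d\n" % i for i in range(5000)]
with tempfile.TemporaryDirectory() as tmp:
    slow_path = os.path.join(tmp, "slow.txt")
    fast_path = os.path.join(tmp, "fast.txt")
    slow = timed(write_reopening, slow_path, lines)
    fast = timed(write_once, fast_path, lines)
    # Both approaches should produce exactly the same file contents.
    with open(slow_path) as a, open(fast_path) as b:
        same_output = a.read() == b.read()

print("reopen per line: %.3fs, open once: %.3fs" % (slow, fast))
```

The absolute numbers depend on your disk and operating system, so treat them only as an indication of the relative cost.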
def file_generator():
    file_no = 1
    while True:
        f = open(r"C:\Users\dunner7\Desktop\Textomics\Media"
                 r"\LexisNexus\ele\newdocs\%s.txt" % file_no, 'a')
        yield f
        f.close()
        file_no += 1

def splitlines(filename):
    files = file_generator()
    out_file = next(files)
    with open(filename) as in_file:
        for line in in_file:
            if "Copyright " in line:
                out_file = next(files)
            out_file.write(line)
    out_file.close()
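If the generator feels like too much machinery, the same idea (keep one output file open and switch only when the marker appears) can be written as a single function. This is only a sketch under a few assumptions: Python 3, a hypothetical function name `split_on_marker`, the output directory and marker passed in as parameters, and mode `'w'` instead of `'a'` so that a rerun overwrites rather than appends. As in the version above, the marker line becomes the first line of the new file.

```python
import os
import tempfile


def split_on_marker(in_path, out_dir, marker="Copyright "):
    """Split in_path into numbered files in out_dir; return the file count."""
    file_no = 1
    out_file = open(os.path.join(out_dir, "%d.txt" % file_no), "w")
    try:
        with open(in_path) as in_file:
            for line in in_file:
                if marker in line:
                    # Switch to the next output file; the marker line
                    # is written as the first line of that new file.
                    out_file.close()
                    file_no += 1
                    out_file = open(
                        os.path.join(out_dir, "%d.txt" % file_no), "w")
                out_file.write(line)
    finally:
        # Make sure the last output file is closed even on errors.
        out_file.close()
    return file_no


# Tiny demonstration on a throwaway corpus in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    corpus = os.path.join(tmp, "corpus.txt")
    with open(corpus, "w") as f:
        f.write("a\nb\nCopyright 2010\nc\nCopyright 2011\nd\n")
    out_dir = os.path.join(tmp, "out")
    os.mkdir(out_dir)
    count = split_on_marker(corpus, out_dir)
    with open(os.path.join(out_dir, "1.txt")) as f:
        first = f.read()
    with open(os.path.join(out_dir, "3.txt")) as f:
        last = f.read()
```

Here each output file is opened exactly once, so the write buffer is flushed once per document rather than once per line.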