Why does my code stop?

Bil*_*ill 2 python regex string-matching

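Hey, I've run into a problem: my program stops at record 57802 while iterating over a file, for a reason I can't figure out. I added a heartbeat section so I could see which line it was on, and that helped, but now I'm stuck on why it stops here. I thought it was a memory issue, but I just ran it on my computer with 6 GB of RAM and it still stops.

Is there a better way to do any of what I'm doing below? My goal is to read the file (a 15 MB text log; I can send it to you if you need it), find matches based on a regex, and print the matching lines. There's more to come, but that's as far as I've gotten. I'm using Python 2.6.

Any ideas would help, and so would comments on the code! I'm a Python noob and still learning.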

import sys, os, os.path, operator
import re, time, fileinput

infile = os.path.join("C:\\","Python26","Scripts","stdout.log")

start = time.clock()

filename  = open(infile,"r")

match = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} +\w+ +\[([\w.]+)\] ((\w+).?)+:\d+ - (\w+)_SEARCH:(.+)')

count = 0
heartbeat = 0
for line in filename:
    heartbeat = heartbeat + 1
    print heartbeat
    lookup = match.search(line)
    if lookup:
        count = count + 1
        print line
end = time.clock()
elapsed = end-start
print "Finished processing at:",elapsed,"secs. Count of records =",count,"."

filename.close()

Here is line 57802, where it fails:

2010-08-06 08:15:15,390 DEBUG [ah_admin] com.thg.struts2.SecurityInterceptor.intercept:46 - Action not SecurityAware; skipping privilege check.

Here is a line that matches:

2010-08-06 09:27:29,545 INFO  [patrick.phelan] com.thg.sam.actions.marketmaterial.MarketMaterialAction.result:223 - MARKET_MATERIAL_SEARCH:{"_appInfo":{"_appId":21,"_companyDivisionId":42,"_environment":"PRODUCTION"},"_description":"symlin","_createdBy":"","_fieldType":"GEO","_geoIds":["Illinois"],"_brandIds":[2883],"_archived":"ACTIVE","_expired":"UNEXPIRED","_customized":"CUSTOMIZED","_webVisible":"VISIBLE_ONLY"}

Sample data, just the first 5 lines:

2010-08-06 00:00:00,035 DEBUG [] com.thg.sam.jobs.PlanFormularyLoadJob.executeInternal:67 - Entered into PlanFormularyLoadJob: executeInternal
2010-08-06 00:00:00,039 DEBUG [] com.thg.ftpComponent.service.JScapeFtpService.open:153 - Opening FTP connection to sdrive/hibbert@tccfp01.hibbertnet.com:21
2010-08-06 00:00:00,040 DEBUG [] com.thg.sam.email.EmailUtils.sendEmail:206 - org.apache.commons.mail.MultiPartEmail@446e79
2010-08-06 00:00:00,045 DEBUG [] com.thg.sam.services.OrderService.getOrdersWithStatus:121 - Orders list size=13
2010-08-06 00:00:00,045 DEBUG [] com.thg.ftpComponent.service.JScapeFtpService.open:153 - Opening FTP connection to sdrive/hibbert@tccfp01.hibbertnet.com:21

Pau*_*bel 7

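What does the input line that's giving you trouble look like? I'd try printing it out. I suspect your CPU is pegged while this is running.

Nested regexps like the one you have can perform very badly when there isn't a quick match.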

((\w+).?)+:
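A rough way to see the cost for yourself is to time that sub-pattern alone on short inputs it cannot match. This is only a sketch: the helper name `time_failed_search` and the test strings (runs of the letter `a`) are made up for illustration, and the timings will vary by machine. The lengths are kept small on purpose, since the time tends to climb steeply with each extra character.

```python
import re
from timeit import default_timer

# The nested sub-pattern quoted above, taken from the question's regex.
nested = re.compile(r'((\w+).?)+:')

def time_failed_search(n):
    """Time one failed search over n word characters with no ':' in them."""
    text = 'a' * n  # hypothetical input: word characters only, no colon
    start = default_timer()
    result = nested.search(text)
    return result, default_timer() - start

# Small lengths only; each step up should take noticeably longer.
for n in (8, 10, 12, 14):
    result, elapsed = time_failed_search(n)
    print("%2d chars: match=%s, %.4f s" % (n, result, elapsed))
```

Every search returns `None`; the only thing that changes is how long the engine takes to give up.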

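Imagine a string that has no : in it, but is fairly long. You'll end up in a world of backtracking as the regex tries the various combinations of ways to divide the word characters between \w+ and .? and then tries to group them in every possible way. If you can be more specific in your pattern, it will pay off handsomely.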
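As one possible way to be more specific, here is a sketch that replaces the nested group with a single character class. It assumes the logger name is always a run of word characters and dots, which holds for the lines shown in the question but may not hold for the whole log. The names `pattern`, `stalling_line` and `matching_line` are mine, and the JSON payload of the matching line is shortened to keep the example readable.

```python
import re

# Assumption: the logger name is dotted words, so one character class
# [\w.]+ can stand in for the nested ((\w+).?)+ group.
pattern = re.compile(
    r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} +\w+ +\[([\w.]+)\] '
    r'([\w.]+):\d+ - (\w+)_SEARCH:(.+)'
)

# Line 57802 from the question: not a *_SEARCH record.
stalling_line = (
    '2010-08-06 08:15:15,390 DEBUG [ah_admin] '
    'com.thg.struts2.SecurityInterceptor.intercept:46 - '
    'Action not SecurityAware; skipping privilege check.'
)

# The matching line from the question, with its JSON payload shortened.
matching_line = (
    '2010-08-06 09:27:29,545 INFO  [patrick.phelan] '
    'com.thg.sam.actions.marketmaterial.MarketMaterialAction.result:223 - '
    'MARKET_MATERIAL_SEARCH:{"_description":"symlin"}'
)

# The non-matching line is now rejected without the long stall.
print(pattern.search(stalling_line))

hit = pattern.search(matching_line)
print("%s %s %s" % (hit.group(1), hit.group(2), hit.group(4)))
```

Because `[\w.]+` gives the engine only one way to consume the logger name, a line that fails later on (here at `_SEARCH`) is rejected quickly. Note that dropping the nested group renumbers the captures: the search type is now group 4 and the payload group 5.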