zak*_*191 0 python regex string
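I'm looking for a simple way to open a file and check each line for unclosed parentheses and quotes. If a line has unclosed parens or quotes, I want to write that line to a file. I know I could do this with an ugly chain of if/for statements, but I suspect Python has a better way, perhaps with the re module (which I know nothing about) or something else; I just don't know the language well enough to do it.
Thanks!
Edit: some sample lines. They may be easier to read if you copy them into Notepad or similar and turn off word wrap (some lines can be quite long). Also, the file has over 100k lines, so something efficient would be great!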
SL ID=0X14429A0B TY=STANDARD OWN=0X429A03 EXT=22 SLTK=0X1C429A0B MP=0X684003F0 SUB=0X24400007
RT ID=0X18429A19 TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0C CMDS=(N:0X8429A04,C:0X14429A0B) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
RT ID=0X18429A1A TY=CALONSC OWN=0X14429A0B EXLP=0X14429A08 CMDS=(R:0X8429A04,N:0X8429A05,C:0X14429A0B) SGCC=2 REL=2 DESC="AURANT YD TO TRK.1" ATIS=T
RT ID=0X18429A1B TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0A CMDS=(R:0X8429A04,R:0X8429A05,C:0X14429A0B) SGCC=2 REL=3 DESC="AURANT YD TO TRK.2" ATIS=T
SL ID=0X14429A0C TY=STANDARD OWN=0X429A03 EXT=24 SLTK=0X1C429A0B MP=0X684003F1 SUB=0X24400007
RT ID=0X18429A1C TY=CALONSC OWN=0X14429A0C EXLP=0X14429A0B CMDS=(N:0X8429A04,C:0X14429A0C) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
TK ID=0X1C429A08 TY=BLKTK OWN=0X429A03 EXT=12 LRMP=0X6C40BDAF LEN=5837 FSPD=60 PSPD=65 QUAL=TRK.1 MAXGE=0 MAXGW=0 JAL=4 ALT=12 SUB=0X24400007 RULES=(CTC:B:UP:0X24400007:485.7305:486.8359:T) LLON=-118.1766772 RLON=-118.1620059 LLAT=34.06838375 RLAT=34.07811764 LELE=416.6983 RELE=425.0596 ULAD=NO URAD=NO
PT ID=0X20429A0F TY=STANDARD OWN=0X1C429A08 LTK=0X1C40006C RTK=0X1C429A0C REL=1 LEN=1 LQUAL="TRK.1" RQUAL="TRK.1"
PTK OWN=0X1C429A08 PID=0X1C429A13
If you don't expect mismatched parens in reverse order (i.e. ")("), you can do this:
with open("myFile.txt", "r") as readfile, open("outFile.txt", "w") as outfile:
    for line in readfile:
        # Flag the line if the paren totals differ or the quote count is odd.
        if line.count("(") != line.count(")") or line.count('"') % 2 != 0:
            outfile.write(line)
Otherwise you'll have to count them one character at a time to check for a mismatch, like this:
with open("myFile.txt", "r") as readfile, open("outFile.txt", "w") as outfile:
    for line in readfile:
        count = 0
        for char in line:
            if char == ")":
                count -= 1
            elif char == "(":
                count += 1
            # A negative count means a ")" appeared before its "(".
            if count < 0:
                break
        if count != 0 or line.count('"') % 2 != 0:
            outfile.write(line)
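To make the difference between the two checks concrete, here is a small sketch that runs both on a few made-up strings. The helper names `counts_match` and `is_balanced` are illustrative only and are not part of the answer's code.

```python
def counts_match(line):
    """Count-only check: True when '(' and ')' totals agree and quotes pair up."""
    return line.count("(") == line.count(")") and line.count('"') % 2 == 0


def is_balanced(line):
    """Running-count check: also rejects a ')' that appears before its '('."""
    depth = 0
    for char in line:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and line.count('"') % 2 == 0


# Made-up sample values: well formed, reversed parens, missing close paren.
for sample in ["CMDS=(N:1,C:2)", "CMDS=)N:1,C:2(", "CMDS=(N:1,C:2"]:
    print(sample, counts_match(sample), is_balanced(sample))
```

The reversed case is the only one where the two checks disagree: the totals match, so the count-only version accepts it, while the running count goes negative on the first character after the equals sign.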
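I can't think of a better way to handle it. Python's built-in re module doesn't support recursive regular expressions, so a pure regex solution is right out.
One more thing: given your data, it may be better to put the check into a function and split each line into its values, which is easy to do with a regex, like this: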
import re

# Captures the value part of each KEY=value pair on a line.
splitre = re.compile(r".*?=(.*?)(?:(?=\s*?\S*?=)|(?=\s*$))")

def matchParens(text):
    # Returns True when text has unbalanced parens or an odd number of quotes.
    count = 0
    for char in text:
        if char == ")":
            count -= 1
        elif char == "(":
            count += 1
        if count < 0:
            break
    return count != 0 or text.count('"') % 2 != 0

with open("myFile.txt", "r") as readfile, open("outFile.txt", "w") as outfile:
    for line in readfile:
        if any(matchParens(text) for text in splitre.findall(line)):
            outfile.write(line)
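The reason this may be better is that it checks each key=value pair individually: if you have an open paren in one value and a close paren in a later one, it will still report the parens as unbalanced instead of treating them as a matched pair.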
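As a rough illustration of that point, the sketch below compares a whole-line check with a per-value check on a made-up line, `A=(1 B=2)`, where the paren opens in one value and closes in the next. The line and the helper name `is_unbalanced` are assumptions for the example; the pattern is the same one used in the answer.

```python
import re

# Same pattern as in the answer, written as a raw string.
splitre = re.compile(r".*?=(.*?)(?:(?=\s*?\S*?=)|(?=\s*$))")


def is_unbalanced(text):
    """True when text has unbalanced parens or an odd number of double quotes."""
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return True
    return depth != 0 or text.count('"') % 2 != 0


line = "A=(1 B=2)"  # made-up line: "(" opens in one value, ")" closes in the next
values = splitre.findall(line)
whole_line_flagged = is_unbalanced(line)
per_value_flagged = any(is_unbalanced(value) for value in values)
print(values, whole_line_flagged, per_value_flagged)
```

Checked as a whole, the line looks balanced; checked value by value, both `(1` and `2)` are flagged.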
Using a parser package may seem like overkill, but it is quick:
text = """\
SL ID=0X14429A0B TY=STANDARD OWN=0X429A03 EXT=22 SLTK=0X1C429A0B MP=0X684003F0 SUB=0X24400007
RT ID=0X18429A19 TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0C CMDS=(N:0X8429A04,C:0X14429A0B) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
RT ID=0X18429A1A TY=CALONSC OWN=0X14429A0B EXLP=0X14429A08 CMDS=(R:0X8429A04,N:0X8429A05,C:0X14429A0B) SGCC=2 REL=2 DESC="AURANT YD TO TRK.1" ATIS=T
RT ID=0X18429A1B TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0A CMDS=(R:0X8429A04,R:0X8429A05,C:0X14429A0B) SGCC=2 REL=3 DESC="AURANT YD TO TRK.2" ATIS=T
SL ID=0X14429A0C TY=STANDARD OWN=0X429A03 EXT=24 SLTK=0X1C429A0B MP=0X684003F1 SUB=0X24400007
RT ID=0X18429A1C TY=CALONSC OWN=0X14429A0C EXLP=0X14429A0B CMDS=(N:0X8429A04,C:0X14429A0C) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
TK ID=0X1C429A08 TY=BLKTK OWN=0X429A03 EXT=12 LRMP=0X6C40BDAF LEN=5837 FSPD=60 PSPD=65 QUAL=TRK.1 MAXGE=0 MAXGW=0 JAL=4 ALT=12 SUB=0X24400007 RULES=(CTC:B:UP:0X24400007:485.7305:486.8359:T) LLON=-118.1766772 RLON=-118.1620059 LLAT=34.06838375 RLAT=34.07811764 LELE=416.6983 RELE=425.0596 ULAD=NO URAD=NO
PT ID=0X20429A0F TY=STANDARD OWN=0X1C429A08 LTK=0X1C40006C RTK=0X1C429A0C REL=1 LEN=1 LQUAL="TRK.1" RQUAL="TRK.1"
PTK OWN=0X1C429A08 PID=0X1C429A13 GOOD
PTK OWN=0X1C429A(08 PID=0X1C429A13 BAD
PTK OWN=0X1C429A08 )PID=0X1C429A13 BAD
PTK OWN=0X1C(42(9A))08 PID=0X1C429A13 GOOD
PTK OWN=0X1C(42(9A))08 PID=0X1C42(9A13 BAD
PTK OWN=0X1C(42(9A))08 PID=0X1C42"("9A13 GOOD
"""
from pyparsing import nestedExpr, quotedString

paired_exprs = nestedExpr('(', ')') | quotedString

for i, line in enumerate(text.splitlines(), start=1):
    # use pyparsing expression to strip out properly nested quotes/parentheses
    stripped_line = paired_exprs.suppress().transformString(line)
    # if there are any quotes or parentheses left, they were not
    # properly nested
    if any(unwanted in stripped_line for unwanted in '()"\''):
        print("%d : %s" % (i, line))
Prints:
10 : PTK OWN=0X1C429A(08 PID=0X1C429A13 BAD
11 : PTK OWN=0X1C429A08 )PID=0X1C429A13 BAD
13 : PTK OWN=0X1C(42(9A))08 PID=0X1C42(9A13 BAD
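For comparison, here is a minimal plain-Python sketch of a single pass that, like the pyparsing version, treats parentheses inside double quotes as literal text. It is not taken from either answer: the function name `has_unclosed` is hypothetical, and it assumes only double quotes delimit strings and that quotes are never escaped.

```python
def has_unclosed(line):
    """Return True if the line has an unmatched parenthesis or double quote.

    Assumption: text between double quotes is literal, so parentheses
    inside quotes are ignored; escaped quotes are not handled.
    """
    depth = 0
    in_quotes = False
    for char in line:
        if char == '"':
            in_quotes = not in_quotes
        elif in_quotes:
            continue  # skip everything inside a quoted string
        elif char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return True  # a ")" with no "(" before it
    return depth != 0 or in_quotes


samples = [
    'PTK OWN=0X1C429A(08 PID=0X1C429A13 BAD',
    'PTK OWN=0X1C(42(9A))08 PID=0X1C42"("9A13 GOOD',
]
for sample in samples:
    print(has_unclosed(sample), sample)
```

Because it walks each line once and keeps only two pieces of state, this should scale to a file of 100k lines without trouble. Unlike the pyparsing expression, it does not recognise single-quoted strings.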