ays*_*sha 10 python regex whitespace python-2.7 shlex
I have a .txt file (saved from a website as preformatted text) in which the data looks like this:
B, NICKOLAS        CT144531X    D1026    JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL    JA-15-0050   D0015    JUDGE EDWARD A ROBERTS
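I want to remove all the extra whitespace between the columns (it is a varying number of spaces, not tabs). I also want to replace it with a delimiter (tab or pipe, since the data itself contains commas), like this: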
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
Looking around, the best options seem to be splitting with a regex or with shlex. I found two similar scenarios.
You can apply the regex '\s{2,}' (two or more whitespace characters) to each line and replace every match with a single '|' character.
>>> import re
>>> line = 'ANDREWS VS BALL    JA-15-0050   D0015    JUDGE EDWARD A ROBERTS    '
>>> re.sub(r'\s{2,}', '|', line.strip())
'ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS'
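Stripping any leading and trailing whitespace from the line before applying re.sub ensures that you don't get '|' characters at the start and end of the line.
Your actual code should look something like this: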
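To make the effect of strip() concrete, here is a small sketch comparing the two calls. The sample line, with its leading and trailing padding, is made up for illustration and is not taken from the original file.

```python
import re

# Hypothetical sample line with leading and trailing padding (assumption).
line = '  ANDREWS VS BALL    JA-15-0050   D0015  JUDGE EDWARD A ROBERTS   \n'

# Without strip(), the padding at both ends also matches \s{2,}.
without_strip = re.sub(r'\s{2,}', '|', line)

# With strip(), only the gaps between columns are replaced.
with_strip = re.sub(r'\s{2,}', '|', line.strip())

print(repr(without_strip))
print(repr(with_strip))
```

Note that \s also matches the newline, so the trailing spaces plus '\n' collapse into one '|' in the unstripped version.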
import re
with open(filename) as f:
    for line in f:
        subbed = re.sub(r'\s{2,}', '|', line.strip())
        # do something here
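One way to fill in the "do something here" step is sketched below. The helper name to_pipe_delimited, the decision to skip blank lines, and the in-memory sample list are all assumptions made for illustration; they are not part of the original answer.

```python
import re

def to_pipe_delimited(lines):
    """Yield each non-blank line with runs of 2+ whitespace replaced by '|'."""
    for line in lines:
        stripped = line.strip()
        if stripped:  # skipping blank lines is an assumption about the desired behavior
            yield re.sub(r'\s{2,}', '|', stripped)

# Hypothetical in-memory stand-in for the lines of the .txt file.
sample = [
    'B, NICKOLAS      CT144531X    D1026   JUDGE ANNIE WHITE JOHNSON\n',
    '\n',
    'ANDREWS VS BALL  JA-15-0050   D0015   JUDGE EDWARD A ROBERTS\n',
]

converted = list(to_pipe_delimited(sample))
for row in converted:
    print(row)
```

Because an open file object is also an iterable of lines, the same helper could be called as to_pipe_delimited(f) inside the with block, and each yielded row written to an output file followed by a newline.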
How about this?
import re
your_string = 'ANDREWS VS BALL    JA-15-0050   D0015    JUDGE EDWARD A ROBERTS'
print(re.sub(r'\s{2,}', '|', your_string.strip()))
Output:
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
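Explanation:
I used re.sub() with three arguments: a pattern, the replacement string, and the string you want to process.
The pattern matches every run of at least two whitespace characters, each run is replaced with a single |, and the substitution is applied to your string.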
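Since the question also mentions splitting and a tab delimiter, here is a minimal sketch of the same pattern used with re.split instead of re.sub. This is an alternative under the same assumption (columns are separated by two or more whitespace characters), not something taken from the answers.

```python
import re

line = 'ANDREWS VS BALL    JA-15-0050   D0015    JUDGE EDWARD A ROBERTS'

# Split on runs of 2+ whitespace to get the columns as a list.
fields = re.split(r'\s{2,}', line.strip())

# Join with whichever delimiter is wanted, for example a tab or a pipe.
tab_joined = '\t'.join(fields)
pipe_joined = '|'.join(fields)

print(fields)
print(pipe_joined)
```

Having the fields as a list makes it easy to validate the column count per line before writing anything out.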
s = """B, NICKOLAS        CT144531X    D1026    JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL    JA-15-0050   D0015    JUDGE EDWARD A ROBERTS
"""
# Update
re.sub(r"(\S)\ {2,}(\S)(\n?)", r"\1|\2\3", s)
In [71]: print re.sub(r"(\S)\ {2,}(\S)(\n?)", r"\1|\2\3", s)
B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS
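One caveat with the capturing-group pattern above: each match consumes the non-space character on both sides of the gap, so a column that is only one character wide cannot also start the next match. A possible variant, sketched here as an assumption rather than as part of the original answer, uses lookaround assertions so that only the spaces themselves are consumed. The one-character sample string is hypothetical.

```python
import re

# Hypothetical input with a single-character middle column.
s = "A  B  C\n"

# Capturing groups: "B" is consumed by the first match, so the second gap is missed.
capturing = re.sub(r"(\S)\ {2,}(\S)(\n?)", r"\1|\2\3", s)

# Lookarounds: only the run of spaces is consumed, so every gap is replaced.
lookaround = re.sub(r"(?<=\S) {2,}(?=\S)", "|", s)

print(repr(capturing))
print(repr(lookaround))
```

For the sample data in the question, where every column is longer than one character, both patterns give the same result.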