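I want to parse incoming CSV-like data rows. Values are separated by commas (there may be leading and trailing whitespace around the commas) and can be quoted with either ' or ". For example, this is a valid row: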
data1, data2 ,"data3'''", 'data4""',,,data5,
but this one is malformed:
data1, data2, da"ta3", 'data4',
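- quotes may only be preceded or followed by whitespace.
Such malformed rows should be recognized. Ideally the malformed value would somehow be marked within the row, but it is also acceptable if the regex simply fails to match the whole row.
I'm trying to write a regex that can parse this, using either findall() or match(), but every regex I come up with has problems with some edge case.
So, maybe someone with experience parsing something similar could help me? (Or is this too complex for a regex, and should I just write a function?)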
EDIT1:
The csv module is not much use here:
>>> import csv
>>> from io import StringIO
>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2',''')))
[['2', ' "dat', 'a1"', " 'dat", "a2'", '']]
>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2',''')))
[['2', 'dat,a1', "'dat", "a2'", '']]
- unless it can be tuned?
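On the tuning question, here is a minimal sketch of how far the standard csv options seem to go. `skipinitialspace` is a real csv formatting parameter; the sample line is the one from the session above. Since a reader accepts only a single `quotechar`, the single-quoted value is still split at its embedded comma, so this covers only part of the format.

```python
import csv
from io import StringIO

line = '''2, "dat,a1", 'dat,a2','''

# skipinitialspace=True ignores whitespace right after each delimiter,
# so the double-quoted value is now recognised as quoted.
# quotechar stays at its default ("), which is why the single-quoted
# value is still treated as plain text and split at its comma.
rows = list(csv.reader(StringIO(line), skipinitialspace=True))
print(rows)
# [['2', 'dat,a1', "'dat", "a2'", '']]
```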
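EDIT2: some language edits; I hope it is clearer now.
EDIT3: Thanks for all the answers. I'm now fairly sure a regex is not a good idea here, because (1) covering all the edge cases can be tricky and (2) the writer's output is not regular. Having written that, I've decided to check out the mentioned pyparsing and either use it, or write a custom FSM-like parser.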
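For illustration, a custom FSM-like parser for this format might look roughly like the sketch below. The function name `parse_line` and the state names are hypothetical. It also rests on assumptions the question does not settle: there is no escape syntax inside quotes, a trailing comma produces a trailing empty value, and whitespace inside an unquoted value is kept.

```python
def parse_line(line):
    """Split one row into values; raise ValueError on a malformed value."""
    values = []
    buf = []          # characters of the value currently being read
    state = "start"   # one of: start, bare, quoted, after_quote
    quote = None      # quote character that opened the current value

    # A sentinel comma is appended so the last value is flushed like the others.
    for pos, ch in enumerate(line + ","):
        if state == "start":
            if ch == ",":
                values.append("")              # empty value
            elif ch in "'\"":
                quote, state = ch, "quoted"
            elif not ch.isspace():             # leading whitespace is skipped
                buf.append(ch)
                state = "bare"
        elif state == "bare":
            if ch == ",":
                values.append("".join(buf).rstrip())
                buf, state = [], "start"
            elif ch in "'\"":
                raise ValueError(f"unexpected quote at column {pos}")
            else:
                buf.append(ch)
        elif state == "quoted":
            if ch == quote:
                state = "after_quote"
            else:
                buf.append(ch)                 # commas and the other quote are literal
        else:  # after_quote: only whitespace may precede the next comma
            if ch == ",":
                values.append("".join(buf))
                buf, state = [], "start"
            elif not ch.isspace():
                raise ValueError(f"unexpected character at column {pos}")

    if state != "start":
        raise ValueError("unterminated quoted value")
    return values


print(parse_line(r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""))
```

Because the error carries the column where parsing stopped, the caller can point at the offending value instead of only rejecting the whole row.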
Max*_*keh 12
Although the csv module is the right answer here, a regex that can do this is quite feasible:
import re
r = re.compile(r'''
    \s*                  # Any whitespace.
    (                    # Start capturing here.
      [^,"']+?           # Either a series of non-comma non-quote characters.
      |                  # OR
      "(?:               # A double-quote followed by a string of characters...
          [^"\\]|\\.     # That are either non-quotes or escaped...
       )*                # ...repeated any number of times.
      "                  # Followed by a closing double-quote.
      |                  # OR
      '(?:[^'\\]|\\.)*'  # Same as above, for single quotes.
    )                    # Done capturing.
    \s*                  # Allow arbitrary space before the comma.
    (?:,|$)              # Followed by a comma or the end of a string.
    ''', re.VERBOSE)
line = r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""
print(r.findall(line))
# That prints: ['data1', 'data2', '"data3\'\'\'"', '\'data4""\'', 'data5']
EDIT: To validate rows, you can reuse the regex above with a small addition:
import re
r_validation = re.compile(r'''
    ^(?:    # Anchor at the start (non-capturing group).
      # Below is the same regex as above, but condensed.
      # One tiny modification is that it allows empty values:
      # the first plus is replaced by an asterisk.
      \s*([^,"']*?|"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')\s*(?:,|$)
    )*$     # And don't stop until the end.
    ''', re.VERBOSE)
line1 = r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""
line2 = r"""data1, data2, da"ta3", 'data4',"""
if r_validation.match(line1):
    print('Line 1 is valid.')
else:
    print('Line 1 is INvalid.')
if r_validation.match(line2):
    print('Line 2 is valid.')
else:
    print('Line 2 is INvalid.')
# Prints:
# Line 1 is valid.
# Line 2 is INvalid.
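The validation regex only says whether the whole row matches. If the malformed value itself should be marked, one possible approach is to apply the same per-value pattern one value at a time and report where matching stops. This is a sketch, not part of the regex above: `FIELD` and `first_malformed` are hypothetical names. The reported text runs up to the next comma, so it is only approximate when the bad value itself contains a comma.

```python
import re

# Same per-value pattern as the validation regex (empty values allowed),
# without the outer repetition so it can be applied one value at a time.
FIELD = re.compile(r'''
    \s*
    (?: "(?:[^"\\]|\\.)*"     # double-quoted value
      | '(?:[^'\\]|\\.)*'     # single-quoted value
      | [^,"']*?              # bare, possibly empty, value
    )
    \s*
    (?:,|$)
    ''', re.VERBOSE)


def first_malformed(line):
    """Return (offset, text) of the first value that fails to match, or None."""
    pos = 0
    while pos < len(line):
        m = FIELD.match(line, pos)   # try to match one value starting at pos
        if m is None:
            end = line.find(",", pos)
            end = len(line) if end == -1 else end
            return pos, line[pos:end].strip()
        pos = m.end()                # continue right after the consumed comma
    return None


line2 = r"""data1, data2, da"ta3", 'data4',"""
print(first_malformed(line2))
# (13, 'da"ta3"')
```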