I am trying to handle unmatched double quotes in a CSV-formatted string.
To be precise,
"It "does "not "make "sense", Well, "Does "it"
should be corrected to
"It" "does" "not" "make" "sense", Well, "Does" "it"
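So basically, what I am trying to do is replace every '"' that is
- not preceded by the beginning of a line or a comma, and
- not followed by a comma or the end of a line
with '" "'.
For that, I use the regex below: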
(?<!^|,)"(?!,|$)
The problem is that while Ruby regex engines (http://www.rubular.com/) are able to parse this regex, Python regex engines (https://pythex.org/, http://www.pyregex.com/) throw the following error:
Invalid regular expression: look-behind requires fixed-width pattern
and with Python 2.7.3 it throws
sre_constants.error: look-behind requires fixed-width pattern
Can anyone tell me what is wrong with Python here?
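The error is not tied to the online testers or to Python 2.7.3. Here is a minimal sketch for reproducing it, assuming a CPython whose standard `re` module still rejects variable-width lookbehinds; the helper name `compile_error` is made up for illustration.

```python
import re

def compile_error(pattern):
    """Return the error message if `pattern` fails to compile, else None."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# '^' is zero characters wide and ',' is one character wide, so the two
# lookbehind branches have different widths and the pattern is rejected.
print(compile_error(r'(?<!^|,)"(?!,|$)'))

# The lookahead may mix widths freely; only the lookbehind is restricted.
print(compile_error(r'"(?!,|$)'))
```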
================================================== ================================
Following Tim's response, I got the output below for a multi-line string:
>>> str = """ "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it" """
>>> re.sub(r'\b\s*"(?!,|$)', '" "', str)
' "It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" " '
At the end of each line, two double quotes were added next to 'it'.
So I made a very small change to the call so that it handles newlines:
re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)
But this gives the output
>>> re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)
' "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it" " '
Now only the last 'it' has two double quotes.
But I wonder why the '$' end-of-line anchor does not recognize that the line has ended.
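One way to look into this, offered as a sketch rather than a definitive diagnosis, is to list the offsets at which `$` can match. The short string below is a made-up stand-in for the input above; like that input, its last line ends in a space.

```python
import re

# Hypothetical stand-in for the multi-line input: two lines, the last one
# ending in a space (like the ' """' that closes the original string).
text = 'a"\nb" '

# Without flags, '$' matches only at the very end of the string
# (or just before a single trailing newline, which this string lacks).
default_positions = [m.start() for m in re.finditer(r'$', text)]

# With re.MULTILINE, '$' additionally matches just before every newline.
multiline_positions = [m.start() for m in re.finditer(r'$', text, flags=re.MULTILINE)]

print(default_positions)
print(multiline_positions)

# Index of the last quote and the length of the string, for comparison.
print(text.index('"', 3), len(text))
```

The offset immediately after the last quote appears in neither list, because a space still separates that quote from the end of the string. That is consistent with the lookahead `(?!,|$)` still succeeding there.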
================================================== ================================
The final answer is
re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', str,flags=re.MULTILINE)
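As a quick check of that final pattern, here is a sketch on a two-line version of the input (shortened from four lines for brevity; the variable names are illustrative). The `[ \t]*` lets the lookahead skip trailing spaces or tabs before `$`.

```python
import re

# Two-line stand-in for the four-line input; note the leading and trailing space.
line = '"It "does "not "make "sense", Well, "Does "it"'
text = ' ' + line + '\n' + line + ' '

pattern = re.compile(r'\b\s*"(?!,|[ \t]*$)', flags=re.MULTILINE)
fixed = pattern.sub('" "', text)

# Show each resulting line separately so stray quotes are easy to spot.
for row in fixed.split('\n'):
    print(repr(row))
```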
Wik*_*żew 32
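Python lookbehinds really do need to be fixed-width. When a lookbehind pattern contains alternatives of different lengths, there are several ways to handle the situation:
- Rewrite the pattern so that no alternation is needed, e.g. use (?<=[^,])"(?!,|$), an exact equivalent of your current pattern that requires a character other than a comma before the double quote. Likewise, a common pattern for matching words enclosed by whitespace, (?<=\s|^)\w+(?=\s|$), can be written as (?<!\S)\w+(?!\S).
- Split the lookbehinds: a positive lookbehind such as (?<=a|bc) should be rewritten as a group of alternated lookbehinds, (?:(?<=a)|(?<=bc)), while negative lookbehinds can simply be concatenated, e.g. (?<!^|,)"(?!,|$) should look like (?<!^)(?<!,)"(?!,|$).
Tim*_*ker 17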
Python lookbehind assertions need to be fixed-width, but you can try this:
>>> s = '"It "does "not "make "sense", Well, "Does "it"'
>>> re.sub(r'\b\s*"(?!,|$)', '" "', s)
'"It" "does" "not" "make" "sense", Well, "Does" "it"'
Explanation:
\b # Start the match at the end of a "word"
\s* # Match optional whitespace
" # Match a quote
(?!,|$) # unless it's followed by a comma or end of string
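For comparison, the lookbehind rewrites described in the first answer can be sanity-checked in the same way. This is only a sketch: the helper `quote_positions` and the sample strings are illustrative and not part of either answer.

```python
import re

sample = '"It "does "not "make "sense", Well, "Does "it"'

def quote_positions(pattern, text):
    """Return the start offsets of every match of `pattern` in `text`."""
    return [m.start() for m in re.finditer(pattern, text)]

# Negative lookbehind split into two fixed-width assertions.
split = quote_positions(r'(?<!^)(?<!,)"(?!,|$)', sample)

# Single positive lookbehind for "any character except a comma".
single = quote_positions(r'(?<=[^,])"(?!,|$)', sample)

print(split == single, split)

# Positive lookbehind with branches of different widths, rewritten as a
# group of separate fixed-width lookbehinds.
print(quote_positions(r'(?:(?<=a)|(?<=bc))x', 'ax bcx cx'))
```

Both rewrites select the same quotes on this sample. They match only the quote character itself, whereas the `\b\s*"` pattern also consumes the whitespace before it, so the replacement string would have to be chosen accordingly.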