Python regex doesn't work as expected

Question

I carefully crafted this regular expression:

<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>

to parse the following RSS feed:

<?xml version="1.0" encoding="UTF-8"?>\n<feed version="0.3" xmlns="http://purl.org/atom/ns#">\n<title>Gmail - Inbox for g.bargelli@gmail.com</title>\n<tagline>New messages in your Gmail Inbox</tagline>\n<fullcount>2</fullcount>\n<link rel="alternate" href="http://mail.google.com/mail" type="text/html" />\n<modified>2011-03-15T11:07:48Z</modified>\n<entry>\n<title>con due mail...</title>\n<summary>Gianluca Bargelli http://about.me/proudlygeek/bio</summary>\n<link rel="alternate" href="http://mail.google.com/mail?account_id=g.bargelli@gmail.com&amp;message_id=12eb9332c2c1fa27&amp;view=conv&amp;extsrc=atom" type="text/html" />\n<modified>2011-03-15T11:07:42Z</modified>\n<issued>2011-03-15T11:07:42Z</issued>\n<id>tag:gmail.google.com,2004:1363345158434847271</id>\n<author>\n<name>me</name>\n<email>g.bargelli@gmail.com</email>\n</author>\n</entry>\n<entry>\n<title>test nuova mail</title>\n<summary>Gianluca Bargelli sono tornato!?& http://about.me/proudlygeek/bio</summary>\n<link rel="alternate" href="http://mail.google.com/mail?account_id=g.bargelli@gmail.com&amp;message_id=12eb93140d9f7627&amp;view=conv&amp;extsrc=atom" type="text/html" />\n<modified>2011-03-15T11:05:36Z</modified>\n<issued>2011-03-15T11:05:36Z</issued>\n<id>tag:gmail.google.com,2004:1363345026546890279</id>\n<author>\n<name>me</name>\n<email>g.bargelli@gmail.com</email>\n</author>\n</entry>\n</feed>\n

The problem is that I don't get any matches using Python's re module:

import re

regex = re.compile("""<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>""")
regex.findall(rss_string) # Returns an empty list

With an online regex tester (such as this one) it works as expected, so I don't think the regex itself is the problem.

Edit

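I am well aware that using a regex to parse a context-free grammar is a bad idea, but in my case the regex only needs to work for that one RSS feed (a Gmail inbox feed, by the way), and I know I could use an external library or XML parser for this task: this is just an exercise, not a habit.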

The question should be: why doesn't the following regex work as expected in Python?

Answer 1

dap*_*wit 5

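Python processes backslash escapes in a string literal before the regex compiler ever sees the string, so you have to escape the backslash twice (e.g. \\\\n for \\n). In your ordinary string literal, \\n is reduced to \n, which the regex engine then reads as a line break rather than as a literal backslash followed by n. However, Python has a convenient notation for this kind of thing: just prefix the string with an r: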

regex = re.compile(r"""<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>""")
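
To make the difference concrete, here is a minimal sketch comparing the two literals. It assumes, as the feed dump in the question suggests, that the text contains the two characters backslash and n rather than real line breaks; rss_string below is a shortened, hypothetical stand-in for the real feed, and the variable names are illustrative only.

```python
import re

# Hypothetical, shortened stand-in for the feed. It is a raw string, so each
# \n below is two characters (a backslash and an "n"), not a line break.
rss_string = r"<feed>\n<entry>\n<title>test nuova mail</title>\n</entry>\n</feed>\n"

# Ordinary literal: Python reduces \\n to \n before re.compile sees it, and the
# regex engine then looks for a real line break. (\\w is written out here to
# avoid the invalid-escape warning; the resulting pattern is the same as the
# one in the question.)
plain_pattern = re.compile("<entry>\\n<(\\w+)>(.+?)</\\w+>\\n</entry>")

# Raw literal: the backslashes reach the regex engine untouched, so \\n means
# "a literal backslash followed by n".
raw_pattern = re.compile(r"<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>")

plain_matches = plain_pattern.findall(rss_string)
raw_matches = raw_pattern.findall(rss_string)

print(plain_matches)  # no match: the sample contains no real line breaks
print(raw_matches)    # one (tag, text) tuple for the <title> element
```

Printing plain_pattern.pattern and raw_pattern.pattern is a quick way to check what the regex engine actually receives in each case.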

By the way, I agree with the others here: don't use regular expressions to parse XML. Still, I hope you'll find this string notation useful for future regexes.
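
For comparison, a parser-based version might look like the sketch below. It uses the standard library's xml.etree.ElementTree on a hypothetical, well-formed sample modeled on the feed in the question (real line breaks, placeholder summaries); the actual feed text would need its literal \n sequences and the bare & handled first, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modeled on the feed in the question.
sample_feed = """<feed version="0.3" xmlns="http://purl.org/atom/ns#">
<entry>
<title>con due mail...</title>
<summary>first message</summary>
</entry>
<entry>
<title>test nuova mail</title>
<summary>second message</summary>
</entry>
</feed>
"""

# The feed declares a default namespace, so element lookups must be qualified.
# The "atom" prefix is an arbitrary label chosen for this sketch.
ns = {"atom": "http://purl.org/atom/ns#"}

root = ET.fromstring(sample_feed)
titles = [
    entry.findtext("atom:title", namespaces=ns)
    for entry in root.findall("atom:entry", ns)
]

print(titles)
```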
