Python regex to extract HTML paragraphs

Question

Cur*_*ous -2 html python regex html-parsing

I am trying to extract paragraphs from HTML using the following line of code:

paragraphs = re.match(r'<p>.{1,}</p>', html)

But it returns nothing, even though I know the HTML contains paragraphs. Why?

Answer 1

ale*_*cxe 11

Why not use an HTML parser to parse the HTML? An example using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
...     <div>
...         <p>text1</p>
...         <p></p>
...         <p>text2</p>
...     </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']

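Note that text=True helps filter out empty paragraphs. (Newer BeautifulSoup releases also accept this argument under the name string.)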
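
If installing BeautifulSoup is not an option, the same idea (let a parser find the tags, then drop empty paragraphs) can be sketched with the standard library's html.parser. This is only an illustration, not part of the answer above: the class name ParagraphCollector is made up for this example, and it treats a paragraph as empty when its text is blank after stripping whitespace, which is a slightly different rule from text=True.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the text of every non-empty <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means "not currently inside a <p>"

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears while a <p> is open.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:  # skip empty paragraphs such as <p></p>
                self.paragraphs.append(text)
            self._buffer = None


data = """
    <div>
        <p>text1</p>
        <p></p>
        <p>text2</p>
    </div>
"""

collector = ParagraphCollector()
collector.feed(data)
collector.close()
print(collector.paragraphs)  # ['text1', 'text2']
```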

Answer 2

Mar*_*cny 6

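Make sure you use re.search (or re.findall) instead of re.match. re.match only tries the pattern at the very beginning of the string, and your HTML almost certainly does not start with a <p> tag.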
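
A small sketch of the difference, using the pattern from the question on a made-up sample string (the html value here is an assumption for illustration, not the asker's actual input):

```python
import re

# Made-up sample: like most real pages, it does not start with <p>.
html = "<div><p>text1</p><p>text2</p></div>"

# re.match only tries the pattern at position 0, where this string has "<div>".
matched = re.match(r"<p>.{1,}</p>", html)
print(matched)  # None

# re.search scans forward until it finds a position where the pattern fits.
found = re.search(r"<p>.{1,}</p>", html)
print(found.group(0))  # <p>text1</p><p>text2</p>
```

The second result spans both paragraphs because .{1,} is greedy.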

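Also note that your pattern is currently greedy, which means it will return everything between the first <p> tag and the last </p> tag, and that is definitely not what you want. Try this instead: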

re.findall(r'<p(\s.*?)?>(.*?)</p>', response.text, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)

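The question marks make the quantifiers lazy, so each match stops at the first closing </p> tag, and findall returns every match rather than only the first one, as search does.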
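
A runnable sketch of the pattern above on made-up input. Here the variable html stands in for response.text, and the sample string is an assumption. Because the pattern has two capture groups, findall returns (attributes, body) tuples. re.MULTILINE is left out because it only changes how ^ and $ behave, and the pattern uses neither.

```python
import re

# Made-up sample: a paragraph with attributes, an empty one, and a multi-line one.
html = '<p class="intro">text1</p>\n<p></p>\n<p>line one\nline two</p>'

# Greedy quantifiers run from the first <p to the last </p>, swallowing everything.
greedy = re.search(r"<p.*>.*</p>", html, flags=re.DOTALL).group(0)
print(greedy == html)  # True

# Lazy quantifiers (.*?) stop at the first place the rest of the pattern can match.
pattern = re.compile(r"<p(\s.*?)?>(.*?)</p>", flags=re.IGNORECASE | re.DOTALL)
matches = pattern.findall(html)
print(matches)
# [(' class="intro"', 'text1'), ('', ''), ('', 'line one\nline two')]

# Keep only the paragraph bodies and skip empty paragraphs.
bodies = [body for _attrs, body in matches if body.strip()]
print(bodies)  # ['text1', 'line one\nline two']
```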
