Python regex for multiple tags

Question

I'd like to know how to retrieve the results from every <p> tag, not just the first one.

import re
htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(re.match('<p[^>]*size="[0-9]">(.*?)</p>', htmlText).groups())


Result:

('item1',)


What I need:

('item1', 'item2', 'item3')


Answer 1

Pet*_*ton 11

For this kind of problem, a DOM parser is recommended rather than a regular expression.

I've often seen Beautiful Soup recommended for Python.
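To illustrate the parser-based approach without installing anything, here is a minimal sketch that uses the standard library's `html.parser` in place of Beautiful Soup. This is a swapped-in technique, not what the answer itself uses, and the names `ParagraphCollector` and `paragraph_texts` are hypothetical helpers invented for this example.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the text inside every <p> element."""

    def __init__(self):
        super().__init__()
        self.items = []       # finished paragraph texts
        self._in_p = False    # True while the parser is inside a <p>
        self._chunks = []     # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        # Attributes, their order, and extra whitespace do not matter here.
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.items.append("".join(self._chunks))
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def paragraph_texts(html_text):
    """Return the text of each <p> element in html_text, in document order."""
    parser = ParagraphCollector()
    parser.feed(html_text)
    parser.close()
    return parser.items


html_text = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(tuple(paragraph_texts(html_text)))
```

Because the parser tokenizes the tags itself, this sketch does not care which attributes a `<p>` carries. It is intentionally simple and does not try to handle unclosed `<p>` tags the way a full HTML library would.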

Answer 2

Bre*_*Bim 5

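Beautiful Soup is definitely the way to go for a problem like this. The code is cleaner and easier to read. Once it's installed, getting all the tags looks like this.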

from BeautifulSoup import BeautifulSoup
import urllib2

def getTags(tag):
  f = urllib2.urlopen("http://cnn.com")
  soup = BeautifulSoup(f.read())
  return soup.findAll(tag)


if __name__ == '__main__':
  tags = getTags('p')
  for tag in tags: print(tag.contents)


This prints the contents of every p tag on the page.

Answer 3

Tri*_*ych 5

The regex answer is very fragile. Here's the proof (and a working BeautifulSoup example).

from BeautifulSoup import BeautifulSoup

# Here's your HTML
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'

# Here's some simple HTML that breaks your accepted 
# answer, but doesn't break BeautifulSoup.
# For each example, the regex will ignore the first <p> tag.
html2 = '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>'
html3 = '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>'
html4 = '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>'

# This BeautifulSoup code works for all the examples.
paragraphs = BeautifulSoup(html).findAll('p')
items = [''.join(p.findAll(text=True)) for p in paragraphs]

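The fragility can be checked directly. The sketch below assumes the regex answer amounts to the pattern from the question applied with `re.findall` (that pairing is an assumption, since the regex answer itself is not shown here), and runs it against the four sample strings. The helper name `regex_items` is hypothetical.

```python
import re

# Pattern copied from the question; using it with findall is an assumption
# about what the regex-based answer does.
PATTERN = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

samples = {
    "html": '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>',
    "html2": '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>',
    "html3": '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>',
    "html4": '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>',
}


def regex_items(html_text):
    """Hypothetical helper: every capture group the pattern finds in html_text."""
    return PATTERN.findall(html_text)


for name, text in samples.items():
    print(name, regex_items(text))
```

Only the first sample yields all three items. In the other three, the first `<p>` is skipped because the pattern requires `size` to be the last attribute, to hold a single digit, and to be followed immediately by `>`.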

Use BeautifulSoup.
