How do I use Beautiful Soup and re to find spans of a specific class that contain specific text?

Question

How can I find all span elements of class 'blue' whose text contains a date and time in the following format:

04/18/13 7:29pm

So the span text could be:

04/18/13 7:29pm

Or:

Posted on 04/18/13 7:29pm

In terms of building the logic to do this, here is what I have so far:

new_content = original_content.find_all('span', {'class' : 'blue'}) # using beautiful soup's find_all
pattern = re.compile('<span class=\"blue\">[data in the format 04/18/13 7:29pm]</span>') # using re
for _ in new_content:
    result = re.findall(pattern, _)
    print result

I have been referring to /sf/answers/541297921/ and /sf/answers/856039411/ while trying to find a way to do this, but the code above is all I have so far.

Edit:

To clarify the scenario, the page contains spans like this:

<span class="blue">here is a lot of text that i don't need</span>

and like this:

<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>

Note that I only need the 04/18/13 7:29pm part, not the rest of the content.

Edit 2:

I also tried:

pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
for _ in new_content:
    result = re.findall(pattern, _)
    print result

and got the error:

'TypeError: expected string or buffer'

Answer 1

Cor*_*erg 17

import re
from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
</body>
</html>
"""

# parse the html, naming the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html_doc, 'html.parser')

# find a list of all span elements
spans = soup.find_all('span', {'class' : 'blue'})

# create a list of lines corresponding to element texts
lines = [span.get_text() for span in spans]

# collect the dates from the list of lines using regex matching groups
found_dates = []
for line in lines:
    m = re.search(r'(\d{2}/\d{2}/\d{2} \d+:\d+[ap]m)', line)
    if m:
        found_dates.append(m.group(1))

# print the dates we collected
for date in found_dates:
    print(date)

Output:

04/18/13 7:29pm
04/19/13 7:30pm
04/20/13 10:31pm
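
To make the regular expression easier to read, here is a minimal sketch of a similar pattern written in verbose form and tried on plain strings, without Beautiful Soup. The names `TIMESTAMP_RE` and `extract_timestamp` are made up for this illustration, and the pattern is slightly stricter than the one above (hour limited to one or two digits, minutes to exactly two), which is an assumption about the data rather than something the question states.

```python
import re

# Verbose form of a date/time pattern similar to the one used above.
# Assumption: hours have 1-2 digits and minutes exactly 2 digits.
TIMESTAMP_RE = re.compile(r"""
    (?P<date>\d{2}/\d{2}/\d{2})   # month/day/year, two digits each
    [ ]                           # a single literal space
    (?P<time>\d{1,2}:\d{2}[ap]m)  # hour, minutes, then am or pm
""", re.VERBOSE)


def extract_timestamp(text):
    """Return (date, time) for the first timestamp in text, or None."""
    m = TIMESTAMP_RE.search(text)
    return (m.group('date'), m.group('time')) if m else None


print(extract_timestamp("Posted on 04/18/13 7:29pm"))
print(extract_timestamp("here is a lot of text that i don't need"))
```

Using `search` rather than `match` lets the timestamp appear anywhere in the text, which covers both the bare date and the "Posted on ..." case.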

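
As for the TypeError in the question: `find_all` returns Beautiful Soup `Tag` objects rather than strings, and the `re` functions only accept strings (or bytes), so passing a `Tag` directly fails. The sketch below reproduces that and shows the fix of converting each element to text first. The sample HTML, the `DATE_RE` name, and the `raised` flag are assumptions made for this illustration.

```python
import re
from bs4 import BeautifulSoup

# Sample markup, assumed for this illustration.
html_doc = """
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">Posted on 04/18/13 7:29pm</span>
<span class="red">04/19/13 7:30pm</span>
"""

DATE_RE = re.compile(r'\d{2}/\d{2}/\d{2} \d{1,2}:\d{2}[ap]m')

soup = BeautifulSoup(html_doc, 'html.parser')

# find_all returns Tag objects, not strings.
spans = soup.find_all('span', {'class': 'blue'})

# Passing a Tag straight to re raises TypeError, as in the question.
try:
    DATE_RE.findall(spans[0])
    raised = False
except TypeError:
    raised = True

# Converting each Tag to its text first makes the regex call valid.
dates = []
for span in spans:
    m = DATE_RE.search(span.get_text())
    if m:
        dates.append(m.group(0))

print(raised)
print(dates)
```

Matching against `get_text()` also means the pattern no longer needs to include the `<span class="blue">` markup, since the class filter is already handled by `find_all`.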

Archived: 12 years, 7 months ago
Views: 24854
Last recorded: 12 years ago