BeautifulSoup: extracting text that is not inside a given tag

Question

mel*_*mel 3 html python beautifulsoup web-scraping python-3.x

I have the following variable, header, which equals:

<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>

I want to extract only the date, February 11, 2017, from this variable. How can I do that in Python with BeautifulSoup?

Answer 1

Jos*_*ier 5

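If you know the date is always the last text node in the header variable, you can access the .contents attribute and take the last element of the returned list: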

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

header.contents[-1].strip()
> February 11, 2017
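
To see why the last element is the date, it can help to look at what .contents actually holds. The sketch below assumes the markup from the question is stored in a variable named html; that variable and the name last_text are only illustrative.

```python
from bs4 import BeautifulSoup

# Assumption: html holds the markup shown in the question.
html = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""

header = BeautifulSoup(html, "html.parser").find("p")

# .contents lists every direct child of <p>, tags and text nodes alike.
for child in header.contents:
    print(repr(child))

# The last child is a text node with a leading newline, hence the strip().
last_text = header.contents[-1].strip()
print(last_text)
```

If the paragraph ever ends with a tag or with trailing whitespace, the last element will no longer be the date, so this approach depends on the layout staying the same.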

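Alternatively, as MYGz points out in the comments below, you can split the text on newlines and take the last element of the list. Note that this relies on the markup containing a newline right before the date: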

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

header.text.split('\n')[-1]
> February 11, 2017
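
Splitting on '\n' only works when the source markup happens to contain newlines. As a possible alternative that was not part of the original answer, Beautiful Soup's .stripped_strings generator yields each non-empty piece of text with surrounding whitespace removed. The sketch below assumes the same markup collapsed onto one line; the names pieces and date_text are illustrative.

```python
from bs4 import BeautifulSoup

# Assumption: the same markup as in the question, but without any newlines.
html = "<p>Andrew Anglin<br/><strong>Daily Stormer</strong><br/>February 11, 2017</p>"

header = BeautifulSoup(html, "html.parser").find("p")

# With no newlines in the source, split('\n') returns the whole text as one item.
print(header.text.split("\n"))

# .stripped_strings walks all descendant text nodes, so it also includes
# the text inside <strong>; the date is still the last piece.
pieces = list(header.stripped_strings)
date_text = pieces[-1]
print(date_text)
```

Unlike the filter on direct text nodes, this also returns text wrapped in child tags, so it answers "what is the last piece of text" and not "which text is unwrapped".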

If you don't know the position of the date text node, another option is to extract any matching string with a regular expression:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

re.findall(r'\w+ \d{1,2}, \d{4}', header.text)[0]
> February 11, 2017
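
One thing to watch with findall(...)[0] is that it raises IndexError when nothing matches. A minimal sketch of a more defensive variant, assuming you would rather get None back in that case; the helper name find_date and the constant DATE_PATTERN are hypothetical, and the pattern is the same one used above.

```python
import re
from bs4 import BeautifulSoup

# Same pattern as above: a word, a 1-2 digit day, a comma, a 4-digit year.
DATE_PATTERN = re.compile(r"\w+ \d{1,2}, \d{4}")

def find_date(markup):
    """Return the first date-like string in the first <p>, or None."""
    header = BeautifulSoup(markup, "html.parser").find("p")
    if header is None:
        return None  # no <p> element at all
    match = DATE_PATTERN.search(header.get_text())
    return match.group(0) if match else None

# Assumption: html holds the markup shown in the question.
html = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""

print(find_date(html))
print(find_date("<p>No date here</p>"))
```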

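However, as your title suggests, if you only want to retrieve the text nodes that are not wrapped in an element tag, you can filter out the elements like this: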

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

text_nodes = [e.strip() for e in header if not e.name and e.strip()]

Keep in mind that the first text node is not wrapped in a tag either, so this returns the following:

> ['Andrew Anglin', 'February 11, 2017']
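
The filter above relies on text nodes having no .name. A slightly more explicit way to express the same idea, offered here as a sketch and not as the original answer's method, is to check each child against Beautiful Soup's NavigableString class. The variable html is assumed to hold the markup from the question.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumption: html holds the markup shown in the question.
html = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""

header = BeautifulSoup(html, "html.parser").find("p")

# Keep only direct children that are text nodes, skipping whitespace-only ones.
text_nodes = [
    child.strip()
    for child in header.children
    if isinstance(child, NavigableString) and child.strip()
]
print(text_nodes)
```

Because only direct children are inspected, "Daily Stormer" is excluded: it lives inside the strong element, not directly under the paragraph.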

Of course, you can also combine the last two options and parse the date string out of the returned text nodes:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('p')

for node in header:
    if not node.name and node.strip():
        match = re.findall(r'^\w+ \d{1,2}, \d{4}$', node.strip())
        if match:
            print(match[0])

> February 11, 2017
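
If you need an actual date object and not just the string, one option is to let datetime.strptime do the validation in place of the regular expression. This is only a sketch under a few assumptions: the helper name extract_date is hypothetical, the format "%B %d, %Y" is assumed to match your data, and %B expects English month names under the default locale.

```python
from datetime import datetime
from bs4 import BeautifulSoup, NavigableString

def extract_date(markup, fmt="%B %d, %Y"):
    """Return the first unwrapped text node in <p> that parses as a date, or None."""
    header = BeautifulSoup(markup, "html.parser").find("p")
    if header is None:
        return None
    for child in header.children:
        if not isinstance(child, NavigableString):
            continue  # skip element tags such as <br/> and <strong>
        try:
            return datetime.strptime(child.strip(), fmt).date()
        except ValueError:
            continue  # this text node is not a date in the expected format
    return None

# Assumption: html holds the markup shown in the question.
html = """<p>Andrew Anglin<br/>
<strong>Daily Stormer</strong><br/>
February 11, 2017</p>"""

print(extract_date(html))
```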
