BeautifulSoup: extracting strings from HTML tags in a ResultSet object

Sha*_*ang 4 html beautifulsoup python-requests

I am confused about how to work with BeautifulSoup's ResultSet object, i.e. bs4.element.ResultSet.

After calling find_all(), how does one extract the text?

Example:

In the bs4 documentation, the HTML document html_doc looks like this:

<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>

One begins by creating the soup and finding all the a tags (the links):

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all('a')

which outputs

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

We can also do

for link in soup.find_all('a'):
    print(link.get('href'))

which outputs

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

I would like to get the text from the class_="sister" tags, i.e.

Elsie
Lacie
Tillie

One might try

for link in soup.find_all('a'):
    print(link.get_text())

but this results in an error:

AttributeError: 'ResultSet' object has no attribute 'get_text'

Joe*_*ung 6

Use find_all() and filter on class_='sister'.

Note: notice the underscore after class. This is a special case because class is a reserved word in Python.

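It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_: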

Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
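As a rough illustration of the class_ keyword, the sketch below compares it with the dictionary form attrs={'class': ...}, which needs no underscore because 'class' is only a string key there. The snippet variable and the second link's "other" class are made up for this example and are not part of the question's html_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link snippet; the "other" class is an assumption
# added only so that the filter has something to exclude.
snippet = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="other" href="http://example.com/lacie" id="link2">Lacie</a>
</p>'''

soup = BeautifulSoup(snippet, 'html.parser')

# Keyword form: the trailing underscore avoids the reserved word `class`.
by_keyword = soup.find_all('a', class_='sister')

# Dictionary form: 'class' is an ordinary string key, so no underscore.
by_attrs = soup.find_all('a', attrs={'class': 'sister'})

ids_keyword = [tag['id'] for tag in by_keyword]
ids_attrs = [tag['id'] for tag in by_attrs]
print(ids_keyword, ids_attrs)
```

Both calls should select the same tags; the keyword form is simply shorter to write.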

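Once you have all the tags with class sister, call .text on each of them to get the text. Be sure to strip the text, since it is surrounded by whitespace.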

For example:

from bs4 import BeautifulSoup

html_doc = '''<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>'''

soup = BeautifulSoup(html_doc, 'html.parser')
sistertags = soup.find_all(class_='sister')
for tag in sistertags:
    print(tag.text.strip())

Output:

(bs4)macbook:bs4 joeyoung$ python bs4demo.py
Elsie
Lacie
Tillie
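The AttributeError in the question is what you get when get_text() is called on the ResultSet itself rather than on the individual tags inside it. A ResultSet behaves like a list of Tag objects, so the method has to be called per element. The short sketch below shows both cases, using a shortened, made-up html string rather than the full html_doc, and get_text(strip=True) as one possible alternative to .text.strip().

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup standing in for the question's html_doc.
html = '<p><a class="sister"> Elsie </a>, <a class="sister"> Lacie </a></p>'
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a ResultSet, which is list-like.
results = soup.find_all(class_='sister')

# Calling a Tag method on the whole ResultSet raises AttributeError.
try:
    results.get_text()
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# Calling it on each Tag works; strip=True trims surrounding whitespace.
names = [tag.get_text(strip=True) for tag in results]

print(error_name)
print(names)
```

Collecting the names in a list rather than printing inside the loop makes them easy to reuse later.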