BeautifulSoup: how to get nested divs

Question

tor*_*orr 6 python beautifulsoup web-scraping

Given the following code:

<html>
<body>
<div class="category1" id="foo">
      <div class="category2" id="bar">
            <div class="category3">
            </div>
            <div class="category4">
                 <div class="category5"> test
                 </div>
            </div>
      </div>
</div>
</body>
</html>

How can I extract the word test from <div class="category5"> test using BeautifulSoup, i.e. how do I deal with nested divs? I tried looking this up on the Internet but didn't find any example that was easy to understand, so I set this one up. Thanks.

Answer 1

Anz*_*zel 6

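XPath would be the straightforward answer, but BeautifulSoup does not support it.

Update: with a BeautifulSoup solution

To do this, given that you know the class and the element (div) in this case, you can use a for loop with attrs to get what you want: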

from bs4 import BeautifulSoup

html = '''
<html>
<body>
<div class="category1" id="foo">
      <div class="category2" id="bar">
            <div class="category3">
            </div>
            <div class="category4">
                 <div class="category5"> test
                 </div>
            </div>
      </div>
</div>
</body>
</html>'''

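# Name the parser explicitly so the result does not depend on which parsers are installed
content = BeautifulSoup(html, 'html.parser')

# find_all returns every matching div, however deeply it is nested
for div in content.find_all('div', attrs={'class': 'category5'}):
    print(div.text.strip())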

test

I had no trouble extracting the text from your HTML sample. As @MartijnPieters suggested, you will need to find out why your div element is missing.
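If you want to follow the nesting explicitly rather than search for the innermost class directly, here is a minimal sketch of two ways to do it. It assumes the same markup as in the question (trimmed to the div tree), and it uses the standard-library `html.parser` so it does not depend on lxml being installed. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="category1" id="foo">
  <div class="category2" id="bar">
    <div class="category3"></div>
    <div class="category4">
      <div class="category5"> test
      </div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Option 1: walk down the nesting one level at a time with chained find() calls.
outer = soup.find("div", id="foo")
inner = outer.find("div", class_="category4").find("div", class_="category5")
chained_text = inner.get_text(strip=True)

# Option 2: describe the same path as a CSS selector.
selected = soup.select_one("div#foo div.category4 > div.category5")
selector_text = selected.get_text(strip=True)

print(chained_text)   # test
print(selector_text)  # test
```

Both `find()` and `select_one()` return None when nothing matches, so on a real page it is worth checking the result before calling `get_text()`.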

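Another update

You are missing lxml as a parser for BeautifulSoup, which is why nothing is returned: nothing was parsed in the first place. Installing lxml should solve your problem.

You may also consider using lxml, or something similar that supports XPath, which is dead easy if you ask me.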
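One way to make the parser dependency visible is to request it by name and fall back when it is not installed. This is only a sketch: `make_soup` is a hypothetical helper, not part of BeautifulSoup, and it assumes that falling back to the standard-library `html.parser` is acceptable for your pages.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = '<div class="category4"><div class="category5"> test </div></div>'


def make_soup(markup):
    """Parse with lxml when it is installed, else use the stdlib parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is unavailable.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html)
div = soup.find("div", class_="category5")

# find() returns None when there is no match, so guard before reading the text.
text = div.get_text(strip=True) if div is not None else None
print(text)  # test
```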

from lxml import etree

tree = etree.fromstring(html) # or etree.parse from source
tree.xpath('.//div[@class="category5"]/text()')
[' test\n                 ']

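Note that `etree.fromstring` parses the input as XML, which works here because the sample is well-formed. For pages that are not (unclosed tags and so on), a sketch using `lxml.html` instead might look like the following. The `<br>` tag is added here only to illustrate markup that is not well-formed XML, and the whitespace stripping is my own addition.

```python
from lxml import html as lxml_html

page = """
<html>
<body>
<div class="category1" id="foo">
  <div class="category2" id="bar">
    <div class="category3"><br></div>
    <div class="category4">
      <div class="category5"> test
      </div>
    </div>
  </div>
</div>
</body>
</html>
"""

# lxml.html uses a lenient HTML parser, so the unclosed <br> is not an error.
tree = lxml_html.fromstring(page)

# text() yields the raw text nodes, including surrounding whitespace.
raw = tree.xpath('//div[@class="category5"]/text()')
words = [t.strip() for t in raw if t.strip()]

# normalize-space() trims and collapses the whitespace inside XPath itself.
normalized = tree.xpath('normalize-space(//div[@class="category5"])')

print(words)       # ['test']
print(normalized)  # test
```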

Archived:	11 years ago
Views:	9488
Last activity:	11 years ago