How to extract text after a tag with BeautifulSoup

Question

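I can't figure out how to use BeautifulSoup to reach the paragraph shown below and extract the specific text I want. I'm new to both Python and BS4.

My HTML looks like this: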

<div class="inner-content">
  <div class="bred"></div>
  <div class="clrbth"></div>
  <h1></h1>
  <h4></h4>
  ...
  ...
  ...
  <p></p>
  <p></p>
  <p>

<!--This text I don't want -->

    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
    <br></br>


<!-- The text I want to extract using BeautifulSoup-->

    It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).

  </p>
  <p></p>
  <p></p>
  ...
  ...
  ...
  <div class="bred"></div>
  <div class="clrbth"></div>
  <h1></h1>
 </div>

Please tell me how to extract the text marked above from my HTML. Thanks.

Answer 1

sty*_*ane 7

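You can use the find_all() method with the limit argument to get the third p tag in your HTML. Next, use .find to return the first br tag inside that third paragraph. From there, .next_siblings gives you a generator over the nodes that follow the br, and you can pass it to str.join to build a single string.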

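>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')  # html: a string holding the markup from the question
>>> third_p = soup.find_all('p', limit=3)[-1]
>>> ''.join(third_p.find('br').next_siblings)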

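For reference, here is a self-contained sketch of the same approach that can be run as a script. The shortened markup and the helper name text_after_first_br are illustrative assumptions, not part of the original answer. The sketch also filters the siblings before joining: ''.join raises a TypeError if a Tag (for example a second br) follows the first one, and HTML comments are string nodes, so they would otherwise end up in the result.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Trimmed-down stand-in for the question's markup (an assumption, not the real page).
html = """
<div class="inner-content">
  <p></p>
  <p></p>
  <p>
    <!--This text I don't want -->
    Unwanted first paragraph.
    <br></br>
    <!-- The text I want to extract using BeautifulSoup-->
    Wanted second paragraph.
  </p>
  <p></p>
</div>
"""


def text_after_first_br(paragraph):
    """Join the plain-text nodes that follow the first <br> in `paragraph`."""
    br = paragraph.find('br')
    if br is None:
        return ''
    parts = [
        node for node in br.next_siblings
        # Keep text nodes only: skip Tag objects and HTML comments.
        if isinstance(node, NavigableString) and not isinstance(node, Comment)
    ]
    return ''.join(parts).strip()


soup = BeautifulSoup(html, 'html.parser')
third_p = soup.find_all('p', limit=3)[-1]  # the third <p> in document order
extracted = text_after_first_br(third_p)
print(extracted)
```

If the real page does not contain comments or extra tags after the br, the shorter one-liner from the answer behaves the same way, apart from the surrounding whitespace that .strip() removes here.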
