Exclude tags based on their content in BeautifulSoup

Question

the*_*est 4 python beautifulsoup web-scraping

I am scraping HTML data similar to the following:

<div class="target-content">
    <p id="random1">
      "the content of the p"
    </p>

    <p id="random2">
      "the content of the p"
    </p>

    <p>
      <q class="semi-predictable">
         "q tag content that I don't want
      </q>
    </p>

    <p id="random3">
      "the content of the p"
    </p>

</div>

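My goal is to get all the <p> tags and their content while excluding the <q> tag and its content. Currently, I get all the <p> tags with the following: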

contentlist = soup.find('div', class_='target-content').find_all('p')

My question is: after getting the result set of all <p> tags, how can I filter out the <p> that contains a <q>?

Note: after getting the result set from soup.find('div', class_='target-content').find_all('p'), I append each item in the result set to a single string in the following way:

content = ''
for p in contentlist:
    content += str(p)

Answer 1

ale*_*cxe 5

You can skip the p tags that contain a q tag:

for p in soup.select('div.target-content > p'):
    if p.q:  # if q is present - skip
        continue
    print(p)

Here p.q is a shortcut for p.find("q"). div.target-content > p is a CSS selector that matches every p element that is a direct child of a div element with the target-content class.
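
To tie this back to the question, here is a minimal, self-contained sketch that applies the same filter and then builds the content string the asker wanted. The sample markup is a shortened stand-in for the HTML in the question, and the names html, kept, and kept_ids are illustrative only. It assumes the built-in "html.parser" backend, though any parser should behave the same here.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question (illustrative only).
html = """
<div class="target-content">
    <p id="random1">first</p>
    <p id="random2">second</p>
    <p><q class="semi-predictable">unwanted</q></p>
    <p id="random3">third</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the direct-child <p> tags that have no <q> descendant.
# p.find("q") returns None when no <q> is found inside the tag.
kept = [p for p in soup.select("div.target-content > p") if p.find("q") is None]

# Build the combined string, as in the question's loop.
content = "".join(str(p) for p in kept)

# ids of the surviving paragraphs, handy for a quick check.
kept_ids = [p.get("id") for p in kept]
print(kept_ids)
```

If the installed Beautiful Soup is version 4.7 or later, where select() is backed by Soup Sieve, the selector div.target-content > p:not(:has(q)) should express the same filter in a single call. Treat that as an option to verify against your version rather than a guaranteed equivalent.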
