How do I write a BeautifulSoup strainer that only parses objects with certain text between the tags?

Question

Dav*_*ave 6 python django parsing beautifulsoup python-3.x

I'm using Django and Python 3.7. I want more efficient parsing, so I've been reading about SoupStrainer objects. I created a custom one to help me parse only the elements I need...

def my_custom_strainer(self, elem, attrs):
    for attr in attrs:
        print("attr:" + attr + "=" + attrs[attr])
    if elem == 'div' and 'class' in attr and attrs['class'] == "score":
        return True
    elif elem == "span" and elem.text == re.compile("my text"):
        return True

article_stat_page_strainer = SoupStrainer(self.my_custom_strainer)
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)

One of the conditions is that I only want to parse "span" elements whose text matches a certain pattern. Hence the

elem == "span" and elem.text == re.compile("my text")

clause. However, this results in

AttributeError: 'str' object has no attribute 'text'

when I try to run the above. What is the proper way to write the strainer?

Answer 1

dar*_*ess 5

TL;DR: No, this is currently not easy to do in BeautifulSoup (it would require modifying the BeautifulSoup and SoupStrainer objects).

Explanation:

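The problem is that the function passed to the strainer is called from the handle_starttag() method. As you can guess, at that point you only have the values from the opening tag (that is, the element name and its attributes). This is also why the code in the question fails: elem is the tag name as a plain str, not a Tag, so it has no .text attribute.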

https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/__init__.py#L524

if (self.parse_only and len(self.tagStack) <= 1
    and (self.parse_only.text
         or not self.parse_only.search_tag(name, attrs))):
    return None

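As you can see, if your strainer function returns False, the element is discarded immediately, with no chance to take its inner text into account (unfortunately).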
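
To see why the text is out of reach at that point, it can help to look at the event order of the standard-library parser that BeautifulSoup's "html.parser" builder is built on. The following is only a minimal sketch using the stdlib html.parser.HTMLParser directly, not BeautifulSoup's own code; EventLogger and log_events are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record parser events in the order they fire (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Only the tag name and its attributes are known here;
        # the element's text has not been read yet.
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        # The text arrives later, as a separate event.
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def log_events(html):
    """Feed html to the parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


for event in log_events('<span class="note">my text</span>'):
    print(event)
```

The start-tag event fires before the data event, so any keep-or-discard decision made in handle_starttag() has to be made without knowing the text.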

On the other hand, if you add "text" to the search:

SoupStrainer(text="my text")

it will start searching the text inside tags, but then it has no context about the element or its attributes. You can see the irony :/

And combining the two gets you nothing. You can't even access the parent, the way it is shown for the find function here: https://gist.github.com/RichardBronosky/4060082

So currently strainers are only good for filtering on elements/attributes. You would need to change a lot of Beautiful Soup code to make this work.

If you really need this, I suggest subclassing the BeautifulSoup and SoupStrainer objects and modifying their behavior.
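
If subclassing is more than you want to take on, one possible workaround (not something proposed above, just a sketch) is to do the filtering in two stages: strain on tag names only while parsing, then filter on attributes and text once the reduced tree exists. The sample HTML and the parse_matching helper below are hypothetical, and the sketch assumes a Beautiful Soup version that supports the string= argument to find_all().

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample document, for illustration only.
html = """
<div class="score">42</div>
<span>my text here</span>
<span>something else</span>
<p>ignored paragraph</p>
"""


def parse_matching(html, pattern):
    """Return (score divs, spans whose text matches pattern)."""
    # Stage 1: strain on tag names only, which is all the parser
    # knows when it sees an opening tag.
    strainer = SoupStrainer(["div", "span"])
    soup = BeautifulSoup(html, features="html.parser", parse_only=strainer)

    # Stage 2: the reduced tree is built, so text is now available.
    scores = soup.find_all("div", class_="score")
    spans = soup.find_all("span", string=re.compile(pattern))
    return scores, spans


scores, spans = parse_matching(html, "my text")
print([tag.get_text() for tag in scores])
print([tag.get_text() for tag in spans])
```

This still builds every div and span before discarding the ones that don't match, so it saves less work than a true text-aware strainer would, but it needs no changes to Beautiful Soup itself.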

Archived:	6 years, 9 months ago
Views:	426
Last recorded:	6 years, 9 months ago