Using a lambda function in Beautiful Soup

Question

I'm trying to match links whose href contains certain text. I'm doing

links = soup.find_all('a',href=lambda x: ".org" in x)

But this raises TypeError: argument of type 'NoneType' is not iterable.

The correct way is apparently

links = soup.find_all('a',href=lambda x: x and ".org" in x)

Why is the extra x and part needed here?

Answer 1

Ara*_*Fey 5

The reason is simple: one of the <a> tags in your HTML has no href attribute.

Here is a minimal example that reproduces the exception:

from bs4 import BeautifulSoup

html = '<html><body><a>bar</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a', href=lambda x: ".org" in x)
# result:
# TypeError: argument of type 'NoneType' is not iterable

Now, if we add an href attribute, the exception goes away:

html = '<html><body><a href="foo.org">bar</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a', href=lambda x: ".org" in x)
# result:
# [<a href="foo.org">bar</a>]

What happens is that BeautifulSoup tries to access the href attribute of the <a> tag, and when that attribute does not exist it gets None:

html = '<html><body><a>bar</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.a.get('href'))
# output: None
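
To tie this back to the question, here is a small sketch of a document that mixes <a> tags with and without href. The markup and the has_org helper name are made up for illustration; the guard inside it is the same check as in the question's lambda, written with an explicit None test.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <a> has no href, the other two do.
html = (
    '<html><body>'
    '<a>bar</a>'
    '<a href="foo.org">foo</a>'
    '<a href="baz.com">baz</a>'
    '</body></html>'
)
soup = BeautifulSoup(html, 'html.parser')


def has_org(value):
    # value is None for the <a> tag that has no href attribute,
    # so check for None before using the "in" operator.
    return value is not None and ".org" in value


links = soup.find_all('a', href=has_org)
print(links)
# [<a href="foo.org">foo</a>]
```

The tag without href is simply skipped instead of raising TypeError.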

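That is why the lambda has to allow for None values. Since None is falsy, the x and ... expression prevents the right-hand side of the and from being evaluated when x is None, as you can see here: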

>>> None and 1/0
>>> 'foo.org' and 1/0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero

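This is called short-circuiting.

That said, x and ... checks the truthiness of x, and None is not the only value that is considered falsy. So it would be more correct to compare x against None explicitly: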

lambda x: x is not None and ".org" in x
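
As a rough illustration of the difference, the sketch below runs both checks on a few sample values without involving BeautifulSoup at all. The function names truthy_guard and none_guard and the sample values are assumptions made for this example. Note that for this particular test both versions reject the same inputs; what differs is the value they return for falsy inputs.

```python
def truthy_guard(x):
    # Same check as: lambda x: x and ".org" in x
    return x and ".org" in x


def none_guard(x):
    # Same check as: lambda x: x is not None and ".org" in x
    return x is not None and ".org" in x


# Sample values: a missing attribute (None), an empty href, and two real hrefs.
for value in [None, "", "foo.org", "bar.com"]:
    print(repr(value), repr(truthy_guard(value)), repr(none_guard(value)))
# None None False
# '' '' False
# 'foo.org' True True
# 'bar.com' False False
```

The truthiness version hands back x itself (None or '') when x is falsy, while the explicit comparison always produces a real bool.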
