BeautifulSoup: using `find_all` to find an element by its text, whether or not it has child elements

Bul*_*ula 5 python beautifulsoup

For example,

import re
from bs4 import BeautifulSoup

bs = BeautifulSoup("<html><a>sometext</a></html>", "html.parser")
print(bs.find_all("a", text=re.compile(r"some")))

returns [<a>sometext</a>], but when the element being searched for has a child, e.g. an img:

bs = BeautifulSoup("<html><a>sometext<img /></a></html>", "html.parser")
print(bs.find_all("a", text=re.compile(r"some")))

it returns [].

Is there a way to use find_all to match the latter example?

Nat*_*usa 14

You will need a hybrid approach, because text= fails when an element contains child elements in addition to text.

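bs = BeautifulSoup("<html><a>sometext<img /></a></html>", "html.parser")
reg = re.compile(r'some')
# search() finds the pattern anywhere in the text; match() would only test the start
elements = [e for e in bs.find_all('a') if reg.search(e.text)]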
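If you would rather stay inside find_all itself, one option is to pass a function as the filter, since find_all accepts a callable that receives each tag. This is only a sketch, assuming Python 3 and the built-in html.parser; the helper name tag_text_matches is made up for illustration and is not part of BeautifulSoup.

```python
import re
from bs4 import BeautifulSoup


def tag_text_matches(name, pattern):
    """Build a find_all filter (hypothetical helper, not a bs4 API).

    The filter accepts a tag when its name equals `name` and `pattern`
    is found anywhere in the tag's combined text.
    """
    regex = re.compile(pattern)

    def _filter(tag):
        # get_text() concatenates all descendant strings, so a sibling
        # <img> inside the tag does not hide the text.
        return tag.name == name and regex.search(tag.get_text()) is not None

    return _filter


soup = BeautifulSoup(
    "<html><a>sometext<img /></a><a>other</a></html>", "html.parser"
)
matches = soup.find_all(tag_text_matches("a", r"some"))
print(matches)
```

Note that get_text() also includes text from nested tags, so this matches more loosely than text= does.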

Background

When BeautifulSoup searches for an element and a text filter is given, it eventually calls:

self._matches(found.string, self.text)

In the two examples you gave, the .string property returns different things:

>>> bs1 = BeautifulSoup("<html><a>sometext</a></html>", "html.parser")
>>> bs1.find('a').string
'sometext'
>>> bs2 = BeautifulSoup("<html><a>sometext<img /></a></html>", "html.parser")
>>> bs2.find('a').string
>>> print(bs2.find('a').string)
None

The .string property looks like this:

@property
def string(self):
    """Convenience property to get the single string within this tag.

    :Return: If this tag has a single string child, return value
     is that string. If this tag has no children, or more than one
     child, return value is None. If this tag has one child tag,
     return value is the 'string' attribute of the child tag,
     recursively.
    """
    if len(self.contents) != 1:
        return None
    child = self.contents[0]
    if isinstance(child, NavigableString):
        return child
    return child.string
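To make the docstring concrete, here is a small sketch that runs .string over each case it describes. It assumes Python 3 and the built-in html.parser; string_of is a throwaway helper name, not something bs4 provides.

```python
from bs4 import BeautifulSoup


def string_of(html):
    """Return the .string of the first <a> tag in `html` (illustrative helper)."""
    return BeautifulSoup(html, "html.parser").find("a").string


# One string child: .string is that string.
print(string_of("<a>sometext</a>"))
# No children: .string is None.
print(string_of("<a></a>"))
# One child tag: .string is taken from the child, recursively.
print(string_of("<a><b>sometext</b></a>"))
# More than one child (a string plus <img>): .string is None.
print(string_of("<a>sometext<img /></a>"))
```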

If we print out the contents, we can see why None is returned:

>>> print(bs1.find('a').contents)
['sometext']
>>> print(bs2.find('a').contents)
['sometext', <img/>]
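Since the strings are still present in .contents, you can choose what to match against. The sketch below compares .string, get_text(), and a join over only the direct string children. It assumes Python 3 and html.parser, and own_text is a hypothetical helper written for this example.

```python
from bs4 import BeautifulSoup, NavigableString


def own_text(tag):
    """Join only the strings that are direct children of `tag` (illustrative helper)."""
    return "".join(c for c in tag.contents if isinstance(c, NavigableString))


soup = BeautifulSoup("<a>some<b>text</b><img /></a>", "html.parser")
a = soup.find("a")

print(a.string)      # None: the tag has three children
print(a.get_text())  # sometext: all descendant strings joined
print(own_text(a))   # some: only the tag's own direct strings
```

Use get_text() when text inside nested tags should count, and something like own_text when only the tag's own text should.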