How can I get all the text between two specified tags using BeautifulSoup?

Question

Ami*_*dav 6 python beautifulsoup html-parsing

html = """
...
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="#all" title="Permalink to this definition">¶</a>
...
"""

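I want to get all the text between the opening <big> tag and the first <a> tag that follows it. For this example, that means I should get (iterable) as a single string.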

Answer 1

Jon*_*nts 5

An iterative approach:

from itertools import takewhile

from bs4 import BeautifulSoup, Tag


def get_text(html, from_tag, until_tag):
    soup = BeautifulSoup(html, 'html.parser')
    for big in soup(from_tag):
        until = big.find_next(until_tag)
        # Walk the following siblings, stopping at the first until_tag.
        siblings = takewhile(lambda node: node is not until, big.next_siblings)
        # Keep only sibling tags that contain text; bare strings between tags are skipped.
        selected = [node for node in siblings
                    if isinstance(node, Tag) and node.get_text().strip()]
        # A from_tag immediately followed by until_tag yields nothing.
        if selected:
            yield ''.join(node.get_text() for node in [big, *selected])


for text in get_text(html, 'big', 'a'):
    print(text)
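
The generator above joins only the text of sibling tags, so a bare string sitting directly between two tags would be dropped. As a rough sketch (not the answerer's code), the variant below also keeps those bare strings. It assumes the bs4 package; the helper name `sibling_text_until` and the `sample` markup are made up for illustration.

```python
from bs4 import BeautifulSoup, Tag


def sibling_text_until(start, until_name):
    """Join the text of `start` and of each following sibling,
    stopping at the first sibling tag named `until_name`."""
    parts = [start.get_text()]
    for node in start.next_siblings:
        if isinstance(node, Tag):
            if node.name == until_name:
                break
            parts.append(node.get_text().strip())
        else:
            # Bare string between tags: keep it, minus surrounding whitespace.
            parts.append(str(node).strip())
    return ''.join(parts)


# Hypothetical markup in which "iterable" is not wrapped in a tag.
sample = '<p><big>(</big>\niterable\n<big>)</big>\n<a href="#all">¶</a></p>'
soup = BeautifulSoup(sample, 'html.parser')
result = sibling_text_until(soup.find('big'), 'a')
print(result)
```

Like the generator, this only looks at siblings, so it will not find an <a> that is nested inside another element.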

Answer 2

ano*_*ave 4

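I would avoid nextSibling, because from your question you want to include everything up to the next <a>, regardless of whether it sits in a sibling, parent, or child element.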

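So I think the best approach is to find the node for the next <a> element and loop recursively until you reach it, adding each string you encounter along the way. You may need to tidy up the following if your HTML differs much from the sample, but something like this should work: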

from bs4 import BeautifulSoup, NavigableString

# Uses the `html` variable from the question.
soup = BeautifulSoup(html, 'html.parser')
firstBigTag = soup.find('big')
nextATag = firstBigTag.find_next('a')


def loopUntilA(text, element):
    # Stop at the target <a> tag, or at the end of the document.
    if element is None or element is nextATag:
        return text
    # Only string nodes carry text; strip() drops the whitespace between tags.
    if isinstance(element, NavigableString):
        text += element.strip()
    # Advance one node at a time in document order.
    return loopUntilA(text, element.next_element)


targetString = loopUntilA('', firstBigTag)
print(targetString)
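
The same document-order walk can also be written without recursion, which avoids hitting Python's recursion limit on long documents. The following is a minimal sketch assuming bs4's `next_elements` generator; the helper name `text_until` and the `nested` markup are hypothetical and chosen to show the case where the <a> is not a sibling of the first <big>.

```python
from bs4 import BeautifulSoup, NavigableString


def text_until(start, until_name):
    """Collect every string from `start` up to the next tag named
    `until_name`, walking the tree in document order."""
    stop = start.find_next(until_name)  # None if there is no such tag
    parts = []
    for node in start.next_elements:
        if node is stop:
            break
        if isinstance(node, NavigableString):
            # Drop the whitespace that separates tags in the markup.
            parts.append(node.strip())
    return ''.join(parts)


# Hypothetical markup: the closing <big> and the <a> are nested inside a <span>.
nested = ('<dt><big>(</big> <em>iterable</em> '
          '<span><big>)</big> <a href="#all">¶</a></span></dt>')
soup = BeautifulSoup(nested, 'html.parser')
nested_result = text_until(soup.find('big'), 'a')
print(nested_result)
```

If no matching tag follows `start`, `stop` is None and the loop runs to the end of the document, so callers may want to check for that case first.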

Archived:	13 years, 3 months ago
Views:	9,800
Last recorded:	13 years, 3 months ago