Using BeautifulSoup to extract text between line breaks (e.g. <br /> tags)

mal*_*man 16 html python beautifulsoup html-parsing

I have the following HTML, which is part of a larger document:

<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />

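I am currently using BeautifulSoup to get other elements from the HTML, but I have not found a way to get the important lines of text between the <br /> tags. I can isolate and navigate to each <br /> element, but I can't find a way to get the text in between. Any help would be greatly appreciated. Thank you.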

Mar*_*air 24

If you just want any text that sits between two <br /> tags, you could do something like the following:

from BeautifulSoup import BeautifulSoup, NavigableString, Tag

input = '''<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />'''

soup = BeautifulSoup(input)

for br in soup.findAll('br'):
    next_s = br.nextSibling
    if not (next_s and isinstance(next_s,NavigableString)):
        continue
    next2_s = next_s.nextSibling
    if next2_s and isinstance(next2_s,Tag) and next2_s.name == 'br':
        text = str(next_s).strip()
        if text:
            print "Found:", next_s

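But perhaps I misunderstand your question? Your description of the problem doesn't seem to match the "important" / "non important" labels in your example data, so I've gone with the description ;)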
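
For anyone on Python 3, the same sibling-walking idea can be sketched with the newer `bs4` package. This is only a rough adaptation, assuming `bs4` is installed and using the built-in `"html.parser"` backend; the helper name `text_between_brs` is made up for illustration and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = """<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />"""


def text_between_brs(markup):
    """Return stripped text nodes that sit directly between two <br> tags.

    Hypothetical helper, not a BeautifulSoup API.
    """
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    for br in soup.find_all("br"):
        node = br.next_sibling
        # Only a text node directly after the <br> is of interest.
        if not isinstance(node, NavigableString):
            continue
        after = node.next_sibling
        # The text node must itself be followed by another <br>.
        if isinstance(after, Tag) and after.name == "br":
            text = node.strip()
            if text:  # skip whitespace-only gaps between adjacent <br> tags
                found.append(text)
    return found


for line in text_between_brs(html):
    print("Found:", line)
```

As with the code above, this goes by the description rather than the "important" labels, so every non-empty line between two <br /> tags is reported.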


Ken*_*der 7

So, for testing purposes, let's assume this block of HTML is inside a span tag:

x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""

Now I'm going to parse it and find my span tag:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(x)  # parse the test string defined above
y = soup.find('span')

If you iterate over the generator y.childGenerator(), you will get both the br tags and the text:

In [4]: for a in y.childGenerator(): print type(a), str(a)
   ....: 
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 1

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Not Important Text

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 2

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 3

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Non Important Text

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 4

<type 'instance'> <br />
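
To go from that listing to just the text lines, one option is to keep only the string children and drop the whitespace-only ones. Here is a minimal sketch for Python 3 with the `bs4` package (assumed to be installed), where `.children` is the current spelling of `childGenerator()`. The function name `span_text_lines` and the shortened sample string are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened sample with the same shape as the test string above.
sample = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br /></span>"""


def span_text_lines(markup):
    """Return the non-empty text children of the first <span>.

    Hypothetical helper, not a BeautifulSoup API.
    """
    soup = BeautifulSoup(markup, "html.parser")
    span = soup.find("span")
    if span is None:
        return []
    lines = []
    for child in span.children:
        # <br> tags are Tag objects; text between them is NavigableString.
        if isinstance(child, NavigableString):
            text = child.strip()
            if text:  # drop the whitespace-only gaps between adjacent <br> tags
                lines.append(text)
    return lines


print(span_text_lines(sample))
```

Deciding which of those lines count as "important" is still up to you, since nothing in the markup marks the difference.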