Scraping with BeautifulSoup and multiple paragraphs

use*_*057 9 python beautifulsoup web-scraping

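I'm trying to scrape a speech from a website using BeautifulSoup. However, I'm running into problems because the speech is split into many separate paragraphs. I'm very new to programming and can't figure out how to deal with this. The page's HTML looks like this: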

<span class="displaytext">Thank you very much. Mr. Speaker, Vice President Cheney, 
Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is    
at war; our economy is in recession; and the civilized world faces unprecedented dangers. 
Yet, the state of our Union has never been stronger.
<p>We last met in an hour of shock and suffering. In 4 short months, our Nation has comforted the victims, 
begun to rebuild New York and the Pentagon, rallied a great coalition, captured, arrested, and  
rid the world of thousands of terrorists, destroyed Afghanistan's terrorist training camps, 
saved a people from starvation, and freed a country from brutal oppression. 
<p>The American flag flies again over our Embassy in Kabul. Terrorists who once occupied 
Afghanistan now occupy cells at Guantanamo Bay. And terrorist leaders who urged followers to 
sacrifice their lives are running for their own.

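It goes on like that for a while, with multiple paragraph tags. I'm trying to extract all of the text within the span.

I've tried a couple of different ways to get the text, but both failed to return the text I want.

The first thing I tried was: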

import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()

soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
print thespan.string

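This gave me:

Mr. Speaker, Vice President Cheney, Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is at war; our economy is in recession; and the civilized world faces unprecedented dangers. Yet, the state of our Union has never been stronger.

That is the portion of the text up to the first paragraph tag. I then tried: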

import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()

soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
for section in thespan:
     paragraph = section.findNext('p')
     if paragraph and paragraph.string:
         print '>', paragraph.string
     else:
         print '>', section.parent.next.next.strip()

This gave me the text between the first and second paragraph tags. So I'm looking for a way to get the whole text, not just parts of it.

Sha*_*hin 8

import urllib2,sys
from BeautifulSoup import BeautifulSoup

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
soup = BeautifulSoup(urllib2.urlopen(address).read())

span = soup.find("span", {"class":"displaytext"})  # span.string gives you the first bit
paras = [x.contents[0] for x in span.findAllNext("p")]  # this gives you the rest
# use .contents[0] instead of .string to deal with last para that's not well formed

print "%s\n\n%s" % (span.string, "\n\n".join(paras))

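As pointed out in the comments, the approach above does not work well if the <p> tags contain further nested tags, because .contents[0] returns only the first child node of each paragraph. That case can be handled with: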

paras = ["".join(x.findAll(text=True)) for x in span.findAllNext("p")]

However, this does not work well for the last <p>, which has no closing tag. A hacky workaround is to treat that one differently. For example:

import urllib2,sys
from BeautifulSoup import BeautifulSoup

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
soup = BeautifulSoup(urllib2.urlopen(address).read())
span = soup.find("span", {"class":"displaytext"})  
paras = [x for x in span.findAllNext("p")]

start = span.string
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
last = paras[-1].contents[0]
print "%s\n\n%s\n\n%s" % (start, middle, last)
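For comparison, here is a minimal sketch of the same idea (collect every text node inside the span, however many unclosed <p> tags it contains) that swaps BeautifulSoup for Python 3's standard-library html.parser. It is not the code from the answer above. The names SpanTextCollector and extract_speech and the sample markup are made up for illustration, and the sketch assumes the closing </span> comes after the last paragraph, which I have not checked against the real page.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text found inside <span class="displaytext"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while the parser is inside the target span
        self.paragraphs = [[]]  # one list of text chunks per paragraph

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "span":
                self.depth += 1             # nested span: count it so the matching </span> is not mistaken for the end
            elif tag == "p":
                self.paragraphs.append([])  # every <p> starts a new paragraph, closed or not
        elif tag == "span" and ("class", "displaytext") in attrs:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "span":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.paragraphs[-1].append(data)


def extract_speech(html):
    parser = SpanTextCollector()
    parser.feed(html)
    parser.close()
    # Join the chunks of each paragraph and collapse line breaks and repeated spaces.
    cleaned = (" ".join("".join(chunks).split()) for chunks in parser.paragraphs)
    return "\n\n".join(p for p in cleaned if p)


# Made-up sample that mimics the structure in the question: text before the
# first <p>, unclosed <p> tags, and a nested tag inside a paragraph.
sample = """<html><body><span class="displaytext">Thank you very much.
Yet, the state of our Union has never been stronger.
<p>We last met in an hour of <i>shock</i> and suffering.
<p>The American flag flies again over our Embassy in Kabul.</span>
<p>Footer text outside the span.</body></html>"""

speech = extract_speech(sample)
print(speech)
```

Because the collector reacts to each <p> start tag and never waits for a closing </p>, the last paragraph needs no special treatment, and text inside nested tags such as <i> is kept.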