cas*_*ova 1 python beautifulsoup python-2.7 python-3.x
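I am building a web scraper for several news outlets, and I am trying to write one for The Hindu newspaper.
I want to fetch the news from the various links listed in its archive. Say I want the news from the links listed for one particular day: http://www.thehindu.com/archive/web/2010/06/19/ (that is, 19 June 2010).
So far I have written the following lines of code: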
import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"
br = mechanize.Browser()
htmltext = br.open(url).read()
articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.contents[0]
print articletext
But I cannot get the desired result, and I am basically stuck. Can someone help me sort this out?
小智 5
Try the following code:
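import re

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"
br = mechanize.Browser()
htmltext = br.open(url).read()
# Parse the archive page before searching it
soup = BeautifulSoup(htmltext, "html.parser")

# Each archive entry for the section is an <li data-section="..."> holding a link
for tag_li in soup.findAll('li', attrs={"data-section": "Op-Ed"}):
    for link in tag_li.findAll('a'):
        urlnew = link.get('href')
        # Open the linked article and parse it
        brnew = mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()
        articletext = ""
        soupnew = BeautifulSoup(htmltextnew, "html.parser")
        # Concatenate the text of every paragraph in the article
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        # Collapse runs of whitespace into single spaces before printing
        print(re.sub(r'\s+', ' ', articletext))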
Because the code calls re.sub, you need to import the re module.
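To check the selection logic without hitting the site, here is a minimal offline sketch in Python 3. It assumes the archive page lists each story as an li element with a data-section attribute wrapping a link, as the code above implies; the sample markup, the example.com URLs, and the helper name section_links are made up for illustration and are not taken from the real page.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup; the real archive page may be structured differently.
SAMPLE_HTML = """
<ul>
  <li data-section="Op-Ed"><a href="http://example.com/op-ed-1">First   piece</a></li>
  <li data-section="Business"><a href="http://example.com/biz-1">Markets
      update</a></li>
  <li data-section="Op-Ed"><a href="http://example.com/op-ed-2">Second piece</a></li>
</ul>
"""


def section_links(html, section):
    """Return (href, title) pairs for links inside <li data-section=section>."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for item in soup.find_all("li", attrs={"data-section": section}):
        # href=True skips anchors that have no href attribute
        for anchor in item.find_all("a", href=True):
            # Collapse runs of whitespace in the link text into single spaces
            title = re.sub(r"\s+", " ", anchor.get_text()).strip()
            links.append((anchor["href"], title))
    return links


if __name__ == "__main__":
    for href, title in section_links(SAMPLE_HTML, "Op-Ed"):
        print(href, title)
```

Once this returns the links you expect on a saved copy of the archive page, each href can be opened and its paragraphs joined as in the answer. Note that link.get('href') may return a relative URL on some pages, in which case it would have to be resolved against the archive URL first.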