Tags: beautifulsoup, python-2.7
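I'm trying to get all of the p tags in the body of an article. Could someone explain why my code is wrong and how I can improve it? The article URL and the relevant code are below. Thanks for any insight you can offer.
URL: http://www.france24.com/en/20140310-libya-seize-north-korea-crude-oil-tanker-rebels-port-rebels/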
import urllib2
from bs4 import BeautifulSoup
# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")
soup = BeautifulSoup(urllib2.urlopen(url).read())
# retrieve all of the paragraph tags
body = soup.find("div", {'class':'bd'}).get_text()
for tag in body:
    p = soup.find_all('p')
    print str(p) + '\n' + '\n'
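The problem is that the page contains more than one div tag with class="bd". It looks like you need the one that holds the actual article text, which sits inside the article tag: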
import urllib2
from bs4 import BeautifulSoup
# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")
soup = BeautifulSoup(urllib2.urlopen(url))
# retrieve all of the paragraph tags
paragraphs = soup.find('article').find("div", {'class': 'bd'}).find_all('p')
for paragraph in paragraphs:
    print paragraph.text
Prints:
Libyan government forces on Monday seized a North Korea-flagged tanker after...
...
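As for why the original loop misbehaves: get_text() returns a plain string, so `for tag in body` runs once per character, and soup.find_all('p') inside the loop searches the whole document again on every pass. Here is a minimal sketch of both points that runs without network access. The HTML string is made-up markup standing in for the real page (an assumption, not the actual france24.com source), and it passes "html.parser" explicitly so the parser choice is not left to whatever is installed.

```python
from __future__ import print_function  # keeps print() working on Python 2.7 too

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page: two div.bd blocks,
# only one of which sits inside <article>.
HTML = """
<html><body>
  <div class="bd"><p>Sidebar teaser</p></div>
  <article>
    <div class="bd"><p>First paragraph.</p><p>Second paragraph.</p></div>
  </article>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find() returns the first match in document order, here the sidebar div.
first_bd = soup.find("div", {"class": "bd"})
first_bd_text = first_bd.get_text()  # a plain string, not a list of tags

# Iterating over a string yields single characters, so a loop over it runs
# once per character rather than once per paragraph.
loop_iterations = len(list(first_bd_text))

# Scoping the search: start from <article>, then the div, then its <p> tags.
article_bd = soup.find("article").find("div", {"class": "bd"})
paragraphs = [p.get_text() for p in article_bd.find_all("p")]

# The same lookup written as a CSS selector.
selected = [p.get_text() for p in soup.select("article div.bd p")]

for text in paragraphs:
    print(text)
```

The select() form is just an alternative spelling of the chained find() calls; either should work as long as the real page keeps the article > div.bd > p nesting.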
Hope that helps.