Jam*_*mus 2 python parsing urllib urllib2 beautifulsoup
I'm using BS4 with Python 2.7. Here is the start of my code (thanks, root):
from bs4 import BeautifulSoup
import urllib2
f=urllib2.urlopen('http://yify-torrents.com/browse-movie')
html=f.read()
soup=BeautifulSoup(html)
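When I print html, its contents match the page source as viewed in Chrome. However, when I print soup, the entire body is cut off and only this is left (the contents of the head tag):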
<!DOCTYPE html>
<html>
<head>
<title>Browse Movie - YIFY Torrents</title>
<meta charset="utf-8">
<meta content="IE=9" http-equiv="X-UA-Compatible"/>
<meta content="YIFY-Torrents.com - The official YIFY Torrents website. Here you will be able to browse and download all YIFY rip movies in excellent DVD, 720p, 1080p and 3D quality, all at the smallest file size." name="description"/>
<meta content="torrents, yify, movies, movie, download, 720p, 1080p, 3D, browse movies, yify-torrents" name="keywords"/>
<link href="http://static.yify-torrents.com/yify.ico" rel="shortcut icon"/>
<link href="http://yify-torrents.com/rss" rel="alternate" title="YIFY-Torrents RSS feed" type="application/rss+xml"/>
<link href="http://static.yify-torrents.com/assets/css/styles.css?1353330463" rel="stylesheet" type="text/css"/>
<link href="http://static.yify-torrents.com/assets/css/colorbox.css?1327223987" rel="stylesheet" type="text/css"/>
<script src="http://static.yify-torrents.com/assets/js/jquery-1.6.1.min.js?1327224013" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.validate.min.js?1327224011" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.colorbox-min.js?1327224010" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/form.js?1349683447" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/common.js?1353399801" type="text/javascript"></script>
<script>
var webRoot = 'http://yify-torrents.com/';
var IsLoggedIn = 0 </script>
<!--[if !IE]><!--><style type="text/css">#content input.field:focus, #content textarea:focus{border: 1px solid #47bc15 !important;}</style></meta></head></html>
Where am I going wrong?!
I had the same problem, and this fixed it for me:
soup = BeautifulSoup(html, 'html5lib')
You need to install html5lib:
pip install html5lib
or
easy_install html5lib
You can read more about the different parsers available to Beautiful Soup (with their pros and cons) here:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
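If you can't be sure html5lib is installed everywhere the script runs, one option is to name the parser explicitly and fall back when it's missing. The sketch below is only an illustration: pick_parser and parse_with_best_parser are hypothetical helper names, not part of Beautiful Soup, and the fallback order (html5lib, then lxml, then the built-in html.parser) is an assumption you may want to change. It also checks for the symptom from the question, a parsed document with no body.

```python
# Sketch only: pick_parser and parse_with_best_parser are hypothetical helpers,
# not part of Beautiful Soup.
try:
    from bs4 import BeautifulSoup
except ImportError:  # bs4 not installed; the helpers below still import cleanly
    BeautifulSoup = None


def pick_parser(candidates=("html5lib", "lxml")):
    """Return the first installed parser name, else the built-in 'html.parser'."""
    for module_name in candidates:
        try:
            __import__(module_name)
        except ImportError:
            continue  # this parser is not installed, try the next one
        return module_name
    return "html.parser"


def parse_with_best_parser(html):
    """Parse html with an explicitly named parser and warn if <body> is missing."""
    if BeautifulSoup is None:
        raise RuntimeError("bs4 is not installed")
    parser = pick_parser()
    soup = BeautifulSoup(html, parser)
    if soup.body is None:
        # Same symptom as in the question: nothing was parsed past <head>.
        print("Warning: no <body> found when parsing with %r" % parser)
    return soup
```

With the html string from the question, you would call soup = parse_with_best_parser(html) in place of BeautifulSoup(html). If the warning still appears with html.parser, installing html5lib as described above is the more lenient choice for badly formed markup.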