bri*_*enb 4 python urllib2 beautifulsoup
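I am trying to download and save lecture videos from a website. The files download without errors, but they will not play in my media player. This is the code I am using: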
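from bs4 import BeautifulSoup
import re
import urllib2

# Parse the saved page source and collect every link that mentions mp4.
with open('Python/SNA Page Source Revised.txt', 'r') as snippet:
    soup = BeautifulSoup(snippet)
links = [link.get('href') for link in soup.find_all('a')]
videos = []
for link in links:
    # link is None for <a> tags that have no href attribute
    if link and re.search('mp4', link):
        videos.append(link)

# Download each video to a numbered file.
vidNum = 1
for video in videos:
    f = urllib2.urlopen(video)
    # str() is required: concatenating an int to a str raises TypeError
    with open('Data Analysis/Social Network Analysis/Video ' + str(vidNum) + '.mp4', 'wb') as code:
        code.write(f.read())
    vidNum += 1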
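Everything appears to work, but when I try to play one of the videos I get this error: "Python (v2.7) requires a plugin to be installed to play media files of the following type: text/html decoder". Also, when I download a video manually from the website the file is about 22.8 MB, but the file my script produces is only 7.8 kB.
Is something wrong with the way I am downloading the files? Any help would be greatly appreciated.
Also: I am running Python v2.7 on Ubuntu 12.04 LTS.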
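The "text/html" in the error message and the 7.8 kB size both suggest that the saved files contain an HTML page rather than video data. A minimal sketch for checking that on bytes or on a file already on disk, written for Python 3; the helper names and the 512-byte sniff length are assumptions made for illustration:

```python
def looks_like_html(data):
    """Return True if the leading bytes look like an HTML document."""
    head = data[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def file_looks_like_html(path):
    """Read the first bytes of a saved file and apply the same check."""
    with open(path, "rb") as fh:
        return looks_like_html(fh.read(512))

# Usage (the path is the one used in the question):
#   file_looks_like_html('Data Analysis/Social Network Analysis/Video 1.mp4')
```

If the check returns True for a file that was saved with an .mp4 extension, the server sent a web page in place of the video.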
EDIT
This is the code I am using now, based on the replies I received:
import requests

r = requests.get('https://class.coursera.org/sna-003/lecture/download.mp4?lecture_id=2', auth=('myUsername', 'myPassword'))
with open('Data Analysis/TestFile.mp4', 'wb') as fd:
    fd.write(r.content)
This is the output of r.content:
<!DOCTYPE html>
<html itemtype="http://schema.org" xmlns:fb="http://ogp.me/ns/fb#"><head><meta content="IE=Edge,chrome=IE7" http-equiv="X-UA-Compatible"/><meta content="!" name="fragment"/><meta content="NOODP" name="robots"/><meta charset="utf-8"/><meta content="Coursera" property="og:title"/><meta content="website" property="og:type"/><meta content="http://s3.amazonaws.com/coursera/media/Coursera_Computer_Narrow.png" property="og:image"/><meta content="https://www.coursera.org/" property="og:url"/><meta content="Coursera" property="og:site_name"/><meta content="en_US" property="og:locale"/><meta content="Take free online classes from 80+ top universities and organizations. Coursera is a social entrepreneurship company partnering with Stanford University, Yale University, Princeton University and others around the world to offer courses online for anyone to take, for free. We believe in connecting people to a great education so that anyone around the world can learn without limits." property="og:description"/><meta content="727836538,4807654" property="fb:admins"/><meta content="274998519252278" property="fb:app_id"/><meta content="Take free online classes from 80+ top universities and organizations. Coursera is a social entrepreneurship company partnering with Stanford University, Yale University, Princeton University and others around the world to offer courses online for anyone to take, for free. We believe in connecting people to a great education so that anyone around the world can learn without limits." name="description"/><meta content="http://s3.amazonaws.com/coursera/media/Coursera_Computer_Narrow.png" name="image"/><meta content="app-id=736535961" name="apple-itunes-app"/><script>window.onerror = function(message, url, lineNum) {
// First check the URL and line number of the error
url = url || window.location.href;
// 99% of the time, errors without line numbers arent due to our code,
// they are due to third party plugins and browser extensions
if (lineNum === undefined || lineNum == null) return;
// Now figure out the actual error message
// If it's an event, as triggered in several browsers
if (message.target && message.type) {
message = message.type;
}
if (!message.indexOf) {
message = 'Non-string, non-event error: ' + (typeof message);
}
var errorDescrip = {
message: message,
script: url,
line: lineNum,
url: document.URL
}
var err = {
key: 'page.error.javascript',
value: errorDescrip
}
window._204 = window._204 || [];
window._204.push(err);
window._gaq = window._gaq || [];
window._gaq.push(err);
}</script><title>Coursera.org</title><link href="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/css/home.css" rel="stylesheet" type="text/css"/><link href="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/pages/auth/css/auth.css" rel="stylesheet" type="text/css"/><script data-baseurl="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/" id="_mobile">(function(el) {
// Override certian behaviour if the page is for our mobile app.
// TODO(priya) Remove this conditional behaviour once I want to push this behaviour
// for regular authentication pages on mobile/smaller screens as well.
// Currently I'm keeping existing behaviour same and only adding mobile specific
// layouts ot /mobilesignup page (which is what isMobileApp = true signifies).
if ("false" == "true") {
var head = document.getElementsByTagName('head')[0];
// Add viewport meta tag
var viewport = document.querySelector('meta[name=viewport]');
var viewportContent = 'width=device-width, initial-scale=1.0, user-scalable=no';
if (!viewport) {
viewport = document.createElement('meta');
viewport.setAttribute('name', 'viewport');
head.appendChild(viewport);
}
viewport.setAttribute('content', viewportContent);
// Add responsive css
var link = document.createElement('link');
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = el.getAttribute("data-baseurl") + "pages/auth/css/auth_responsive.css";
head.appendChild(link);
}
})(document.getElementById("_mobile"));
</script></head><body><div id="fb-root"></div><div id="origami"><div style="position:absolute;top:0px;left:0px;width:100%;height:100%;background:#f5f5f5;padding-top:5%;"><div id="coursera-loading-nojs" style="text-align:center; margin-bottom:10px;display:none;">Please use a <a href="/browsers">modern browser </a> with JavaScript enabled to use Coursera.</div><div><span id="coursera-loading-js" style="display: none; padding-left:45%">loading <img src="https://d2wvvaown1ul17.cloudfront.net/site-static/images/icons/loading.gif"/></span></div><noscript><div style="text-align:center; margin-bottom:10px;">Please use a <a href="/browsers">modern browser </a> with JavaScript enabled to use Coursera.</div></noscript></div></div><!--[if gte IE 8]><script>document.getElementById("coursera-loading-js").style.display = 'block';</script><![endif]-->
<!--[if lte IE 7]><script>document.getElementById("coursera-loading-nojs").style.display = 'block';
window._204 = window._204 || [];
window._gaq = window._gaq || [];
window._gaq.push(
['_setAccount', 'UA-28377374-1'],
['_setDomainName', window.location.hostname],
['_setAllowLinker', true],
['_trackPageview', window.location.pathname]);
window._204.push(
['client', 'home'],
{key:"pageview", value:window.location.pathname});
</script><script src="https://eventing.coursera.org/204.min.js"></script><script src="https://ssl.google-analytics.com/ga.js"></script><![endif]-->
<!--[if !IE]> --><script>document.getElementById("coursera-loading-js").style.display = 'block';</script><!-- <![endif]--><script src="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/js/core/require.js" type="text/javascript"></script><script data-baseurl="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/" data-debug="0" data-locale="" data-timestamp="1386838999742" data-version="e47434615f57601f9b9ccaf255a589e8550d328d" id="_require" type="text/javascript">if(document.getElementById("coursera-loading-js").style.display == 'block') {
(function(el) {
// prevent throw
require.onError = function(err) {
window._204 = window._204 || [];
window._204.push({key: 'requireErr', value: err});
};
define("pages/auth/authConfig",
function() {
return {"coursera_url": "https://www.coursera.org/",
"environment": "production"};
}
);
require.config({
enforceDefine: false,
waitSeconds: 14,
baseUrl: el.getAttribute("data-baseurl"),
urlArgs: el.getAttribute("data-debug") == "1" ? "v=" + el.getAttribute("data-timestamp") : "",
shim: {
"underscore": {
exports: '_'
},
"backbone": {
deps: ['underscore', 'jquery'],
exports: 'Backbone'
}
},
paths: {
"jquery": "js/core/jquery",
"underscore": "js/core/underscore",
"backbone": "js/core/backbone",
"i18n": "js/core/i18n._t"
},
callback: function() {
require(["pages/auth/routes"]); // bootup coursera
},
config: {
i18n: {
locale: (window.localStorage ? localStorage.getItem("locale") : '') || el.getAttribute("data-locale")
}
}
});
})(document.getElementById("_require"));
}</script><script type="text/javascript">define("pages/home/models/user.json", [], function(){
return null;
});
</script></body></html>
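I find this strange, though, because it looks just like the page source of the site, yet when I look at r.url I get an actual URL that loads in my browser and prompts me to save or view the video. Even when I pass in that new URL, which I believe carries my cookie information, I still get the same content. I don't understand where I am going wrong.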
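One way to see what happened is to inspect the response itself rather than the saved file. The sketch below summarizes the attributes that usually reveal a redirect to a login page: `status_code`, `url`, `history` and the `Content-Type` header, all of which exist on a `requests` response. The helper name `describe_response` is hypothetical, and the function only reads attributes, so it works on any object that provides them:

```python
def describe_response(r):
    """Summarize the parts of a response that reveal a login redirect."""
    content_type = r.headers.get("Content-Type", "")
    return {
        "status": r.status_code,
        "final_url": r.url,
        # requests keeps the intermediate redirect responses in r.history
        "redirects": len(r.history),
        "content_type": content_type,
        "is_html": content_type.lower().startswith("text/html"),
    }

# Usage with the response object from the code above:
#   print(describe_response(r))
```

If `is_html` is true, the server answered with a web page (most likely the login page) and not with the video, whatever the URL ends in.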
小智 5
First, download and install the requests package.
Then use this code:
import requests

def downloadfile(name, url):
    name = name + ".mp4"
    # Pass the url variable (not the string 'url'); stream=True avoids
    # loading the whole video into memory at once.
    r = requests.get(url, stream=True)
    print("****Connected****")
    with open(name, 'wb') as f:
        print("Downloading.....")
        for chunk in r.iter_content(chunk_size=255):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    print("Done")
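Streaming alone will not help if the server still returns the login page, so it can be worth refusing to write anything that is declared as HTML. A sketch of that guard, assuming a `requests.Session`-like object is passed in; the function name `download_video` and the 64 KiB chunk size are illustrative choices, not part of the answer above:

```python
def download_video(session, url, path, chunk_size=64 * 1024):
    """Stream url to path, but refuse responses declared as HTML.

    session is expected to behave like requests.Session (assumption).
    Returns the number of bytes written.
    """
    r = session.get(url, stream=True)
    r.raise_for_status()
    content_type = r.headers.get("Content-Type", "")
    if content_type.lower().startswith("text/html"):
        raise ValueError("got an HTML page (probably a login page), not a video")
    written = 0
    with open(path, "wb") as f:
        for chunk in r.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                written += len(chunk)
    return written

# Usage (requires the requests package; video_url is a placeholder):
#   import requests
#   with requests.Session() as s:
#       download_video(s, video_url, "lecture.mp4")
```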
You need a valid cookie so that you do not download the login page instead of the video.
Here is how to set a cookie with urllib2:
import urllib2
opener = urllib2.build_opener()
opener.addheaders.append(('Cookie', 'cookiename=cookievalue'))
f = opener.open("http://example.com/")
You can also use cookielib to get more browser-like behavior: go through the login process and obtain the correct cookies for downloading your videos.
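A sketch of the cookie-jar approach, written for Python 3, where `cookielib` and `urllib2` are named `http.cookiejar` and `urllib.request`. The login URL and the form field names are placeholders (assumptions); the real site may require different fields or extra tokens:

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_cookie_opener():
    """Build an opener that stores and resends cookies like a browser."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, login_url, username, password):
    """POST credentials; cookies set by the server land in the jar.

    The field names 'email' and 'password' are hypothetical.
    """
    form = urllib.parse.urlencode({"email": username, "password": password})
    return opener.open(login_url, data=form.encode("utf-8"))

# Usage (placeholder URLs):
#   opener, jar = make_cookie_opener()
#   login(opener, "https://example.com/login", "myUsername", "myPassword")
#   video = opener.open("https://example.com/lecture.mp4")
```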
Another option is Requests, which is similar to urllib2 but simpler, to automate the login process.
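With Requests, the login flow can be sketched with a `Session`, which keeps cookies between calls. The login URL and form field names are placeholders, and `login_and_get` is a hypothetical helper; the session argument is assumed to behave like `requests.Session`:

```python
def login_and_get(session, login_url, credentials, target_url):
    """Log in with a POST, then fetch target_url with the same session.

    Cookies returned by the login response are stored on the session
    and sent automatically with the second request.
    """
    resp = session.post(login_url, data=credentials)
    resp.raise_for_status()
    return session.get(target_url, stream=True)

# Usage (requires the requests package; URLs and field names are placeholders):
#   import requests
#   with requests.Session() as s:
#       r = login_and_get(s, "https://example.com/login",
#                         {"email": "myUsername", "password": "myPassword"},
#                         "https://example.com/lecture.mp4")
```

Note that passing `auth=(user, password)` to `requests.get` sends HTTP Basic credentials, which a site with a form-based login will typically ignore.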