Tag: urllib2

User authentication and text parsing in Python

OK, so I'm writing a multi-stage program... and I can't get the first stage working. What I want to do is log in to Twitter.com and then read all of the direct messages on the user's page.

Eventually I'll be reading all the direct messages and looking for certain things, but that shouldn't be hard.

Here is my code so far:

import urllib
import urllib2
import httplib
import sys

userName = "notmyusername"
password  = "notmypassword"
URL = "http://twitter.com/#inbox"

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "http://twitter.com/", userName, password)
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
pageshit = urllib2.urlopen(URL, "80").readlines()
print pageshit

So some insight into what I'm doing wrong, and some help, would be much appreciated.
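
For reference, here is a minimal sketch of how an HTTPBasicAuthHandler is normally wired up in urllib2. The key point is that the handler does nothing until it is built into an opener, and the #inbox fragment is never sent to the server. Whether twitter.com even accepts Basic Auth for its web pages is a separate issue (its login is form/OAuth based), so treat this only as an illustration of the urllib2 pattern:

import urllib2

userName = "notmyusername"
password = "notmypassword"
URL = "http://twitter.com/"

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, URL, userName, password)
handler = urllib2.HTTPBasicAuthHandler(password_mgr)

# Creating the handler is not enough; it has to be installed in an opener.
opener = urllib2.build_opener(handler)
page = opener.open(URL).read()
print page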

python authentication http urllib2

1 vote · 1 answer · 1320 views

IndexError when using python-ntlm

I'm trying to connect to a server behind NT authentication using urllib2 and python-ntlm, but I'm getting an error. This is the code I'm using, taken from the python-ntlm site:

user = 'DOMAIN\user.name'
password = 'Password123'
url = 'http://corporate.domain.com/page.aspx?id=foobar'

passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, user, password)
# create the NTLM authentication handler
auth_NTLM = HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(passman)

# create and install the opener
opener = urllib2.build_opener(auth_NTLM)
urllib2.install_opener(opener)

# retrieve the result
response = urllib2.urlopen(url)
return response.read()

This is the error I get:

Traceback (most recent call last):
  File "C:\Python27\test.py", line 112, in get_ntlm_data
    response = urllib2.urlopen(url)
  File "C:\Python27\lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 398, …
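
For completeness, the snippet above is missing its preamble; a sketch of the imports it assumes, following the python-ntlm documentation (the raw-string form of the domain user is just a defensive assumption against accidental escape sequences):

import urllib2
from ntlm import HTTPNtlmAuthHandler   # provided by the python-ntlm package

user = r'DOMAIN\user.name'
password = 'Password123'
url = 'http://corporate.domain.com/page.aspx?id=foobar'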

python ntlm urllib2

1 vote · 1 answer · 1239 views

Python: downloading multiple files in a loop

I'm having a problem with my code.

#!/usr/bin/env python3.1

import urllib.request;

# Disguise as a Mozila browser on a Windows OS
userAgent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)';

URL = "www.example.com/img";
req = urllib.request.Request(URL, headers={'User-Agent' : userAgent});

# Counter for the filename.
i = 0;

while True:
    fname =  str(i).zfill(3) + '.png';
    req.full_url = URL + fname;

    f = open(fname, 'wb');

    try:
        response = urllib.request.urlopen(req);
    except:
        break;
    else:
        f.write(response.read());
        i+=1;
        response.close();
    finally:
        f.close();

The problem seems to arise when I create the urllib.request.Request object (called req). I create it with a URL that doesn't exist, but later I change the URL to what it should be. I do this so that I can reuse the same urllib.request.Request object rather than creating a new one on every iteration. There may well be a mechanism for doing this in Python, but I'm not sure what it is.

EDIT: The error message is:

>>> response = urllib.request.urlopen(req);
Traceback (most recent call last): …
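
The traceback is truncated, but one likely culprit is that URL has no scheme ("http://") and no trailing slash, so URL + fname is not something urlopen can open. Below is a minimal reworking under that assumption (the example.com path is illustrative): it builds a fresh Request per iteration, since they are cheap to create, and only opens the output file after a successful fetch so no empty file is left behind:

#!/usr/bin/env python3.1
import urllib.request
import urllib.error

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
base_url = 'http://www.example.com/img/'    # scheme and trailing slash included

i = 0
while True:
    fname = str(i).zfill(3) + '.png'
    req = urllib.request.Request(base_url + fname,
                                 headers={'User-Agent': user_agent})
    try:
        response = urllib.request.urlopen(req)
    except urllib.error.HTTPError:
        break                               # stop at the first missing image
    with open(fname, 'wb') as f:
        f.write(response.read())
    response.close()
    i += 1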

python linux urllib2

1 vote · 1 answer · 5466 views

Multithreading to speed up downloads

How can I download multiple links simultaneously? My script below works, but it only downloads one page at a time and it is extremely slow. I can't figure out how to work multithreading into my script.

The Python script:

from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
  url = link.get('href')
  name = urlparse.urlparse(url).path.split('/')[-1]
  dirname = urlparse.urlparse(url).path.split('.')[-1]
  f = urllib2.urlopen(url)
  s = f.read()
  if (os.path.isdir(dirname) == 0): 
    os.mkdir(dirname)
  soup = BeautifulSoup(s)
  articleTag = soup.html.body.article
  converted = str(articleTag)
  full_path = os.path.join(dirname, name)
  open(full_path, 'w').write(converted)
  print(name)

The HTML file, named links.html:

<a href="http://www.youversion.com/bible/gen.1.nmv-fas">http://www.youversion.com/bible/gen.1.nmv-fas</a>

<a href="http://www.youversion.com/bible/gen.2.nmv-fas">http://www.youversion.com/bible/gen.2.nmv-fas</a>

<a href="http://www.youversion.com/bible/gen.3.nmv-fas">http://www.youversion.com/bible/gen.3.nmv-fas</a>

<a href="http://www.youversion.com/bible/gen.4.nmv-fas">http://www.youversion.com/bible/gen.4.nmv-fas</a>
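
One way to parallelize this with only the standard library is a thread pool from multiprocessing.dummy, with the per-link work moved into a function. The worker count and the OSError guard around mkdir are illustrative assumptions rather than part of the original script:

from multiprocessing.dummy import Pool   # threads, despite the multiprocessing name
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os
import urllib2

def fetch(url):
    # Same per-link logic as the original loop, moved into a worker function.
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    s = urllib2.urlopen(url).read()
    if not os.path.isdir(dirname):
        try:
            os.mkdir(dirname)
        except OSError:
            pass                          # another thread may have created it first
    soup = BeautifulSoup(s)
    converted = str(soup.html.body.article)
    open(os.path.join(dirname, name), 'w').write(converted)
    return name

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
urls = [link.get('href') for link in root.findall('//a')]
pool = Pool(8)                            # 8 download threads; tune as needed
for name in pool.imap_unordered(fetch, urls):
    print(name)
pool.close()
pool.join()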

python lxml urllib urllib2 beautifulsoup

1 vote · 1 answer · 10k views

python urllib2 can't fetch a google url

I'm having a really hard time fetching the results page for this URL using Python's urllib2:

    http://www.google.com/search?tbs=sbi:AMhZZitAaz7goe6AsfVSmFw1sbwsmX0uIjeVnzKHjEXMck70H3j32Q-6FApxrhxdSyMo0OedyWkxk3-qYbyf0q1OqNspjLu8DlyNnWVbNjiKGo87QUjQHf2_1idZ1q_1vvm5gzOCMpChYiKsKYdMywOLjJzqmzYoJNOU2UsTs_1zZGWjU-LsjdFXt_1D5bDkuyRK0YbsaLVcx4eEk_1KMkcJpWlfFEfPMutxTLGf1zxD-9DFZDzNOODs0oj2j_1KG8FRCaMFnTzAfTdl7JfgaDf_1t5Vti8FnbeG9i7qt9wF6P-QK9mdvC15hZ5UR29eQdYbcD1e4woaOQCmg8Q1VLVPf4-kf8dAI7p3jM_1MkBBwaxdt_1TsM4FLwh0oHAYKOS5qBRI28Vs0aw5_1C5-WR4dC902Eqm5eAkLiQyAM9J2bioR66g3tMWe-j9Hyh1ID40R1NyXEJDHcGxp7xOn_16XxfW_1Cq5ArdSNzxFvABb1UcXCn5s4_1LpXZxhZbauwaO8cg3CKGLUvl_1wySDB7QIkMIF2ZInEPS4K-eyErVKqOdY9caYUD8X7oOf6sDKFjT7pNHwlkXiuYbKBRYjlvRHPlcPN1WHWCJWdSNyXdZhwDI3VRaKwmi4YNvkryeNMMbhGytfvlNaaelKcOzWbvzCtSNaP2lJziN1x3btcIAplPcoZxEpb0cDlQwId3A5FDhczxpVbdRnOB-Xeq_1AiUTt_1iI6bSgUAinWXQFYWveTOttdSNCgK-VTxV4OCtlrCrZerk27RBLAzT0ol9NOfYmYhiabzhUczWk4NuiVhKN-M4eo76cAsi74PY4V_1lWjvOpI35V_1YLJQrm0fxVcD34wxFYCIllT2gYW09fj3cuBDMNbsaJqPVQ04OOGlwmcmJeAnK96xd_1aMUd6FsVLOSDS7RfS5MNUSyd1jnXvRU_1MF_1Dj8oC8sm7PfVdjm3firiMcaKM28j9kGWbY0heIGLtO_1m6ad-iKfxYEzSux2b5w62LQlP57yS7vX8RFoyKzHA0RrFIEbPBQdNMA3Vpw0G_1LvEjCAPSCV1HH1pDp0l4EnNCvUIAppVXzNMyWT_1gKITj1NLqAn-Z1tH323JwZSc77OftDSreyHJ-BPxn3n7JMkNZFcQx6S7tfBxeqJ1NuDlpax11pw0_1Oi_1nF3vyEP0NbGKSVgNvBv_1tv8ahxvrHn9UnP78FleiOpzUBfdfRPZiT20VEq5-oXtV_1XwIzrd-5_15-cf2yoL7ohyPuv3WKGUGr4YCsYje7_1D8VslqMPsvbwMg9haj3TrBKH7go70ZfPjUv3h1K7lplnnCdV0hrYVQkSLUY1eEor3L--Vu5PlewS60ZH5YEn4qTnDxniV95h8q0Y3RWXJ6gIXitR5y6CofVg

I'm using the following headers, which I'd have thought should make this straightforward:

    headers = {'Host':'www.google.com','User-Agent':user_agent,'Accept-Language':'en-us,en;q=0.5','Accept-Encoding':'gzip, deflate','Accept-Charset':'ISO-8859-1,utf-8;q=0.7,*;q=0.7','Connection':'keep-alive','Referer':'http://www.google.co.in/imghp?hl=en&tab=ii','Cookie':'PREF=ID=1d7bc4ff2a5d8bc6:U=1d37ba5a518b9be1:FF=4:LD=en:TM=1300950025:LM=1302071720:S=rkk0IbbhxUIgpTyA; NID=51=uNq6mZ385WlV1UTfXsiWkSgnsa6PdjH4l9ph-vSQRszBHRcKW3VRJclZLd2XUEdZtxiCtl5hpbJiS3SpEV7670w_x738h75akcO6Viw47MUlpCZfy4KZ2vLT4tcleeiW; SID=DQAAAMEAAACoYm-3B2aiLKf0cRU8spJuiNjiXEQRyxsUZqKf8UXZXS55movrnTmfEcM6FYn-gALmyMPNRIwLDBojINzkv8doX69rUQ9-'}

When I do the following, the result I get back doesn't contain any of the content that an ordinary web browser returns:

    request=urllib2.Request(url, None, headers)
    response=urllib2.urlopen(request)
    html=response.read()

Likewise, this code returns a pile of hex garbage that I can't read:

    request=urllib2.Request(url,headers=headers)
    response=urllib2.urlopen(request)
    html=response.read()

Please help, because I'm pretty sure this is something simple that I must be missing. I'm able to fetch this link in a similar way, and I can also upload images to images.google.com using the following code:

    import httplib, mimetypes, android, sys, urllib2, urllib, simplejson

    def post_multipart(host, selector, fields, files):
        """
        Post fields and files to an http host as multipart/form-data.
        fields is a sequence of (name, value) elements for regular form fields.
        files is a sequence of (name, filename, value) elements for data to be uploaded as files
        Return the server's response page.
        """
        content_type, body = encode_multipart_formdata(fields, …
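
One likely explanation for the unreadable "hex garbage" is the 'Accept-Encoding': 'gzip, deflate' request header: urllib2 does not decompress responses for you, so if Google honours that header the body comes back gzip-compressed. A sketch of handling it (with url and headers as defined above); simply dropping Accept-Encoding from the headers is the easier alternative:

import gzip
import urllib2
from StringIO import StringIO

request = urllib2.Request(url, None, headers)
response = urllib2.urlopen(request)
body = response.read()

# Decompress only if the server actually replied with gzip content.
if response.info().get('Content-Encoding') == 'gzip':
    body = gzip.GzipFile(fileobj=StringIO(body)).read()
print body[:500]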

python url urllib urllib2

1 vote · 1 answer · 5811 views

Preferred way of making HTTP[S] requests

I need to make HTTP and HTTPS requests using POST, GET and other methods, and to specify headers and a timeout.

There are lots of examples on the internet, and they're all different:

import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

data = urllib.parse.urlencode(values)
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()

Or:

    fetcher = urllib2.build_opener()
    fetcher.addheaders.append(('Cookie', 'aaaa=%s' % aaaa))
    res = fetcher.open(settings.ABC_URL)

Or:

req = urllib2.Request(url=url)
req.add_header('X-Real-IP', request.META['REMOTE_ADDR'])
req.add_header('Cookie', request.META['HTTP_COOKIE'])
req.add_header('User-Agent', request.META['HTTP_USER_AGENT'])
resp = urllib2.urlopen(req).read()

Or:

handler = urllib.urlopen('http://...')
response …
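
For what it's worth, here is a minimal sketch of covering GET/POST, custom headers and a timeout with nothing but urllib and urllib2; the helper name and example values are illustrative, and the other common recommendation is a third-party library such as requests:

import urllib
import urllib2

def fetch(url, data=None, headers=None, timeout=10):
    # data=None gives a GET; a dict is form-encoded and sent as a POST.
    if data is not None:
        data = urllib.urlencode(data)
    req = urllib2.Request(url, data, headers or {})
    return urllib2.urlopen(req, timeout=timeout).read()

# Example usage:
# page = fetch('http://www.someserver.com/cgi-bin/register.cgi',
#              data={'name': 'Michael Foord', 'language': 'Python'},
#              headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'},
#              timeout=5)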

python http urllib urllib2 httprequest

1 vote · 1 answer · 1118 views

Fetching a URL with urllib2 from an authentication-protected Jenkins server

I'm trying to fetch a URL from a Jenkins server. Until recently, I was able to use the pattern described on this page (HOWTO Fetch Internet Resources Using urllib2) to create a password manager that correctly responds to the BasicAuth challenge with a username and password. Everything was fine until the Jenkins team changed their security model, and this code no longer works.

# DOES NOT WORK!
import urllib2
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
top_level_url = "http://localhost:8080"

password_mgr.add_password(None, top_level_url, 'sal', 'foobar')
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)

a_url = 'http://localhost:8080/job/foo/4/api/python'
print opener.open(a_url).read()

The stack trace:

Traceback (most recent call last):
  File "/home/sal/workspace/jenkinsapi/src/examples/password.py", line 11, in <module>
    print opener.open(a_url).read()
  File "/usr/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 382, …
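
The truncated trace ends inside urllib2's HTTP error handling, which is consistent with a frequently reported behaviour of newer Jenkins versions: the server no longer replies with a 401 challenge that HTTPBasicAuthHandler can react to. A common workaround is to send the Basic credentials preemptively; a sketch using the same placeholder credentials:

import base64
import urllib2

a_url = 'http://localhost:8080/job/foo/4/api/python'
credentials = base64.b64encode('sal:foobar')          # "user:password"

req = urllib2.Request(a_url)
req.add_header('Authorization', 'Basic %s' % credentials)
print urllib2.urlopen(req).read()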

python authentication http urllib2 jenkins

1 vote · 1 answer · 7899 views

Why does my very simple Python script fail?

noob.py here. I'm trying to get the contents of a page, but the print statement produces an error I don't understand.

The actual code:

import urllib2
import sys

url = "http://make.wordpress.org/core/page/2/"
response = urllib2.urlopen(url)
html = response.read
print html

The output:

$ python get.py
<bound method _fileobject.read of <socket._fileobject object at 0x3722ec9a8d0>>

I suspected that Python just doesn't like that particular URL, since it works fine with http://www.python.org instead, but I can't find any useful information to help me understand this.

What I don't get at all is that even if I wrap this in a try: / except: pass, I still get that same message.

Any pointers are very welcome.
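
What gets printed is the repr of a bound method, not page content: response.read refers to the method without calling it, which is also why wrapping it in try/except changes nothing, since no exception is ever raised. A sketch with the call added:

import urllib2

url = "http://make.wordpress.org/core/page/2/"
response = urllib2.urlopen(url)
html = response.read()   # note the parentheses: call read(), don't just reference it
print html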

python urllib2

1 vote · 1 answer · 112 views

Python: NameError in the urllib2 module, but only on a few websites

import urllib2
from urllib2 import URLError

website = raw_input('website: ')
with open('words.txt', 'r+') as arquivo:
    for lendo in arquivo.readlines():
        msmwebsite = website + lendo
        try:
            abrindo = urllib2.urlopen(msmwebsite)
            abrindo2 = abrindo.read()           

        except URLError as e:
            pass

        if abrindo.code == 200:
            palavras = ['registration', 'there is no form']
            for palavras2 in palavras:
                if palavras2 in abrindo2:
                    print msmwebsite, 'up'

                else:
                    pass

        else:
            pass

It works, but for some reason, for some websites I get this error:

if abrindo.code == 200:
NameError: name 'abrindo' is not defined

How do I fix it?
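
For context, the NameError shows up on exactly those sites where urlopen raises URLError: the except branch just does pass, so abrindo is never assigned, yet the next line reads abrindo.code anyway. A sketch that skips such URLs instead (the .strip() on each word is an assumption about intent, since readlines() leaves the newline attached):

import urllib2
from urllib2 import URLError

website = raw_input('website: ')
with open('words.txt', 'r+') as arquivo:
    for lendo in arquivo.readlines():
        msmwebsite = website + lendo.strip()
        try:
            abrindo = urllib2.urlopen(msmwebsite)
            abrindo2 = abrindo.read()
        except URLError:
            continue              # skip this URL; abrindo was never assigned
        if abrindo.code == 200:
            for palavra in ['registration', 'there is no form']:
                if palavra in abrindo2:
                    print msmwebsite, 'up'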

python urllib2 nameerror

1 vote · 1 answer · 52 views

Class inheritance issue

I'm trying to create a class that extends HTTPBasicAuthHandler. For some reason, the same approach I've used in older code doesn't work here.

class AuthInfo(urllib2.HTTPBasicAuthHandler):
    def __init__(self, realm, url, username, password):
        self.pwdmgr     = urllib2.HTTPPasswordMgrWithDefaultRealm()
        self.pwdmgr.add_password(None, url, username, password)
        super(AuthInfo, self).__init__(self.pwdmgr)

This is the error:

Traceback (most recent call last):
  File "./RestResult.py", line 67, in ?
    auth = AuthInfo(None, "default", "xxxxx", "xxxxxxxx")
  File "./RestResult.py", line 47, in __init__
    super(AuthInfo, self).__init__(self.pwdmgr)
TypeError: super() argument 1 must be type, not classobj
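
The error comes from super() itself: in Python 2 (and certainly in 2.4), urllib2's handler classes are old-style classes, and super() only works with new-style classes. Calling the base class __init__ directly sidesteps this; a sketch of the same class with that one change:

import urllib2

class AuthInfo(urllib2.HTTPBasicAuthHandler):
    def __init__(self, realm, url, username, password):
        self.pwdmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
        self.pwdmgr.add_password(None, url, username, password)
        # urllib2.HTTPBasicAuthHandler is an old-style class, so super()
        # cannot be used here; call the unbound __init__ directly instead.
        urllib2.HTTPBasicAuthHandler.__init__(self, self.pwdmgr)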

python urllib2 python-2.4

1 vote · 1 answer · 49 views