I want to fetch a lot of pages from a website, something like
curl "http://farmsubsidy.org/DE/browse?page=[0000-3603]" -o "de.#1"
but getting the page data in Python, not as files on disk. Can someone post pycurl code to do this,
or fast urllib2 (not one at a time) if that's possible,
or else just say "forget it, curl is faster and more robust"? Thanks.
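For context, curl's [0000-3603] expands to the zero-padded page numbers 0000 through 3603. As a point of comparison with the pycurl answer below, here is a minimal standard-library sketch of the same idea in Python 3; the ThreadPoolExecutor approach and the max_workers value are choices made for this sketch, not anything the answer prescribes:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Mirror curl's zero-padded range: 0000, 0001, ..., 3603
urls = ["http://farmsubsidy.org/DE/browse?page=%04d" % n for n in range(3604)]

def fetch(url):
    # Return the page body as bytes, kept in memory rather than on disk
    with urlopen(url) as resp:
        return url, resp.read()

# Fetch several pages at a time; limited to the first 10 here as a quick demo
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = dict(pool.map(fetch, urls[:10]))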
So you have two questions; let me show you in one example. Note that pycurl already does the multi-threaded / not-one-at-a-time part for you, without any hard work on your side.
#! /usr/bin/env python
import pycurl
from io import BytesIO

c1 = pycurl.Curl()
c2 = pycurl.Curl()
c3 = pycurl.Curl()
c1.setopt(c1.URL, "http://www.python.org")
c2.setopt(c2.URL, "http://curl.haxx.se")
c3.setopt(c3.URL, "http://slashdot.org")

# Write each response body into an in-memory buffer, not a disk file
s1 = BytesIO()
s2 = BytesIO()
s3 = BytesIO()
c1.setopt(c1.WRITEFUNCTION, s1.write)
c2.setopt(c2.WRITEFUNCTION, s2.write)
c3.setopt(c3.WRITEFUNCTION, s3.write)

# One multi handle drives all three transfers concurrently
m = pycurl.CurlMulti()
m.add_handle(c1)
m.add_handle(c2)
m.add_handle(c3)

# Number of seconds to wait for a timeout to happen
SELECT_TIMEOUT = 1.0

# Stir the state machine into action
while True:
    ret, num_handles = m.perform()
    if ret != pycurl.E_CALL_MULTI_PERFORM:
        break

# Keep going until all the connections have terminated
while num_handles:
    # The select method uses fdset internally to determine which file
    # descriptors to check
    m.select(SELECT_TIMEOUT)
    while True:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break

# Cleanup
m.remove_handle(c3)
m.remove_handle(c2)
m.remove_handle(c1)
m.close()
c1.close()
c2.close()
c3.close()

print("http://www.python.org is", s1.getvalue())
print("http://curl.haxx.se is", s2.getvalue())
print("http://slashdot.org is", s3.getvalue())
Finally, this code is based mainly on an example from the pycurl site =.=
Maybe you should really read the docs; people have spent a lot of time on them.
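The example above uses three fixed handles; for the asker's ~3600 pages, the same CurlMulti machinery can recycle a small pool of handles, along the lines of the retriever-multi example that ships with pycurl. A sketch of that adaptation follows; NUM_HANDLES, and the habit of stashing url/buf as attributes on each Curl object, are choices made here, not something the answer specifies:

import pycurl
from io import BytesIO

urls = ["http://farmsubsidy.org/DE/browse?page=%04d" % n for n in range(3604)]
results = {}      # url -> page bytes
NUM_HANDLES = 10  # number of concurrent transfers

m = pycurl.CurlMulti()
free = [pycurl.Curl() for _ in range(NUM_HANDLES)]
num_active = 0
queue = list(urls)

while queue or num_active:
    # Start new transfers while spare handles and URLs remain
    while queue and free:
        c = free.pop()
        c.url = queue.pop(0)   # stash the URL on the handle for later
        c.buf = BytesIO()      # per-transfer in-memory buffer
        c.setopt(pycurl.URL, c.url)
        c.setopt(pycurl.WRITEFUNCTION, c.buf.write)
        m.add_handle(c)
        num_active += 1
    # Drive all active transfers
    while True:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # Harvest finished transfers and recycle their handles
    _, ok_list, err_list = m.info_read()
    for c in ok_list:
        results[c.url] = c.buf.getvalue()
        m.remove_handle(c)
        free.append(c)
        num_active -= 1
    for c, errno, errmsg in err_list:
        print("failed:", c.url, errmsg)
        m.remove_handle(c)
        free.append(c)
        num_active -= 1
    # Wait until some transfer has activity, then loop again
    m.select(1.0)

m.close()
for c in free:
    c.close()
print("fetched", len(results), "pages")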