Bla*_*ner 1 python lxml urllib urllib2 beautifulsoup
如何同时下载多个链接?我下面的脚本有效,但一次只下载一个,速度非常慢.我无法弄清楚如何在我的脚本中加入多线程.
Python脚本:
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
url = link.get('href')
name = urlparse.urlparse(url).path.split('/')[-1]
dirname = urlparse.urlparse(url).path.split('.')[-1]
f = urllib2.urlopen(url)
s = f.read()
if (os.path.isdir(dirname) == 0):
os.mkdir(dirname)
soup = BeautifulSoup(s)
articleTag = soup.html.body.article
converted = str(articleTag)
full_path = os.path.join(dirname, name)
open(full_path, 'w').write(converted)
print(name)
Run Code Online (Sandbox Code Playgroud)
HTML文件名为links.html:
<a href="http://www.youversion.com/bible/gen.1.nmv-fas">http://www.youversion.com/bible/gen.1.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.2.nmv-fas">http://www.youversion.com/bible/gen.2.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.3.nmv-fas">http://www.youversion.com/bible/gen.3.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.4.nmv-fas">http://www.youversion.com/bible/gen.4.nmv-fas</a>
Run Code Online (Sandbox Code Playgroud)
我multiprocessing用来并行化东西 - 出于某些原因我比它更喜欢它threading
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
import multiprocessing
print ("downloading and parsing Bibles...")
def download_stuff(link):
url = link.get('href')
name = urlparse.urlparse(url).path.split('/')[-1]
dirname = urlparse.urlparse(url).path.split('.')[-1]
f = urllib2.urlopen(url)
s = f.read()
if (os.path.isdir(dirname) == 0):
os.mkdir(dirname)
soup = BeautifulSoup(s)
articleTag = soup.html.body.article
converted = str(articleTag)
full_path = os.path.join(dirname, name)
open(full_path, 'w').write(converted)
print(name)
root = html.parse(open('links.html'))
links = root.findall('//a')
pool = multiprocessing.Pool(processes=5) #use 5 processes to download the data
output = pool.map(download_stuff,links) #output is a list of [None,None,...] since download_stuff doesn't return anything
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
11914 次 |
| 最近记录: |