How can I download multiple links at the same time? My script below works, but it downloads only one link at a time, which is very slow. I can't figure out how to add multithreading to it.
The Python script:
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if (os.path.isdir(dirname) == 0):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'w').write(converted)
    print(name)
The HTML file, named links.html:
<a href="http://www.youversion.com/bible/gen.1.nmv-fas">http://www.youversion.com/bible/gen.1.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.2.nmv-fas">http://www.youversion.com/bible/gen.2.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.3.nmv-fas">http://www.youversion.com/bible/gen.3.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.4.nmv-fas">http://www.youversion.com/bible/gen.4.nmv-fas</a>
I'm trying to use BeautifulSoup to get a list of the HTML <div> tags, check whether each one has a name attribute, and if so return that attribute's value. Please look at my code:
soup = BeautifulSoup(html) #assume html contains <div> tags with a name attribute
nameTags = soup.findAll('name')
for n in nameTags:
    if n.has_key('name'):
        #get the value of the name attribute
My question is: how do I get the value of the name attribute?
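For reference, one way this is commonly done (a sketch with a toy input string, not the original HTML): findAll can filter on the presence of an attribute, and a Tag then behaves like a dictionary of its attributes.

from BeautifulSoup import BeautifulSoup

html = '<div name="main">x</div><div>y</div>'  # toy stand-in input
soup = BeautifulSoup(html)
# attrs={'name': True} matches any <div> that has a name attribute at
# all; the value is then read with dictionary-style access.
for n in soup.findAll('div', attrs={'name': True}):
    print n['name']        # or n.get('name') to get None instead of KeyError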
Update: wow, you were all right!
For reasons I don't yet understand, I needed "from BeautifulSoup import BeautifulSoup" and to add these lines:
response = br.submit()
print type(response) #new line
raw = br.response().read()#new line
print type(raw)#new line
print type(br.response().read())#new line
cooked = (br.response().read())#new line
soup = BeautifulSoup(cooked)
/update
Hmm, BeautifulSoup and I aren't getting along: it doesn't recognize the result of br.response().read(), even though I have imported BeautifulSoup ...
#snippet:
# Select the first (index zero) form
br.select_form(nr=0)
br.form.set_all_readonly(False)
br['__EVENTTARGET'] = list_of_dates[0]
br['__EVENTARGUMENT'] = 'calMain'
br['__VIEWSTATE'] = viewstate
br['__EVENTVALIDATION'] = eventvalidation
response = br.submit()
print br.response().read() #*#this prints the html I'm expecting*
soup = BeautifulSoup(br.response().read()) #*#but this throws
#TypeError: 'module' object is not callable.
#Yet if I call soup = …
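The update above already points at the cause, so only as a hedged note: import BeautifulSoup binds the name to the module, and calling a module raises exactly this TypeError; the class of the same name lives inside it.

raw = '<html><body>hi</body></html>'      # stand-in for br.response().read()

import BeautifulSoup                      # the name now refers to the MODULE
try:
    soup = BeautifulSoup(raw)             # calling a module object...
except TypeError as e:
    print e                               # 'module' object is not callable

from BeautifulSoup import BeautifulSoup   # the name now refers to the CLASS
soup = BeautifulSoup(raw)                 # this works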
I don't know what to do: I have a 39-line Python script, and it gives me an error at line 40! :( The error:

Traceback (most recent call last):
  File "C:\Mass Storage\pythonscripts\Internet\execute.py", line 2, in <module>
    execfile("firstrunSoup.py")
  File "firstrunSoup.py", line 40

    ^
SyntaxError: invalid syntax
C:\Mass Storage\pythonscripts\Internet>
Here is my Python code:
###firstrunSoup.py###
FILE = open("startURL","r") #Grab from
stURL = FILE.read() #Read first line
FILE.close() #Close
file2save = "index.txt" #File to save URLs to
jscriptV = "not"
try:
    #Returns true/false for absolute
    def is_absolute(url):
        return bool(urlparse.urlparse(url).scheme)
    #Imports
    import urllib2,sys,time,re,urlparse
    from bs4 import BeautifulSoup
    cpURL = urllib2.urlopen(stURL) #Human-readable to computer-usable
    soup = BeautifulSoup(cpURL) #Defines soup
    FILE = open(file2save,"a") …
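The script is truncated above, so this is only a guess at the cause: a try: block whose except/finally clause never arrives is reported as a SyntaxError at the end of the file, which for a 39-line script means line 40. A minimal reproduction:

# A try: with no except/finally is only detected when the parser hits
# the end of the input, so the caret lands one line past the last
# statement -- line 40 in a 39-line script.
src = "try:\n    x = 1\n"                 # no except/finally clause
try:
    compile(src, 'firstrunSoup.py', 'exec')
except SyntaxError as e:
    print e.lineno, e.msg                 # reported past the final line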
Using beautifulsoup I get the HTML code of a website; let's say it is:

<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
How can I add the line body {background-color:#b0c4de;} inside the head tag using beautifulsoup?
Let's say the Python code is:
#!/usr/bin/python
import cgi, cgitb, urllib2, sys
from bs4 import BeautifulSoup
site = "www.example.com"
page = urllib2.urlopen(site)
soup = BeautifulSoup(page)
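A sketch of one way to do it, assuming the CSS rule should sit inside a <style> element (a toy page string stands in for the live fetch): build the tag with new_tag, set its text, and append it to head.

from bs4 import BeautifulSoup

page = "<html><head></head><body><h1>My First Heading</h1></body></html>"
soup = BeautifulSoup(page)

# Build a <style> element, give it the CSS rule as its text,
# and attach it inside <head>.
style = soup.new_tag('style', type='text/css')
style.string = 'body {background-color:#b0c4de;}'
soup.head.append(style)

print soup.prettify()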
I wrote the simple web crawler below in Python, but when I run it, it returns "'NoneType' object is not callable". Can you help me?
import BeautifulSoup
import urllib2

def union(p,q):
    for e in q:
        if e not in p:
            p.append(e)

def crawler(SeedUrl):
    tocrawl=[SeedUrl]
    crawled=[]
    while tocrawl:
        page=tocrawl.pop()
        pagesource=urllib2.urlopen(page)
        s=pagesource.read()
        soup=BeautifulSoup.BeautifulSoup(s)
        links=soup('a')
        if page not in crawled:
            union(tocrawl,links)
            crawled.append(page)
    return crawled

crawler('http://www.princeton.edu/main/')
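A hedged guess at the error: soup('a') returns Tag objects, and union pushes those Tags into tocrawl; when urllib2 later calls .strip() on one, BeautifulSoup 3 treats the unknown attribute as a search for a child <strip> tag and returns None, hence "'NoneType' object is not callable". A sketch that queues href strings instead (urlparse.urljoin also resolves relative links):

import BeautifulSoup
import urllib2
import urlparse

def crawler(SeedUrl):
    tocrawl = [SeedUrl]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page in crawled:
            continue
        s = urllib2.urlopen(page).read()
        soup = BeautifulSoup.BeautifulSoup(s)
        # Queue href STRINGS, not Tag objects, resolved against
        # the URL of the page they were found on.
        for a in soup('a'):
            href = a.get('href')
            if href:
                link = urlparse.urljoin(page, href)
                if link not in crawled and link not in tocrawl:
                    tocrawl.append(link)
        crawled.append(page)
    return crawled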
$ sudo pip install beautifulsoup4
Requirement already satisfied (use --upgrade to upgrade): beautifulsoup4 in /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages
Cleaning up...
Run Code Online (Sandbox Code Playgroud)
I installed beautifulsoup4, apparently successfully, yet I can't import it:
Python 2.7.3 (v2.7.3:70, Apr 9 2012, 20:52:43)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import beautifulsoup4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named beautifulsoup4
>>> import beautifulsoup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: …
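For what it's worth, the PyPI package is named beautifulsoup4 but the module it installs is named bs4, so the import looks like this:

>>> from bs4 import BeautifulSoup       # the package installs as bs4
>>> soup = BeautifulSoup('<p>hello</p>')
>>> soup.p.string
u'hello'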
I am using Python and BeautifulSoup to parse HTML data and pull the p tags out of RSS feeds. However, some URLs cause problems because the parsed soup object is missing some of the document's nodes.

For example, I tried to parse http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm
But after comparing the parsed object with the page's source code, I noticed that all nodes after ul class="nextgen-left" are missing.
Here is how I parse the document:
import cookielib
import urllib2
from bs4 import BeautifulSoup as bs

url = 'http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
request = urllib2.Request(url)
response = opener.open(request)
soup = bs(response,'lxml')
print soup
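One hedged experiment: lxml gives up at the first badly broken markup it meets, while more lenient parsers recover, so re-parsing the same bytes with several parsers shows whether the parser is the culprit (html5lib must be installed separately):

import urllib2
from bs4 import BeautifulSoup as bs

url = 'http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm'
html = urllib2.urlopen(url).read()      # read the bytes once, reuse them

# If lxml silently truncates the tree, a more forgiving parser
# should still contain the ul class="nextgen-left" node.
for parser in ('lxml', 'html.parser', 'html5lib'):
    soup = bs(html, parser)
    print parser, soup.find('ul', 'nextgen-left') is not None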
I'm trying to scrape tabular data from a website with BeautifulSoup4 and Python, and then build an Excel document from the results. So far I have this:
import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://opl.tmhp.com/ProviderManager/SearchResults.aspx?TPI=&OfficeHrs=4&ProgType=STAR&UCCIndicator=No+Preference&Cnty=&NPI=&Srvs=6&Age=All&Gndr=B&SortBy=Distance&ZipCd=78552&SrvsOfrd=0&SpecCd=0&Name=&CntySrvd=0&Plan=H3&WvrProg=0&SubSpecCd=0&AcptPnt=Y&Rad=200&LangCd=99').read())
for row in soup('table', {'class' : 'spad'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
But it doesn't display any data.

Any ideas?
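A hedged debugging sketch: before indexing with [0], check whether the table (and its rows) is actually in the HTML that came back, since ASP.NET search pages often return an empty shell when the expected session state is missing:

import urllib2
from bs4 import BeautifulSoup

url = ('http://opl.tmhp.com/ProviderManager/SearchResults.aspx?'
       'TPI=&OfficeHrs=4&ProgType=STAR&UCCIndicator=No+Preference&Cnty=&NPI=&'
       'Srvs=6&Age=All&Gndr=B&SortBy=Distance&ZipCd=78552&SrvsOfrd=0&SpecCd=0&'
       'Name=&CntySrvd=0&Plan=H3&WvrProg=0&SubSpecCd=0&AcptPnt=Y&Rad=200&LangCd=99')
soup = BeautifulSoup(urllib2.urlopen(url).read())

table = soup.find('table', class_='spad')
if table is None:
    print 'no table with class "spad" in the response -- inspect the raw HTML'
else:
    # Iterate the rows defensively: not every row has two cells,
    # and .string is None for cells that contain nested markup.
    for row in table.find_all('tr'):
        tds = row.find_all('td')
        if len(tds) >= 2:
            print tds[0].get_text(strip=True), tds[1].get_text(strip=True)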
So I'm trying to read data from a Wikipedia page using urllib2/BeautifulSoup. I pasted this code into the terminal:
import urllib2
hdrs = { 'User-Agent': "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11" }
req = urllib2.Request("http://en.wikipedia.org/wiki/List_of_United_States_mobile_phone_companies" , headers = hdrs)
fd = urllib2.urlopen(req)
It works fine. However, when I make this call (dropping the keyword):
req = urllib2.Request("http://en.wikipedia.org/wiki/List_of_United_States_mobile_phone_companies" , hdrs)
I get an error:
TypeError: must be string or buffer, not dict
Why does this happen? I thought keyword arguments were optional in function calls. Thanks for any help!
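The signature explains it: urllib2.Request(url, data=None, headers={}) takes the POST body as its second positional argument, so Request(url, hdrs) hands the dict to the data slot, and urllib2 then demands a string or buffer for it. Two correct spellings:

import urllib2

url = "http://en.wikipedia.org/wiki/List_of_United_States_mobile_phone_companies"
hdrs = {'User-Agent': "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"}

# urllib2.Request(url, data=None, headers={}) -- the SECOND positional
# slot is the POST body, not the header dict.
req = urllib2.Request(url, headers=hdrs)   # keyword skips the data slot
req = urllib2.Request(url, None, hdrs)     # or fill the slots in order
fd = urllib2.urlopen(req)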