Tag: beautifulsoup

Multithreading to speed up downloads

How can I download multiple links at the same time? My script below works, but it only downloads one page at a time and is very slow. I can't figure out how to add multithreading to my script.

Python script:

from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
  url = link.get('href')
  name = urlparse.urlparse(url).path.split('/')[-1]
  dirname = urlparse.urlparse(url).path.split('.')[-1]
  f = urllib2.urlopen(url)
  s = f.read()
  if (os.path.isdir(dirname) == 0): 
    os.mkdir(dirname)
  soup = BeautifulSoup(s)
  articleTag = soup.html.body.article
  converted = str(articleTag)
  full_path = os.path.join(dirname, name)
  open(full_path, 'w').write(converted)
  print(name)

The HTML file, named links.html:

<a href="http://www.youversion.com/bible/gen.1.nmv-fas">http://www.youversion.com/bible/gen.1.nmv-fas</a>

<a href="http://www.youversion.com/bible/gen.2.nmv-fas">http://www.youversion.com/bible/gen.2.nmv-fas</a>

<a href="http://www.youversion.com/bible/gen.3.nmv-fas">http://www.youversion.com/bible/gen.3.nmv-fas</a>

<a href="http://www.youversion.com/bible/gen.4.nmv-fas">http://www.youversion.com/bible/gen.4.nmv-fas</a>
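One common approach is to wrap the per-link work in a function and hand it to a thread pool. The sketch below uses Python 3's `concurrent.futures` rather than the question's Python 2 `urllib2`, and `download` is a stand-in for the real fetch-parse-write logic, so treat it as an outline rather than a drop-in fix:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download(url):
    # Placeholder for the real work: fetch the page, parse it with
    # BeautifulSoup, and write the <article> contents to disk.
    return "%s: done" % url

urls = ["http://www.youversion.com/bible/gen.%d.nmv-fas" % i
        for i in range(1, 5)]

# Run up to 4 downloads concurrently instead of one after another.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(download, u): u for u in urls}
    for fut in as_completed(futures):
        print(fut.result())
```

Because the work is I/O-bound (waiting on the network), threads give a real speedup here despite the GIL.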

python lxml urllib urllib2 beautifulsoup

1 vote · 1 answer · 10k views

Getting a list of tags and getting attribute values in BeautifulSoup

I'm trying to use BeautifulSoup to get a list of HTML <div> tags, check whether they have a name attribute, and if so return that attribute's value. Please see my code:

soup = BeautifulSoup(html) #assume html contains <div> tags with a name attribute
nameTags = soup.findAll('name') 
for n in nameTags:
    if n.has_key('name'):
       #get the value of the name attribute

My question is: how do I get the value of the name attribute?
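A sketch of one way to do this with bs4. Note the question's code searches for tags *named* `name`, while the text asks about `<div>` tags that *carry* a name attribute; the HTML string below is made up for illustration:

```python
from bs4 import BeautifulSoup

html = '<div name="first">a</div><div>b</div><div name="second">c</div>'
soup = BeautifulSoup(html, "html.parser")

# find_all('div') returns the <div> tags; tag.get() returns None when
# the attribute is absent, so it doubles as the has-attribute check.
names = [d.get("name") for d in soup.find_all("div") if d.get("name")]
print(names)  # → ['first', 'second']
```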

python beautifulsoup

1 vote · 1 answer · 8,363 views

Python / parsing: BeautifulSoup error "module obj is not callable" with a result from Mechanize

更新:哇,你们所有人都是对的!
由于我还不明白的原因,我需要:"来自BeautifulSoup导入BeautifulSoup"并添加行:

response = br.submit()
print type(response) #new line
raw = br.response().read()#new line
print type(raw)#new line
print type(br.response().read())#new line
cooked = (br.response().read())#new line
soup = BeautifulSoup(cooked)

/UPDATE

Hmm, BeautifulSoup and I don't recognize the result of br.response().read(). I have already imported BeautifulSoup …

#snippet:
# Select the first (index zero) form
br.select_form(nr=0)
br.form.set_all_readonly(False)
br['__EVENTTARGET'] = list_of_dates[0]
br['__EVENTARGUMENT'] = 'calMain'
br['__VIEWSTATE'] = viewstate
br['__EVENTVALIDATION'] = eventvalidation

response = br.submit()
print br.response().read() #*#this prints the html I'm expecting*

soup = BeautifulSoup(br.response().read()) #*#but this throws 
#TypeError: 'module' object is not callable.  
#Yet if I call soup = …
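The update already names the fix: `import BeautifulSoup` binds the *module*, and calling a module raises exactly this TypeError; the class inside it must be imported or qualified. A minimal sketch of the distinction, using bs4 here where the same rule applies:

```python
import bs4                      # the module
from bs4 import BeautifulSoup   # the class inside it

html = "<p>hi</p>"

# Calling the module itself fails with the error from the question.
try:
    bs4(html)
except TypeError as e:
    print(e)  # 'module' object is not callable

# Either of these works: the imported class, or module.ClassName.
soup1 = BeautifulSoup(html, "html.parser")
soup2 = bs4.BeautifulSoup(html, "html.parser")
print(soup1.p.string, soup2.p.string)
```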

python mechanize beautifulsoup

1 vote · 1 answer · 2,526 views

Python error on a line that doesn't exist

I don't know what to do. I have a 39-line Python script, and it gives me an error on line 40! :( The error:

Traceback (most recent call last):
File "C:\Mass Storage\pythonscripts\Internet\execute.py", line 2, in <module>
execfile("firstrunSoup.py")
File "firstrunSoup.py", line 40

                                ^
SyntaxError: invalid syntax

C:\Mass Storage\pythonscripts\Internet>

Here is my Python code:

###firstrunSoup.py###
FILE = open("startURL","r") #Grab from
stURL = FILE.read() #Read first line
FILE.close() #Close
file2save = "index.txt" #File to save URLs to

jscriptV = "not"
try:
    #Returns true/false for absolute
    def is_absolute(url):
        return bool(urlparse.urlparse(url).scheme)

    #Imports
    import urllib2,sys,time,re,urlparse
    from bs4 import BeautifulSoup

    cpURL = urllib2.urlopen(stURL) #Human-readable to computer-usable
    soup = BeautifulSoup(cpURL) #Defines soup

    FILE = open(file2save,"a") …
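A SyntaxError reported past the last line of a file usually means something was left open and the parser only noticed at end of input; in the snippet above, the `try:` block appears to have no matching `except`/`finally`. A hedged sketch of the effect using `compile()` on a made-up source string (the exact message and line number vary by Python version):

```python
# A try: with no except/finally clause is only discovered when the
# input runs out, so the reported position can fall past the code
# you actually wrote.
src = "try:\n    x = 1\n"
try:
    compile(src, "firstrunSoup.py", "exec")
except SyntaxError as e:
    print(e.msg, "at line", e.lineno)
```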

python beautifulsoup python-2.7

1 vote · 1 answer · 785 views

How do I add a background color to HTML code using BeautifulSoup?

Using BeautifulSoup I get the HTML code of a website; let's say it is:

<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>

How can I use BeautifulSoup to add the line body {background-color:#b0c4de;} inside the head tag?

Let's say the Python code is:

#!/usr/bin/python

import cgi, cgitb, urllib2, sys
from bs4 import BeautifulSoup

site = "www.example.com"
page = urllib2.urlopen(site)
soup = BeautifulSoup(page)
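A sketch with bs4's `new_tag`: build a `<style>` element, give it the CSS rule as its text, and append it to `<head>`. The question's example page is inlined as a string here instead of being fetched with urllib2:

```python
from bs4 import BeautifulSoup

html = """<!DOCTYPE html>
<html><head></head>
<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Create a <style> element carrying the rule and attach it to <head>.
style = soup.new_tag("style")
style.string = "body {background-color:#b0c4de;}"
soup.head.append(style)

print(soup.head)
```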

python beautifulsoup

1 vote · 1 answer · 1,830 views

Simple web crawler

I wrote the very simple web crawler below in Python, but when I run it, it returns "'NoneType' object is not callable". Can you help me?

import BeautifulSoup
import urllib2
def union(p,q):
    for e in q:
        if e not in p:
            p.append(e)

def crawler(SeedUrl):
    tocrawl=[SeedUrl]
    crawled=[]
    while tocrawl:
        page=tocrawl.pop()
        pagesource=urllib2.urlopen(page)
        s=pagesource.read()
        soup=BeautifulSoup.BeautifulSoup(s)
        links=soup('a')        
        if page not in crawled:
            union(tocrawl,links)
            crawled.append(page)

    return crawled
crawler('http://www.princeton.edu/main/')
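One likely cause, sketched below: `soup('a')` yields Tag objects, not URL strings, so what ends up in `tocrawl` cannot be passed to `urlopen` (and anchors without an `href` make things worse). Extracting the href values first avoids that; the page string here is made up for illustration:

```python
from bs4 import BeautifulSoup

page = ('<a href="http://a.example/">A</a>'
        '<a>no href</a>'
        '<a href="http://b.example/">B</a>')
soup = BeautifulSoup(page, "html.parser")

# Keep only real URL strings; skip anchors that lack an href attribute.
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
print(links)  # → ['http://a.example/', 'http://b.example/']
```

In the crawler, `union(tocrawl, links)` would then queue strings that `urllib2.urlopen` can actually fetch.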

beautifulsoup python-2.7

1 vote · 1 answer · 10k views

Why can't I import beautifulsoup with Python 2.7 after installing it with pip and/or easy_install?

$ sudo pip install beautifulsoup4
Requirement already satisfied (use --upgrade to upgrade): beautifulsoup4 in /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages
Cleaning up...

I have installed beautifulsoup4, and the install seems to have completed successfully, but I can't import it:

Python 2.7.3 (v2.7.3:70, Apr  9 2012, 20:52:43) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import beautifulsoup4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named beautifulsoup4
>>> import beautifulsoup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: …
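The package is installed under the name beautifulsoup4, but the module it provides is named `bs4`, so that is what must be imported. A one-line sketch:

```python
# pip install beautifulsoup4  ->  the importable module is named bs4
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.string)  # → ok
```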

python pip beautifulsoup easy-install python-2.7

1 vote · 1 answer · 2,178 views

BeautifulSoup loses nodes

I'm using Python and BeautifulSoup to parse HTML data and get the p-tags out of RSS feeds. However, some URLs cause problems because the parsed soup object is missing some of the document's nodes.

For example, I tried to parse http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm

But after comparing the parsed object with the page's source code, I noticed that all nodes after ul class="nextgen-left" are missing.

Here is how I parse the document:

import cookielib
import urllib2

from bs4 import BeautifulSoup as bs

url = 'http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm'

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
request = urllib2.Request(url)

response = opener.open(request) 

soup = bs(response,'lxml')        
print soup
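Dropped nodes after a certain element are typically a parser issue: each parser recovers differently from malformed markup, and a strict one may silently discard everything after the point where the HTML breaks. A hedged sketch with a deliberately broken snippet; `html.parser` is used here because it ships with Python, while `html5lib`, if installed, recovers most like a browser does:

```python
from bs4 import BeautifulSoup

# </div> closes a tag that was never opened; real pages contain this
# kind of markup, and each parser tree-builder recovers differently.
broken = "<p>before</div><ul class='nextgen-left'><li>after</li></ul>"

# Swap "html.parser" for "lxml" or "html5lib" to compare the trees.
soup = BeautifulSoup(broken, "html.parser")
print([li.string for li in soup.find_all("li")])
```

If the lenient parser keeps the nodes that `lxml` drops, switching the parser argument is the whole fix.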

python beautifulsoup html5lib

1 vote · 1 answer · 1,241 views

Scraping table data from a website

I'm trying to scrape table data from a website using BeautifulSoup4 and Python, and then create an Excel document from the results. So far I have this:

import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://opl.tmhp.com/ProviderManager/SearchResults.aspx?TPI=&OfficeHrs=4&ProgType=STAR&UCCIndicator=No+Preference&Cnty=&NPI=&Srvs=6&Age=All&Gndr=B&SortBy=Distance&ZipCd=78552&SrvsOfrd=0&SpecCd=0&Name=&CntySrvd=0&Plan=H3&WvrProg=0&SubSpecCd=0&AcptPnt=Y&Rad=200&LangCd=99').read())

for row in soup('table', {'class' : 'spad'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string

But it fails to display the data.

Any ideas?
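When the selectors look right but nothing prints, the first thing to check is whether the fetched HTML actually contains the table (ASP.NET search-results pages like this one often depend on session state from a prior request, so a bare `urlopen` can return a page without results). As a hedged illustration on a static, made-up table, the row-and-cell iteration from the question works like this:

```python
from bs4 import BeautifulSoup

html = """<table class="spad">
<tbody>
<tr><td>Name</td><td>Distance</td></tr>
<tr><td>Clinic A</td><td>2 mi</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# soup(...) is shorthand for find_all(); tbody('tr') finds the rows.
for row in soup("table", {"class": "spad"})[0].tbody("tr"):
    tds = row("td")
    rows.append((tds[0].string, tds[1].string))
print(rows)
```

So the iteration itself is sound; printing `soup.prettify()` on the real response would show whether the `spad` table ever arrived.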

python beautifulsoup web-scraping

1 vote · 1 answer · 834 views

Keyword arguments in Python

So I'm trying to read data from a Wikipedia page using urllib2/BeautifulSoup. I copied this code into the terminal:

import urllib2

hdrs = { 'User-Agent': "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11" } 
req = urllib2.Request("http://en.wikipedia.org/wiki/List_of_United_States_mobile_phone_companies" , headers = hdrs)
fd = urllib2.urlopen(req) 

It works fine. However, when I make this call (removing the keyword argument):

req = urllib2.Request("http://en.wikipedia.org/wiki/List_of_United_States_mobile_phone_companies" , hdrs)

I get an error:

 TypeError: must be string or buffer, not dict

Why does this happen? I thought keyword arguments were optional in function calls. Thanks for your help!
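Keyword arguments are optional, but position still matters: `Request`'s *second* positional parameter is `data` (the POST body), not `headers`, so dropping the keyword hands the dict to `data` and triggers the TypeError. A sketch with Python 3's `urllib.request`, where the parameter order is the same:

```python
from urllib.request import Request

hdrs = {"User-Agent": "Mozilla/5.0"}

# Request(url, data=None, headers={}, ...): the second positional slot
# is the request body, so a headers dict must be passed by keyword.
req = Request("http://en.wikipedia.org/wiki/Python", headers=hdrs)
print(req.get_header("User-agent"))  # → Mozilla/5.0
```

Passing `hdrs` positionally would bind it to `data` instead, which is exactly the question's mistake.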

python urllib2 beautifulsoup keyword-argument

1 vote · 1 answer · 517 views