Tag: beautifulsoup

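BeautifulSoup: AttributeError: 'NavigableString' object has no attribute 'name'

Do you know why the first example in the BeautifulSoup tutorial, http://www.crummy.com/software/BeautifulSoup/documentation.html#QuickStart, raises AttributeError: 'NavigableString' object has no attribute 'name'? According to this answer, whitespace characters in the HTML cause the problem. I tried the source of several pages: one worked, and the others gave the same error (even after I removed the whitespace). Can you explain what "name" refers to and why this error happens? Thanks.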
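
As a hedged illustration of what "name" refers to: a parsed tree mixes Tag objects, which carry a tag name such as 'p', with NavigableString objects for text, including the whitespace between tags. The sketch below assumes the newer bs4 package on Python 3 rather than the BeautifulSoup 3 release the tutorial describes, and the sample markup is made up. It filters with isinstance so that text nodes are never asked for a tag name.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical sample markup; the newlines between tags become text nodes.
html = "<body>\n<p>one</p>\n<b>two</b>\n</body>"
soup = BeautifulSoup(html, "html.parser")

children = list(soup.body.children)

# Whitespace between tags is parsed as NavigableString, not Tag.
text_nodes = [c for c in children if isinstance(c, NavigableString)]

# Only Tag objects describe an element, so only they are asked for .name.
tag_names = [c.name for c in children if isinstance(c, Tag)]

print(len(text_nodes), tag_names)
```

Looping over children and reading .name on every node is the pattern that tends to trip over whitespace; checking the node type first avoids it.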

python beautifulsoup

16
votes

3
answers

40k
views

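How do I reinstall lxml?

I am using Python 2.7.5 and beautifulsoup 4.2.1 on Mac OS X 10.7.5. I want to parse an XML page with the lxml library, as described in the BeautifulSoup tutorial. However, when I run my code, it shows: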

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested:
lxml,xml. Do you need to install a parser library?

I am sure I have installed lxml by every method: easy_install, pip, port, and so on. I tried adding a line to my code to check whether lxml is installed:

import lxml

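Python then runs past this line without complaint and shows the same error message again, at the same line as before.

So I am fairly sure that lxml is installed, but not installed correctly. I decided to uninstall lxml and then reinstall it the 'correct' way. But when I type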

easy_install -m  lxml

it shows:

Searching for lxml
Best match: lxml 3.2.1
Processing lxml-3.2.1-py2.7-macosx-10.6-intel.egg

Using /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/lxml-3.2.1-py2.7-macosx-10.6-intel.egg

Because this distribution was installed --multi-version, before you can
import modules from this package in an application, you will need to
'import pkg_resources' and then use a 'require()' call similar to one of
these examples, …
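
One way to narrow down this kind of problem is to check which interpreter is running and where, if anywhere, it finds lxml. The following is a minimal sketch for Python 3 (the question itself uses Python 2.7). The helper names lxml_location and parse_xml are hypothetical, and the fallback to html.parser is only an illustrative assumption, since html.parser is not a real XML parser.

```python
import importlib.util
import sys

from bs4 import BeautifulSoup, FeatureNotFound


def lxml_location():
    """Return the path of the lxml package this interpreter sees, or None."""
    spec = importlib.util.find_spec("lxml")
    return spec.origin if spec else None


def parse_xml(markup):
    """Parse with the lxml-backed "xml" feature, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "xml")
    except FeatureNotFound:
        # Illustrative fallback only: html.parser treats the input as HTML.
        return BeautifulSoup(markup, "html.parser")


print("interpreter:", sys.executable)
print("lxml found at:", lxml_location())

soup = parse_xml("<root><item>1</item></root>")
print(soup.find("item").text)
```

If the interpreter printed here is not the one that easy_install or pip installed into, that mismatch would explain an lxml that is installed yet not found.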

python lxml beautifulsoup easy-install

16
votes

3
answers

30k
views

Python requests: requests.exceptions.TooManyRedirects: Exceeded 30 redirects

I am trying to scrape this page using the python-requests library:

import requests
from lxml import etree,html

url = 'http://www.amazon.in/b/ref=sa_menu_mobile_elec_all?ie=UTF8&node=976419031'
r = requests.get(url)
tree = etree.HTML(r.text)
print tree

But I get the error above (TooManyRedirects). I tried the allow_redirects parameter, but got the same error:

r = requests.get(url, allow_redirects=True)

I even tried sending headers and data along with the URL, but I am not sure whether this is the right way to do it.

headers = {'content-type': 'text/html'}
payload = {'ie':'UTF8','node':'976419031'}
r = requests.post(url,data=payload,headers=headers,allow_redirects=True)

How can I resolve this error? Out of curiosity I even tried Beautiful Soup, and I got a different but similar error:

page = BeautifulSoup(urllib2.urlopen(url))

urllib2.HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently
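
To see where such a loop comes from, it can help to follow the redirects one hop at a time instead of letting the library follow them. The sketch below is written for Python 3. trace_redirects is a hypothetical helper, not part of requests, and its get parameter exists only so that a stub can stand in for the network. Whether the real server stops looping once it receives cookies or browser-like headers is an assumption that would need checking.

```python
from urllib.parse import urljoin

import requests


def trace_redirects(url, get=requests.get, max_hops=10):
    """Follow redirects one hop at a time and return the list of URLs visited."""
    visited = [url]
    for _ in range(max_hops):
        # allow_redirects=False returns the 3xx response itself.
        response = get(url, allow_redirects=False, timeout=10)
        location = response.headers.get("Location")
        if response.status_code not in (301, 302, 303, 307, 308) or not location:
            break
        url = urljoin(url, location)  # the Location header may be relative
        if url in visited:
            visited.append(url)  # record the repeat, then stop: this is a loop
            break
        visited.append(url)
    return visited


# Example call (needs network access, so it is left commented out):
# print(trace_redirects("http://www.amazon.in/b/ref=sa_menu_mobile_elec_all?ie=UTF8&node=976419031"))
```

A URL that appears twice in the returned list marks the point where the server starts sending the client in circles.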

python beautifulsoup python-2.7 python-requests

16
votes

2
answers

20k
views

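How can I use a CSS selector with BeautifulSoup to retrieve specific links located inside a certain class?

I am new to Python and am learning it for scraping purposes. I am using BeautifulSoup to collect links (i.e. the href of 'a' tags). I am trying to collect the links under the "UPCOMING EVENTS" tab of the site http://allevents.in/lahore/. I am using Firebug to inspect the element and get the CSS path, but this code returns nothing. I am looking for a fix, and also for some advice on how to choose proper CSS selectors to retrieve the desired links from any site. I wrote this code: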

from bs4 import BeautifulSoup

import requests

url = "http://allevents.in/lahore/"

r  = requests.get(url)

data = r.text

soup = BeautifulSoup(data)
for link in soup.select( 'html body div.non-overlay.gray-trans-back div.container div.row div.span8 div#eh-1748056798.events-horizontal div.eh-container.row ul.eh-slider li.h-item div.h-meta div.title a[href]'):
    print link.get('href')
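
A full path copied from a browser inspector is brittle: it fails as soon as one ancestor differs in the HTML the server actually sends, and an id such as eh-1748056798 may well be generated per page load (an assumption here). A shorter selector anchored on a class close to the target is usually more robust. The sketch below uses Python 3 and made-up markup loosely modelled on the selector in the question, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on the selector in the question.
html = """
<div id="eh-1748056798" class="events-horizontal">
  <ul class="eh-slider">
    <li class="h-item"><div class="title"><a href="/e/1">One</a></div></li>
    <li class="h-item"><div class="title"><a href="/e/2">Two</a></div></li>
  </ul>
</div>
<a href="/about">About</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Anchor on a class near the target rather than on the full path from <html>.
links = [a["href"] for a in soup.select("li.h-item div.title a[href]")]
print(links)
```

If a short selector still returns nothing, it is worth printing r.text and checking whether the elements exist there at all, since content added by JavaScript never reaches BeautifulSoup.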

css python firebug beautifulsoup css-selectors

16
votes

3
answers

40k
views

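A BeautifulSoup-like scraper for Node.js

I am a former Python developer and used BS4 for several years. Now I develop in Node.js, and yes, the cheerio package is very good, but I need something like BS4 for scraping in Node.

Is there an alternative to cheerio? Thanks!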

javascript beautifulsoup node.js web-scraping cheerio

16
votes

1
answer

7420
views

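Get all HTML tags with Beautiful Soup

I want to get a list of all the HTML tags from Beautiful Soup.

I have seen find_all, but I have to know the name of the tag before I search.

Given text such as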

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""

how can I get a list like this

list_of_tags = ["<div>", "<div>", "<div class='magical'>", "<p>"]

I know how to do this with a regular expression, but I am trying to learn BS4.
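
One possible approach, sketched here for Python 3 with bs4, relies on find_all(True), which matches every tag regardless of its name. The opening_tag helper is hypothetical: it rebuilds an opening-tag string from each Tag's name and attributes and uses single quotes to mirror the list shown above.

```python
from bs4 import BeautifulSoup

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""
soup = BeautifulSoup(html, "html.parser")


def opening_tag(tag):
    """Rebuild an opening tag string from a Tag's name and attributes."""
    parts = [tag.name]
    for key, value in tag.attrs.items():
        # Multi-valued attributes such as class are stored as lists.
        if isinstance(value, list):
            value = " ".join(value)
        parts.append("%s='%s'" % (key, value))
    return "<%s>" % " ".join(parts)


# find_all(True) matches every tag, whatever its name.
list_of_tags = [opening_tag(tag) for tag in soup.find_all(True)]
print(list_of_tags)
```

If only the tag names are needed, [tag.name for tag in soup.find_all(True)] is enough.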

html python beautifulsoup

16
votes

2
answers

20k
views

Arguments of the find function

I am using Beautiful Soup (in Python). I have a hidden input object like this:

<input type="hidden" name="form_build_id" id="form-531f740522f8c290ead9b88f3da026d2" value="form-531f740522f8c290ead9b88f3da026d2"  />

I need the id/value.

Here is my code:

mainPageData = cookieOpener.open('http://page.com').read()
soupHandler = BeautifulSoup(mainPageData)

areaId = soupHandler.find('input', name='form_build_id', type='hidden')

TypeError: find() got multiple values for keyword argument 'name'

I tried changing the code:

print soupHandler.find(name='form_build_id', type='hidden')
None

What is wrong?
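
The likely cause is a name clash: the first parameter of find() is itself called name and means the tag name. Passing name='form_build_id' together with 'input' therefore supplies that parameter twice, and passing it alone searches for a tag called form_build_id. A minimal sketch for Python 3 with bs4, using an inline copy of the markup instead of the real page, routes the HTML attribute through the attrs dictionary:

```python
from bs4 import BeautifulSoup

html = ('<form><input type="hidden" name="form_build_id" '
        'id="form-531f740522f8c290ead9b88f3da026d2" '
        'value="form-531f740522f8c290ead9b88f3da026d2" /></form>')
soup = BeautifulSoup(html, "html.parser")

# find()'s first parameter is called "name" (the tag name), so an HTML
# attribute called "name" has to go through the attrs dictionary instead.
field = soup.find("input", attrs={"name": "form_build_id", "type": "hidden"})
area_id = field["id"]
print(area_id, field["value"])
```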

python beautifulsoup find

15
votes

1
answer

5736
views

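Cloning an element with BeautifulSoup

I have to copy part of one document into another, but I don't want to modify the document I copy from.

If I use .extract(), it removes the element from the tree. If I just append the selected element, as in document2.append(document1.tag), it still removes the element from document1.

Since I work with real files, I could simply not save document1 after the modification, but is there a way to do this without damaging the document?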
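
One option, assuming a bs4 release in which Tag objects support copy.copy() (to my knowledge 4.4.0 and later), is to append a copy rather than the original element. The sketch below uses Python 3 and two small made-up documents.

```python
import copy

from bs4 import BeautifulSoup

# Hypothetical stand-ins for the two documents.
document1 = BeautifulSoup("<div><p id='keep'>hello</p></div>", "html.parser")
document2 = BeautifulSoup("<section></section>", "html.parser")

# copy.copy() on a Tag returns a detached duplicate, so appending it
# does not pull the original element out of document1.
clone = copy.copy(document1.p)
document2.section.append(clone)

print(document1)
print(document2)
```

On older releases without copy support, re-parsing str(document1.p) into a new BeautifulSoup object is a cruder way to get a detached duplicate.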

python beautifulsoup

15
votes

3
answers

5342
views

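Web scraping SEC EDGAR 10-K and 10-Q filings

Does anyone have experience with scraping SEC 10-K and 10-Q filings? I got stuck while trying to scrape monthly realised share repurchases from these filings. Specifically, I would like to get the following information for each month from 2004 to 2014: 1. the period; 2. the total number of shares purchased; 3. the average price paid per share; 4. the total number of shares purchased as part of publicly announced plans or programs; 5. the maximum number (or approximate dollar value) of shares that may yet be purchased under the plans or programs. I have more than 90,000 forms to parse in total, so doing it manually is not feasible.

This information is usually reported in 10-Ks under "Part 2 Item 5 Market for Registrant's Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities" and in 10-Qs under "Part 2 Item 2 Unregistered Sales of Equity Securities and Use of Proceeds".

Here is an example of a 10-Q filing that I need to parse: https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm

If a company has no share repurchases, this table may be missing from the quarterly report.

I tried parsing the HTML files with Python BeautifulSoup, but the results are unsatisfactory, mainly because these files are not written in a consistent format.

For example, the only way I can think of to parse these forms is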

from bs4 import BeautifulSoup
import requests
import unicodedata
import re

url='https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm'

def parse_html(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')
    tables = soup.find_all('table') 

    identifier = re.compile(r'Total.*Number.*of.*Shares.*\w*Purchased.*', re.UNICODE|re.IGNORECASE|re.DOTALL)

    n = len(tables) -1
    rep_tables = []

    while n >= 0:
        table = tables[n]
        remove_invalid_tags(table)
        table_text = unicodedata.normalize('NFKD', table.text).encode('ascii','ignore')
        if re.search(identifier, table_text):
            rep_tables += [table]
            n -= 1
        else:
            n -= 1

    return rep_tables

def remove_invalid_tags(soup, invalid_tags=['sup', 'br']): …

beautifulsoup web-scraping edgar

15
votes

1
answer

10k
views

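Disable special "class" attribute handling

The story:

When parsing HTML with BeautifulSoup, the class attribute is treated as a multi-valued attribute and handled in a special way:

Remember that a single tag can have multiple values for its "class" attribute. When you search for a tag that matches a certain CSS class, you're matching against any of its CSS classes.

Also, a quote from the built-in HTMLTreeBuilder, which BeautifulSoup uses as the base for other tree builder classes such as HTMLParserTreeBuilder: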

# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'.  When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon …
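
For comparison, more recent bs4 releases document a multi_valued_attributes constructor argument that switches this splitting off. The sketch below uses Python 3 and assumes a release new enough to accept that argument. It may not exist in the version the question was written against.

```python
from bs4 import BeautifulSoup

html = '<p class="body strikeout">text</p>'

# Default behaviour: class is split into a list of values.
default_soup = BeautifulSoup(html, "html.parser")

# Assumption: this bs4 release accepts multi_valued_attributes.
# Passing None keeps every attribute value as a single string.
plain_soup = BeautifulSoup(html, "html.parser", multi_valued_attributes=None)

print(default_soup.p["class"])
print(plain_soup.p["class"])
```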

html python beautifulsoup html-parsing

15
votes

1
answer

379
views

Tag statistics

beautifulsoup ×10

html ×2

web-scraping ×2

css ×1

css-selectors ×1

easy-install ×1

find ×1

html-parsing ×1

lxml ×1

python-requests ×1
