Tag: beautifulsoup

example.com/events/
    <a href="http://example.com/events/1">Event 1</a>
    <a href="http://example.com/events/2">Event 2</a>

example.com/events/1
    ...some detail stuff I need

example.com/events/2
    ...some detail stuff I need


html python parsing beautifulsoup

tim*_*tim

lucky-day

25
Score

3
Answers

50k
Views

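Python and BeautifulSoup encoding issues

I'm writing a crawler in Python using BeautifulSoup, and everything was going swimmingly until I ran into this site:

http://www.elnorte.ec/

I'm fetching the content with the requests library: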

r = requests.get('http://www.elnorte.ec/')
content = r.content


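If I print the content variable at that point, all the Spanish special characters seem to be fine. However, as soon as I feed the content variable to BeautifulSoup, everything gets messed up: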

soup = BeautifulSoup(content)
print(soup)
...
<a class="blogCalendarToday" href="/component/blog_calendar/?year=2011&amp;month=08&amp;day=27&amp;modid=203" title="1009 artÃculos en este dÃa">
...


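It is apparently garbling all the Spanish special characters (accents and so on). I have tried content.decode('utf-8') and content.decode('latin-1'), and I have also tried passing the fromEncoding parameter to BeautifulSoup, setting it to fromEncoding='utf-8' and fromEncoding='latin-1', but still no dice.

Any pointers would be much appreciated.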
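One possible cause, offered as an assumption rather than a diagnosis of this particular site: the bytes are UTF-8, but something in the page (for example a `<meta>` tag) declares a different charset, so the parser decodes them as Latin-1. The sketch below uses a made-up byte string in place of `r.content` and the bs4 spelling `from_encoding` (BeautifulSoup 3 called it `fromEncoding`) to force UTF-8.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.content: UTF-8 bytes whose <meta> tag
# (an assumption made for this sketch) claims the page is ISO-8859-1.
content = (
    '<html><head><meta charset="iso-8859-1"></head><body>'
    '<a class="blogCalendarToday" title="1009 artículos en este día">27</a>'
    '</body></html>'
).encode('utf-8')

# Decoding UTF-8 bytes as Latin-1 reproduces the kind of garbling
# shown in the question ("artÃculos").
garbled = content.decode('latin-1')

# Stating the real encoding explicitly takes precedence over the
# charset the document declares.
soup = BeautifulSoup(content, 'html.parser', from_encoding='utf-8')
title = soup.find('a', class_='blogCalendarToday')['title']
```

If `title` still comes out garbled with the real page, the bytes themselves may not be valid UTF-8, in which case inspecting `r.content` directly is the next step.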

python unicode beautifulsoup utf-8

Dav*_*vid

lucky-day

25
Score

4
Answers

50k
Views

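How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?

I'm running a Python program which fetches a UTF-8-encoded web page, and I extract some text from the HTML using BeautifulSoup.

However, when I write this text to a file (or print it on the console), it gets written in an unexpected encoding.

Sample program: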

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch URL
url = 'http://www.voxnow.de/'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')

# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)

# Parse with BeautifulSoup
soup = BeautifulSoup(response)

# Print title attribute of a <div> which uses umlauts (e.g. können)
print repr(soup.find('div', id='navbutton_account')['title'])


Running this gives the result:

# u'Hier k\u0102\u015bnnen Sie sich kostenlos registrieren und / oder einloggen!'


But I would expect a Python Unicode string to render the ö in the word können as \xf6:

# …
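The repr shown above is consistent with the UTF-8 bytes for ö (0xC3 0xB6) having been decoded as ISO-8859-2, where those two bytes map to U+0102 and U+015B. That is an inference from the output, not something the question states. A small standard-library check, written in Python 3 syntax (unlike the Python 2 program above):

```python
# The word comes from the question; the ISO-8859-2 guess is an inference.
word = 'k\u00f6nnen'                # "können", with ö written as an escape

raw = word.encode('utf-8')          # the bytes a UTF-8 page would carry
wrong = raw.decode('iso-8859-2')    # what a wrong charset guess produces
right = raw.decode('utf-8')         # what the asker expects to see

print(ascii(wrong))   # 'k\u0102\u015bnnen'
print(ascii(right))   # 'k\xf6nnen'
```

If the same mismatch is what happens inside the parser, telling it the encoding explicitly (or decoding the response bytes as UTF-8 before parsing) would be the thing to try.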


python unicode urllib2 beautifulsoup utf-8

Chr*_*Orr

lucky-day

25
Score

1
Answers

70k
Views

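Is there an OrderedDict comprehension?

I don't know if there is such a thing, but I'm trying to write an ordered dict comprehension. It doesn't seem to work, though: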

import requests
from bs4 import BeautifulSoup
from collections import OrderedDict


soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
t_data = OrderedDict()
rows = tables[1].find_all('tr')
t_data = {row.th.text: row.td.text for row in rows if row.td }


Right now it is still a normal dict comprehension (I have also left out the usual requests-to-soup boilerplate). Any ideas?
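There is no dedicated comprehension syntax for `OrderedDict`, but its constructor accepts any iterable of key/value pairs, so a generator expression gives much the same effect. A minimal sketch, with a hypothetical list of `(header, cell)` tuples standing in for the `<tr>` rows (a cell of `None` plays the role of a missing `row.td`):

```python
from collections import OrderedDict

# Hypothetical stand-in for the parsed rows: (th text, td text or None).
rows = [('Name', 'Ada'), ('Heading only', None), ('Born', '1815'), ('Field', 'Maths')]

# Generator expression passed to the constructor; insertion order is kept.
t_data = OrderedDict((th, td) for th, td in rows if td is not None)

print(list(t_data.items()))
```

With the soup from the question, the generator would read `(row.th.text, row.td.text) for row in rows if row.td`.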

python beautifulsoup

Yun*_*nti

2016 07-22

25
Score

1
Answers

7566
Views

How to parse an HTML table with rowspans in Python?

The problem

I'm trying to parse an HTML table with rowspans in it, as in, I'm trying to parse my college schedule.

I'm running into the problem that if the previous row contains a cell with a rowspan, the next row is missing a TD: the cell covered by the rowspan simply isn't there.

I have no clue how to account for this and I hope to be able to parse this schedule.

What I tried

Pretty much everything I can think …
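One common way to handle this, sketched here under assumptions, is to remember which columns are still covered by a rowspan from an earlier row and fill them in before reading the next real cell. The helper name `table_to_grid` and the tiny schedule are made up for illustration, and the sketch ignores nested tables and malformed span values.

```python
from bs4 import BeautifulSoup

def table_to_grid(table):
    """Expand rowspan/colspan so every row has one entry per column."""
    grid = []
    pending = {}  # column index -> (rows still to fill, cell text)
    for tr in table.find_all('tr'):
        cells = tr.find_all(['td', 'th'], recursive=False)
        row, col, i = [], 0, 0
        while i < len(cells) or col in pending:
            if col in pending:
                # This column is covered by a rowspan from an earlier row.
                remaining, text = pending.pop(col)
                row.append(text)
                if remaining > 1:
                    pending[col] = (remaining - 1, text)
                col += 1
                continue
            cell = cells[i]
            i += 1
            text = cell.get_text(strip=True)
            rowspan = int(cell.get('rowspan', 1))
            colspan = int(cell.get('colspan', 1))
            for _ in range(colspan):
                row.append(text)
                if rowspan > 1:
                    pending[col] = (rowspan - 1, text)
                col += 1
        grid.append(row)
    return grid

# Hypothetical schedule: "Maths" spans the 09:00 and 10:00 rows.
html = '''
<table>
<tr><th>Time</th><th>Mon</th><th>Tue</th></tr>
<tr><td>09:00</td><td rowspan="2">Maths</td><td>Physics</td></tr>
<tr><td>10:00</td><td>Chemistry</td></tr>
</table>
'''
grid = table_to_grid(BeautifulSoup(html, 'html.parser').table)
for line in grid:
    print(line)
```

Repeating the spanned text in every covered row is a design choice; storing `None` or a marker there instead would work equally well if the caller needs to tell original cells from filled ones.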

html python html-table beautifulsoup python-3.x

iSe*_*els

2016 09-12

25
Score

1
Answers

1562
Views

Python BeautifulSoup: wildcard attribute/id search

I have this:

dates = soup.findAll("div", {"id" : "date"})


However, I need the id to be a wildcard search, since the id can be date_1, date_2, and so on.
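Beautiful Soup accepts a compiled regular expression wherever it accepts an attribute value, which covers this kind of wildcard. A minimal sketch using the bs4 name `find_all` (the newer spelling of `findAll`); the markup is made up, and only the `date_N` pattern comes from the question:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; "update" is included to show why the pattern is anchored.
html = '''
<div id="date_1">Jan 1</div>
<div id="date_2">Jan 2</div>
<div id="update">not a date</div>
<div id="footer">footer</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# The regex is applied with search(), so ^ and $ are needed to avoid
# matching ids that merely contain "date".
dates = soup.find_all('div', id=re.compile(r'^date_\d+$'))
print([d['id'] for d in dates])   # ['date_1', 'date_2']
```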

python beautifulsoup

use*_*003

2013 01-10

24
Score

1
Answers

20k
Views

How can I insert a new tag into a BeautifulSoup object?

I'm trying to get my head around HTML construction with BS.

I'm trying to insert a new tag:

self.new_soup.body.insert(3, """<div id="file_history"></div>""")


When I check the result, I get:

&lt;div id="file_history"&gt;&lt;/div&gt;


So I'm inserting a string that is being sanitised into web-safe HTML.

What I expect to see is:

<div id="file_history"></div>


How can I insert a new div tag with the id file_history at position 3?
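A string passed to `insert()` is treated as text, which is why it comes back escaped. In bs4, one way around this is to build a real tag with `new_tag()` and insert that instead. A minimal sketch, with a made-up document in place of `self.new_soup`:

```python
from bs4 import BeautifulSoup

# Hypothetical document with enough children for position 3 to be meaningful.
soup = BeautifulSoup('<body><p>a</p><p>b</p><p>c</p><p>d</p></body>', 'html.parser')

# Build a Tag object rather than a string; keyword arguments become attributes.
new_div = soup.new_tag('div', id='file_history')

# Position 3 counts the body's existing children from zero.
soup.body.insert(3, new_div)

print(soup)
```

Note that whitespace between tags counts as a child too, so in a pretty-printed document position 3 may not be the fourth element.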

python beautifulsoup

Jay*_*uso

2016 02-01

24
Score

3
Answers

30k
Views

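How to remove a div with a particular class using BeautifulSoup

I want to remove a specific div from the soup object.
I am using Python 2.7 and bs4.

According to the documentation, we can use div.decompose().

But that would delete all the divs. How can I delete only a div with a specific class?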
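`decompose()` acts on whichever tag it is called on, so the selection can be narrowed first with `find_all()` and the `class_` keyword. A minimal sketch; the markup and the class name `ads` are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; "ads" is an invented class name.
html = ('<div class="ads">buy now</div>'
        '<div class="content">keep me</div>'
        '<div class="ads big">more</div>')
soup = BeautifulSoup(html, 'html.parser')

# class_ matches any div whose class list contains "ads";
# decompose() removes each match from the tree and destroys it.
for div in soup.find_all('div', class_='ads'):
    div.decompose()

print(soup)
```

Because `class` is multi-valued, the div with `class="ads big"` is removed as well; matching the exact class string would need a stricter filter.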

python beautifulsoup python-2.7

Rik*_*hah

2015 08-18

24
Score

3
Answers

20k
Views

Tag statistics

beautifulsoup ×10

python ×10

html ×2

python-3.x ×2

unicode ×2

utf-8 ×2

html-table ×1

parsing ×1

python-2.7 ×1

urllib2 ×1
