Related questions and solutions (0)

Decode HTML entities in a Python string?

I'm using Beautiful Soup 3 to parse some HTML, but it contains HTML entities that Beautiful Soup 3 doesn't automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string

>>> print text
&pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?
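
One possible approach, sketched under the assumption of Python 3.4 or later (the question itself uses Python 2 with Beautiful Soup 3): the standard library's `html.unescape` converts named and numeric character references in a string into the corresponding characters. The variable names are illustrative only.

```python
import html

# The undecoded string, as Beautiful Soup 3 returned it in the question.
raw = "&pound;682m"

# html.unescape replaces character references such as &pound; or &#163;
# with the characters they stand for.
decoded = html.unescape(raw)

print(decoded)
```

This works on any plain string, so it can be applied to the value of `soup.find("p").string` whichever parser produced it.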

html python html-entities

jkp*_*jkp

2015 11-29

239
votes

4
answers

200k
views

BeautifulSoup innerhtml?

Suppose I have a page with a div. I can easily get that div with soup.find().

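Now that I have the result, I'd like to print the whole innerhtml of that div: I mean, I need a string with all the html tags and text together, exactly like the string I'd get in javascript with obj.innerHTML. Is this possible?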
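
A minimal sketch of one way to do this, assuming Beautiful Soup 4 (the `bs4` package) and Python 3 rather than whatever version the asker has installed: `Tag.decode_contents()` renders a tag's children as a string without the tag itself. The sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup; "html.parser" is the parser bundled with Python.
markup = '<div id="box"><p>Hello <b>world</b></p> tail</div>'
soup = BeautifulSoup(markup, "html.parser")

div = soup.find("div")

# decode_contents() serializes only the children of the tag,
# which is roughly what obj.innerHTML gives in JavaScript.
inner_html = div.decode_contents()

print(inner_html)
```

By contrast, `str(div)` would include the surrounding `<div>` tag, closer to `outerHTML`.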

html python beautifulsoup innerhtml

Mat*_*nti

2011 11-14

39
votes

3
answers

30k
views

How to prevent BeautifulSoup4 from adding extra <html> <body> tags to the soup?

In BeautifulSoup version 3 and earlier, I could take any chunk of HTML and get a string representation this way:

from BeautifulSoup import BeautifulSoup
soup3 = BeautifulSoup('<div><b>soup 3</b></div>')
print unicode(soup3)
    '<div><b>soup 3</b></div>'

But with BeautifulSoup4, the same operation creates additional tags:

from bs4 import BeautifulSoup
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
print unicode(soup4)
    '<html><body><div><b>soup 4</b></div></body></html>'
     ^^^^^^^^^^^^                        ^^^^^^^^^^^^^^

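I don't need the outer <html><body>..</body></html> tags that BS4 adds. I have looked through the BS4 docs and searched inside the class, but could not find any setting that suppresses the extra tags in the output. How do I do it? Downgrading to v3 is not an option, since the SGML parser used in BS3 is not nearly as good as the lxml or html5lib parsers that are available with BS4.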
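
One option worth trying, sketched here under the assumption of Python 3 and the `bs4` package: the wrapper tags come from the parser rather than from Beautiful Soup itself, and Python's built-in `html.parser` does not add them. It is a different parser from the lxml and html5lib ones the asker prefers, so whether its leniency is acceptable is a judgment call.

```python
from bs4 import BeautifulSoup

# Naming the parser explicitly; "html.parser" treats the input as a
# fragment and does not wrap it in <html><body>.
soup4 = BeautifulSoup("<div><b>soup 4</b></div>", "html.parser")

fragment = str(soup4)
print(fragment)
```

If lxml has to be kept, a possible alternative is to serialize only the children of the generated body with `soup.body.decode_contents()`.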

python beautifulsoup

ccp*_*zza

2018 03-11

16
votes

1
answer

3727
views

Wrap subsections of text with tags in BeautifulSoup

I want the BeautifulSoup equivalent of this jQuery question.

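I'd like to find a specific regex match in BeautifulSoup text and then replace that segment of text with a wrapped version. I can do this with plaintext wrapping: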

# replace all words ending in "ug" wrapped in quotes,
# with "ug" replaced with "ook"

>>> soup = BeautifulSoup("Snug as a bug in a rug")
>>> soup
<html><body><p>Snug as a bug in a rug</p></body></html>
>>> for text in soup.findAll(text=True):
...   if re.search(r'ug\b',text):
...     text.replaceWith(re.sub(r'(\w*)ug\b',r'"\1ook"',text))
...
u'Snug as a bug in a rug'
>>> soup
<html><body><p>"Snook" as a "book" in a "rook"</p></body></html>

But what if I want bold instead of quotes? For example, desired result =

<html><body><p><b>Snook</b> as a <b>book</b> in a <b>rook</b></p></body></html>
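
A rough sketch of how this could be done, assuming Python 3 and Beautiful Soup 4 with the built-in `html.parser` (so the output has no `<html><body>` wrapper): each matching text node is split around the regex matches, and the matched parts are rebuilt as `<b>` tags created with `soup.new_tag`. The helper name `wrap_matches_in_bold` is hypothetical, not part of any library.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Same pattern as in the question: words ending in "ug".
PATTERN = re.compile(r"(\w*)ug\b")

def wrap_matches_in_bold(soup):
    """Hypothetical helper: replace each match with <b>...ook</b>."""
    for node in soup.find_all(string=True):
        text = str(node)
        if not PATTERN.search(text):
            continue
        pieces = []
        last = 0
        for m in PATTERN.finditer(text):
            # Plain text between the previous match and this one.
            if m.start() > last:
                pieces.append(NavigableString(text[last:m.start()]))
            bold = soup.new_tag("b")
            bold.string = m.group(1) + "ook"
            pieces.append(bold)
            last = m.end()
        # Plain text after the final match.
        if last < len(text):
            pieces.append(NavigableString(text[last:]))
        # Put the new pieces in front of the old node, then drop it.
        for piece in pieces:
            node.insert_before(piece)
        node.extract()
    return soup

soup = BeautifulSoup("<p>Snug as a bug in a rug</p>", "html.parser")
wrap_matches_in_bold(soup)
result = str(soup)
print(result)
```

Building real tag objects instead of substituting a string matters here, because markup placed inside a text node would be escaped on output.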

html python regex beautifulsoup

Jas*_*n S

2017 05-23

6
votes

2
answers

1336
views

Making external links open in an annoying new window

I recently implemented adding target="_blank" to external links like this:

@hooks.register('after_edit_page')
def do_after_page_edit(request, page):
    if hasattr(page, "body"):
        soup = BeautifulSoup(page.body)
        for a in soup.findAll('a'):
            if hasattr(a, "href"):
                a["target"] = "_blank"
        page.body = str(soup)
        page.body = page.body.replace("<html><head></head><body>", "")
        page.body = page.body.replace("</body></html>", "")
        page.body = page.body.replace("></embed>", "/>")
        page.save()

@hooks.register('construct_whitelister_element_rules')
def whitelister_element_rules():
    return {
        'a': attribute_rule({'href': check_url, 'target': True}),
    }

Problem: