Extracting the URL from an inline style's background-image with BeautifulSoup and without regex?

Gra*_*rus 0 python string beautifulsoup web-scraping

I have:

<div class="image" style="background-image: url('/uploads/images/players/16113-1399107741.jpeg');"

I want to get the URL, but I can't find a way to do it without using a regular expression. Is it possible?

So far, my regex-based solution is:

url = re.findall(r"\('(.*?)'\)", soup['style'])[0]

mha*_*wke 8

You could try the cssutils package. Something like this should work:

import cssutils
from bs4 import BeautifulSoup

html = """<div class="image" style="background-image: url('/uploads/images/players/16113-1399107741.jpeg');" />"""
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning
div_style = soup.find('div')['style']
style = cssutils.parseStyle(div_style)
url = style['background-image']

>>> url
u'url(/uploads/images/players/16113-1399107741.jpeg)'
>>> url = url.replace('url(', '').replace(')', '')    # or regex/split/find/slice etc.
>>> url
u'/uploads/images/players/16113-1399107741.jpeg'
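One caveat with the chained `replace` calls above: they would also remove any `)` that happens to appear inside the path itself. As a minimal sketch of the "split/find/slice" alternative mentioned in the comment, the helper below strips only the outer `url(` ... `)` wrapper and one pair of surrounding quotes, using plain string methods. The name `unwrap_css_url` is hypothetical and not part of cssutils or BeautifulSoup. The helper assumes a single, lowercase `url()` token and does not handle CSS escape sequences.

```python
def unwrap_css_url(value):
    """Return the URL inside a CSS url(...) token, or None if there is none.

    Hypothetical helper for illustration only: it assumes one lowercase
    url() token in the value and does not process CSS escapes.
    """
    start = value.find("url(")
    end = value.rfind(")")
    if start == -1 or end < start:
        return None
    # Slice out everything between "url(" and the last ")".
    inner = value[start + len("url("):end].strip()
    # Drop one pair of matching quotes, if present.
    if len(inner) >= 2 and inner[0] == inner[-1] and inner[0] in "'\"":
        inner = inner[1:-1]
    return inner


print(unwrap_css_url("url(/uploads/images/players/16113-1399107741.jpeg)"))
print(unwrap_css_url("url('/img/photo (1).jpeg')"))
```

Using `rfind` for the closing parenthesis keeps any parentheses inside the path intact, which the `replace` approach would not.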

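Although you will still end up having to parse the actual URL out of the value, this approach should be more resilient to changes in the HTML. If you really dislike string manipulation and regexes, you can pull the URL out in this roundabout way: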

sheet = cssutils.css.CSSStyleSheet()
sheet.add("dummy_selector { %s }" % div_style)
url = list(cssutils.getUrls(sheet))[0]
>>> url
u'/uploads/images/players/16113-1399107741.jpeg'