Gra*_*rus 0 python string beautifulsoup web-scraping
I have:
<div class="image" style="background-image: url('/uploads/images/players/16113-1399107741.jpeg');"
I want to get the URL, but I can't work out how to do it without a regular expression. Is that possible?
My regex-based solution so far is:
import re
url = re.findall(r"\('(.*?)'\)", soup['style'])[0]
You could try the cssutils package. Something like this should work:
import cssutils
from bs4 import BeautifulSoup
html = """<div class="image" style="background-image: url('/uploads/images/players/16113-1399107741.jpeg');" />"""
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
div_style = soup.find('div')['style']
style = cssutils.parseStyle(div_style)
url = style['background-image']
>>> url
u'url(/uploads/images/players/16113-1399107741.jpeg)'
>>> url = url.replace('url(', '').replace(')', '') # or regex/split/find/slice etc.
>>> url
u'/uploads/images/players/16113-1399107741.jpeg'
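You will still have to pull the actual URL out of the url(...) value in the end, but this approach should be more resilient to changes in the HTML. If you really dislike string manipulation and regular expressions, you can extract the URL in this roundabout way: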
sheet = cssutils.css.CSSStyleSheet()
sheet.add("dummy_selector { %s }" % div_style)
url = list(cssutils.getUrls(sheet))[0]
>>> url
u'/uploads/images/players/16113-1399107741.jpeg'
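The "split/find/slice" cleanup mentioned in the comment above can be sketched with the standard library alone. This is only an illustration: extract_css_url is a hypothetical helper name, not part of cssutils or BeautifulSoup, and it assumes you only care about the first url(...) token and that the value contains no escaped quotes.

```python
def extract_css_url(value):
    """Return the target of the first CSS url(...) token in value, or None.

    Hypothetical helper for illustration; assumes no escaped quotes.
    """
    # Keep only what follows the first "url(".
    _, found, rest = value.partition("url(")
    if not found:
        return None
    rest = rest.lstrip()
    # Quoted form: read up to the matching closing quote, so a ")"
    # inside the quoted URL does not cut the result short.
    if rest[:1] in ("'", '"'):
        inner, found, _ = rest[1:].partition(rest[0])
        return inner if found else None
    # Unquoted form: read up to the first ")".
    inner, found, _ = rest.partition(")")
    return inner.strip() if found else None


# Works on the raw style attribute as well as on the value cssutils returns.
raw = "background-image: url('/uploads/images/players/16113-1399107741.jpeg');"
parsed = "url(/uploads/images/players/16113-1399107741.jpeg)"
print(extract_css_url(raw))
print(extract_css_url(parsed))
```

Unlike chaining replace('url(', '').replace(')', ''), this leaves a ")" that is part of a quoted URL untouched, and it returns None instead of a mangled string when no url(...) is present.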