<div class="someClass">
<a href="href">
<img alt="some" src="some"/>
</a>
</div>
I'm using bs4 and I cannot get the src with a.attrs['src'], although I can get the href. What should I do?
Abu*_*oeb 24
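You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tags, but the same approach also works for a URL if you use urllib2.
For a URL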
# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2

page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')

for image in images:
    # print image source
    print image['src']
    # print alternate text
    print image['alt']
For text containing img tags
# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup as BSHTML

# Each img tag is closed properly so the parser sees two separate tags
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText)
images = soup.findAll('img')

for image in images:
    print image['src']
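As a related sketch (not part of the original answer), the same extraction can be written for bs4 on Python 3 so that tags without a src are skipped at search time. Passing src=True to find_all matches only tags that carry that attribute. The sample markup and the name html_text are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration: the second img has no src attribute.
html_text = """
<img src="https://src1.com/" alt="first" />
<img alt="no source" />
<img src="https://src2.com/" />
"""

soup = BeautifulSoup(html_text, "html.parser")

# src=True keeps only the img tags that actually have a src attribute.
sources = [img["src"] for img in soup.find_all("img", src=True)]
print(sources)  # ['https://src1.com/', 'https://src2.com/']
```

Filtering in the search itself means the loop body never has to guard against a missing attribute.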
小智 8
Here is a solution that does not raise a KeyError when an img tag has no src attribute:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "[insert name of the site]"
html = urlopen(site)
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img')

for img in images:
    # Only read src when the attribute is present
    if img.has_attr('src'):
        print(img['src'])
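An alternative to has_attr, sketched here under the assumption that you also want the alt text: a tag's get method returns None (or a default you supply) instead of raising KeyError when the attribute is missing. The markup and the variable names below are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one img lacks src, another lacks alt.
html = """
<img src="a.png" alt="first">
<img alt="no source here">
<img src="b.png">
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for img in soup.find_all("img"):
    src = img.get("src")  # None when the attribute is absent
    if src is None:
        continue
    # Fall back to an empty string when alt is missing
    pairs.append((src, img.get("alt", "")))

print(pairs)  # [('a.png', 'first'), ('b.png', '')]
```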
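You can use Beautiful Soup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tag itself, but this also works for a URL, together with urllib2.
The solution in Abu Shoeb's answer no longer works with Python 3. Here is a corrected implementation: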
For a URL
from bs4 import BeautifulSoup as BSHTML
import urllib3

http = urllib3.PoolManager()
url = 'your_url'

response = http.request('GET', url)
soup = BSHTML(response.data, "html.parser")
images = soup.findAll('img')

for image in images:
    # Print image source
    print(image['src'])
    # Print alternate text
    print(image['alt'])
For text containing img tags
from bs4 import BeautifulSoup as BSHTML

# Each img tag is closed properly so the parser sees two separate tags
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
# Naming the parser explicitly avoids bs4's "no parser specified" warning
soup = BSHTML(htmlText, "html.parser")
images = soup.findAll('img')

for image in images:
    print(image['src'])
The link (the a tag) has no src attribute. You have to target the attribute on the actual img tag.
import bs4

html = """<div class="someClass">
<a href="href">
<img alt="some" src="some"/>
</a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Returns the src attribute of the img tag nested inside the first 'a' tag
print(soup.a.img['src'])  # prints: some

# If you have more than one 'a' tag
for a in soup.find_all('a'):
    if a.img:
        print(a.img['src'])  # prints: some
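To make the difference concrete, here is a small sketch using the markup from the question. It shows that the a tag exposes href but not src, and then collects the nested src values with a CSS selector. The selector string and the variable names are illustrative choices, not something the original answer used.

```python
from bs4 import BeautifulSoup

html = """<div class="someClass">
<a href="href">
<img alt="some" src="some"/>
</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

# The a tag carries href, but src lives on the nested img tag.
print(link.get("href"))  # href
print(link.get("src"))   # None

# Select every img with a src attribute that sits inside an a tag in the div.
nested_sources = [img["src"] for img in soup.select("div.someClass a img[src]")]
print(nested_sources)  # ['some']
```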