Extracting the `src` attribute from an `img` tag using BeautifulSoup

Question

<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>


I'm using bs4, and I can't get the src with a.attrs['src'], although I can get the href. What should I do?

Answer 1

Abu*_*oeb 24

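You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tags themselves, but the same approach also works for a URL, using urllib2.

For a URL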

# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2

page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
    # Print the image source
    print image['src']
    # Print the alternate text
    print image['alt']


For text containing img tags

from BeautifulSoup import BeautifulSoup as BSHTML
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
    print image['src']


Answer 2

小智 8

Here is a solution that does not raise a KeyError when an img tag has no src attribute:

from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "[insert name of the site]"
html = urlopen(site)
bs = BeautifulSoup(html, 'html.parser')

images = bs.find_all('img')
for img in images:
    if img.has_attr('src'):
        print(img['src'])

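The has_attr check above can also be pushed into the query itself. The following is a minimal sketch, assuming bs4 with its default CSS selector support is installed; the sample markup and variable names are made up for illustration and are not from the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the second img tag has no src attribute.
html = """
<img alt="first" src="https://src1.com/a.png"/>
<img alt="no source"/>
<img alt="third" src="https://src2.com/b.png"/>
"""

soup = BeautifulSoup(html, "html.parser")

# Filter after the fact, as in the answer above.
filtered = [img["src"] for img in soup.find_all("img") if img.has_attr("src")]

# Ask find_all for img tags that have a src attribute at all.
by_keyword = [img["src"] for img in soup.find_all("img", src=True)]

# The CSS attribute selector img[src] expresses the same condition.
by_selector = [img["src"] for img in soup.select("img[src]")]

print(filtered)
print(by_keyword)
print(by_selector)
```

All three lists should contain the same two URLs; which form to prefer is mostly a matter of taste.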

Answer 3

Gra*_*ray 7

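"You can use Beautiful Soup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tag itself, but this can also be used for a URL, along with urllib2."

The solution in Abu Shoeb's answer no longer works with Python 3. This is the correct implementation: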

For a URL

from bs4 import BeautifulSoup as BSHTML
import urllib3

http = urllib3.PoolManager()
url = 'your_url'

response = http.request('GET', url)
soup = BSHTML(response.data, "html.parser")
images = soup.find_all('img')

for image in images:
    # Print image source
    print(image['src'])
    # Print alternate text
    print(image['alt'])


For text containing img tags

from bs4 import BeautifulSoup as BSHTML

htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText, "html.parser")
images = soup.find_all('img')
for image in images:
    print(image['src'])

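Indexing with image['src'] or image['alt'] raises a KeyError when a tag lacks that attribute, which is common for alt on real pages. Here is a minimal sketch of a more tolerant variant, assuming bs4 is installed; the helper name image_attributes and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def image_attributes(markup):
    """Return a (src, alt) tuple per img tag; a missing attribute becomes None."""
    soup = BeautifulSoup(markup, "html.parser")
    # Tag.get works like dict.get: no KeyError when the attribute is absent.
    return [(img.get("src"), img.get("alt")) for img in soup.find_all("img")]


# Hypothetical sample: the second img tag has no alt attribute.
html_text = '<img src="https://src1.com/" alt="first"/> <img src="https://src2.com/"/>'

pairs = image_attributes(html_text)
for src, alt in pairs:
    print(src, alt)
```

The same function could be fed response.data from the URL example, since BeautifulSoup accepts both strings and bytes.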

Answer 4

mx0*_*mx0 6

A link (the a tag) has no src attribute; you have to target the actual img tag.

import bs4

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# This returns the src attribute of the img tag nested inside the 'a' tag
print(soup.a.img['src'])  # prints: some

# If you have more than one 'a' tag
for a in soup.find_all('a'):
    if a.img:
        print(a.img['src'])  # prints: some

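To see why a.attrs['src'] fails in the question, it can help to look at what the a tag actually holds. This is a minimal sketch reusing the question's markup and assuming bs4 is installed; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")

# attrs holds only the a tag's own attributes, so 'src' is not a key here.
link_attrs = dict(a.attrs)
print(link_attrs)

# Descend to the nested img tag before asking for src.
img = a.find("img")
img_src = img["src"] if img is not None else None

# The same lookup as a CSS selector scoped to the div from the question.
selected = soup.select_one("div.someClass a img")
selected_src = selected.get("src") if selected is not None else None

print(img_src, selected_src)
```

The explicit None checks keep the lookup from failing on links that do not wrap an image.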

Archived: 8 years, 8 months ago
Views: 27273
Last activity: 6 years, 10 months ago