Mun*_*b K 3 html python beautifulsoup html-parsing python-3.x
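This is a project for downloading images, audio, video, and so on. On some websites, however, the pages contain only relative paths rather than full links, and I don't know how to turn those relative links into full URLs.
My whole project is at: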
https://github.com/MuneebKalathil/MaD
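Here is a sample link; I want to download all the images from it. The page shows thumbnails, but those are not the images I want. Clicking a thumbnail opens the page for the original image, and that is the image I want to download.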
http://www.ragalahari.com/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars.aspx
Part of the page source:
<tr>
<td id='pagingCell'>
</td>
</tr>
<tr>
<td align='center'><div id='galdiv' style='float:center;margin-right:3px;;margin-bottom:3px'>
<a href='/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars/image1.aspx' ><img src="http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham1t.jpg" alt="Kajal Aggarwal" title="Kajal Aggarwal at Dine with Stars Memu Saitham"></a>
I want to first get the relative link address:
/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars/image1.aspx
and then find its absolute URL.
Define a base URL, find all img tags, and if the src attribute value does not start with http, use urllib.parse.urljoin() to join the base URL and the src.
Example, using requests and BeautifulSoup:
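from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.ragalahari.com'
url = 'http://www.ragalahari.com/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars.aspx'

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Only consider img tags that actually have a src attribute
for img in soup.find_all('img', src=True):
    src = img.get('src')
    # Relative paths are joined onto the base URL
    if not src.startswith('http'):
        src = urljoin(base_url, src)
    print(src)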
Prints:
http://icdn.raagalahari.com/images/ragalaharilogo.png
http://www.ragalahari.com/images/helpicon.png
http://www.ragalahari.com/images/rssicon.png
http://www.ragalahari.com/images/twittericon.png
http://www.ragalahari.com/images/facebookicon.png
http://www.ragalahari.com/images/searchicon.png
http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham1t.jpg
http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham2t.jpg
http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham3t.jpg
http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham4t.jpg
...
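As a side note, urljoin() returns an already absolute URL unchanged, so the startswith('http') check is optional, and passing the page URL as the base also resolves paths relative to the current document. A minimal sketch of that behaviour follows; the image2.aspx and //imgcdn.raagalahari.com/a.jpg inputs are made-up examples, not values taken from the page:

```python
from urllib.parse import urljoin

page_url = ('http://www.ragalahari.com/actress/14035/'
            'kajal-aggarwal-at-memu-saitham-dinner-with-stars.aspx')

# Root-relative path (the href from the question): only scheme and host are kept
root_relative = urljoin(
    page_url,
    '/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars/image1.aspx')

# Already absolute URL: returned unchanged, the base is ignored
already_absolute = urljoin(page_url, 'http://www.ragalahari.com/images/rssicon.png')

# Hypothetical document-relative path: replaces the last path segment of the base
doc_relative = urljoin(page_url, 'image2.aspx')

# Hypothetical protocol-relative URL: inherits only the scheme of the base
protocol_relative = urljoin(page_url, '//imgcdn.raagalahari.com/a.jpg')

for resolved in (root_relative, already_absolute, doc_relative, protocol_relative):
    print(resolved)
```

Using the page URL rather than the bare site root as the base matters only for document-relative paths; for root-relative hrefs like the one in the question, both give the same result.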
Update (code snippet for getting the links from the a tags):
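# Select the anchors inside the gallery div that wrap each thumbnail
for a in soup.select('div#galdiv a'):
    link = a.get('href')
    # Relative hrefs are joined onto the base URL
    if not link.startswith('http'):
        link = urljoin(base_url, link)
    print(link)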
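The same idea can be tried offline against the HTML fragment from the question. This is only a sketch: the gallery_links helper name is hypothetical, the fragment is a trimmed copy of the source shown above, and the a[href] selector is an added guard that skips anchors without an href.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = ('http://www.ragalahari.com/actress/14035/'
            'kajal-aggarwal-at-memu-saitham-dinner-with-stars.aspx')

# Trimmed copy of the markup quoted in the question
SAMPLE_HTML = """
<table><tr>
<td align='center'><div id='galdiv'>
<a href='/actress/14035/kajal-aggarwal-at-memu-saitham-dinner-with-stars/image1.aspx'><img src="http://imgcdn.raagalahari.com/nov2014/starzone/kajal-agarwal-memu-saitham/kajal-agarwal-memu-saitham1t.jpg" alt="Kajal Aggarwal"></a>
</div></td>
</tr></table>
"""


def gallery_links(html, page_url):
    """Return absolute, de-duplicated URLs of the links inside div#galdiv."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for a in soup.select('div#galdiv a[href]'):
        absolute = urljoin(page_url, a['href'])
        if absolute not in links:  # keep first occurrence, preserve page order
            links.append(absolute)
    return links


links = gallery_links(SAMPLE_HTML, PAGE_URL)
print(links)
```

Each returned URL is the page for one original image; that page would still have to be fetched and parsed to find the full-size image itself, and its structure is not shown in the question.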