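I need help converting relative URLs to absolute URLs in a Scrapy spider. I need to convert the links on my start pages to absolute URLs so that I can get the images of the crawled items, which are on the start pages. I have tried several approaches without success and I'm stuck. Any suggestions?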
import scrapy
from scrapy import Request


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/billboard",
        "http://www.example.com/billboard?page=1",
    ]

    def parse(self, response):
        image_urls = response.xpath('//div[@class="content"]/section[2]/div[2]/div/div/div/a/article/img/@src').extract()
        relative_url = response.xpath(u'''//div[contains(concat(" ", normalize-space(@class), " "), " content ")]/a/@href''').extract()
        # absolute_urls is not defined yet: it should hold the entries of
        # relative_url converted to absolute URLs, which is what I am asking about.
        for image_url, url in zip(image_urls, absolute_urls):
            item = ExampleItem()
            item['image_urls'] = image_urls
            request = Request(url, callback=self.parse_dir_contents)
            request.meta['item'] = item
            yield request
Pau*_*ira 13
There are three main ways to achieve this:
Use the urljoin function from urllib:
from urllib.parse import urljoin
# Same as: from w3lib.url import urljoin
url = urljoin(base_url, relative_url)
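To make the behaviour of urljoin concrete, here is a minimal sketch using only the standard library. The base URL comes from the question's start_urls; the href values are made up for illustration and are not taken from the real page.

```python
from urllib.parse import urljoin

# Page URL taken from the question's start_urls.
base_url = "http://www.example.com/billboard?page=1"

# Hypothetical href values (assumptions, not from the real site).
relative_urls = [
    "/items/42",                   # root-relative path
    "items/42",                    # path relative to the current page
    "?page=2",                     # query-only reference
    "http://other.example.com/x",  # already absolute, returned unchanged
]

# Resolve every href against the page it was found on.
absolute_urls = [urljoin(base_url, href) for href in relative_urls]

for href, absolute in zip(relative_urls, absolute_urls):
    print(f"{href!r} -> {absolute}")
```

Note that an href that is already absolute passes through untouched, so it is safe to apply urljoin to every extracted link without checking it first.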
Use the response's urljoin wrapper method, as mentioned by Steve.
url = response.urljoin(relative_url)
If you also want to yield a request from that link, you can use the response's handy follow method:
# It will create a new request using the above "urljoin" method
yield response.follow(relative_url, callback=self.parse)