I'm using Scrapy to crawl a multilingual website. Each object exists in three different language versions, and I'm using the site search as a starting point. Unfortunately, the search results contain URLs in all languages, which causes problems when parsing.
So I'd like to preprocess the URLs before they are requested: if a URL contains a specific string, I want to replace that part of it.
My spider extends CrawlSpider. I looked through the documentation and found the make_requests_from_url(url) method, which led to this attempt:
def make_requests_from_url(self, url):
    """
    Override the original method to make sure only German URLs are
    used. If French or Italian URLs are detected, they are
    rewritten.
    """
    if '/f/suche' in url:
        self.log('French URL was rewritten: %s' % url)
        url = url.replace('/f/suche/pages/', '/d/suche/seiten/')
    elif '/i/suche' in url:
        self.log('Italian URL was rewritten: %s' % url)
        url = url.replace('/i/suche/pagine/', '/d/suche/seiten/')
    return super(MyMultilingualSpider, self).make_requests_from_url(url)
But for some reason this doesn't work. What is the best way to rewrite URLs before they are requested? Maybe via a rule callback?
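Whatever hook ends up doing the rewriting, the replacement logic itself can be checked in isolation first. A minimal, Scrapy-free sketch (the helper name rewrite_to_german is hypothetical; the path fragments are the ones from the snippet above):

```python
# Map of foreign search-path fragments to their German equivalent,
# taken from the URL patterns in the question.
REWRITES = {
    '/f/suche/pages/': '/d/suche/seiten/',   # French -> German
    '/i/suche/pagine/': '/d/suche/seiten/',  # Italian -> German
}

def rewrite_to_german(url):
    """Return the German form of a French/Italian search URL, or the URL unchanged."""
    for fragment, german in REWRITES.items():
        if fragment in url:
            return url.replace(fragment, german)
    return url
```

Keeping the mapping in one table makes it easy to unit-test before wiring it into make_requests_from_url or a link-processing hook.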
Probably worth an example, since it took me about 30 minutes to figure this out:
rules = [
    Rule(SgmlLinkExtractor(allow = (all_subdomains,)), callback='parse_item', process_links='process_links')
]

def process_links(self, links):
    for link in links:
        link.url = "something_to_prepend%ssomething_to_append" % link.url
    return links
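The same hook can carry the language rewrite from the question: process_links receives the extracted Link objects and may mutate their url attribute before any requests are scheduled. A sketch under that assumption, using a minimal stand-in class so it runs without Scrapy installed:

```python
class Link:
    # Minimal stand-in for scrapy.link.Link; only the url attribute matters here.
    def __init__(self, url):
        self.url = url

def process_links(links):
    # Rewrite French and Italian search URLs to the German path before
    # the spider schedules any requests; other URLs pass through unchanged.
    for link in links:
        link.url = (link.url
                    .replace('/f/suche/pages/', '/d/suche/seiten/')
                    .replace('/i/suche/pagine/', '/d/suche/seiten/'))
    return links
```

In a real spider this would be a method (taking self) referenced by name in the Rule, exactly as in the answer above.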