I want to get all external links from a given website using Scrapy. With the following code the spider ends up crawling the external links as well:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from myproject.items import someItem

class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['someurl.com']
    start_urls = ['http://www.someurl.com/']

    rules = (Rule(LinkExtractor(), callback="parse_obj", follow=True),)

    def parse_obj(self, response):
        item = someItem()
        item['url'] = response.url
        return item
What am I missing? Does "allowed_domains" prevent the external links from being crawled? If I set "allow_domains" on the LinkExtractor it does not extract the external links at all. Just to clarify: I want to crawl the internal links but only extract the external links. Any help appreciated!
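One possible way to reconcile the two, sketched below as an assumption rather than a confirmed answer: keep allowed_domains so the offsite middleware stops the spider from actually requesting external URLs, and filter the links yourself inside the callback, yielding an item for every URL that points off-site. The spider name and the plain-dict item are placeholders, and the modern scrapy.spiders / scrapy.linkextractors import paths are used here instead of the deprecated scrapy.contrib ones.

from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExternalLinkSpider(CrawlSpider):
    # Hypothetical spider: follows internal pages only, emits external URLs as items.
    name = 'external_links'
    allowed_domains = ['someurl.com']          # offsite middleware drops requests to other domains
    start_urls = ['http://www.someurl.com/']

    # follow=True keeps the crawl moving through the internal pages
    rules = (Rule(LinkExtractor(), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        # Re-extract every link on the page and keep only those leaving the domain.
        for link in LinkExtractor().extract_links(response):
            netloc = urlparse(link.url).netloc
            if not netloc.endswith('someurl.com'):
                yield {'url': link.url}        # a real project would use someItem here

With this layout the external requests are still filtered out by the offsite middleware, so they are never downloaded; only their URLs end up in the output items.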
I have a stored procedure declared as follows:
CREATE DEFINER=`blabla`@`%` PROCEDURE `getAllDomainsByCountry`(IN dom_id INT)
BEGIN
    SELECT
        domain.id,
        IFNULL(domain.indexed, '-') AS indexed,
        domain.name,
        country.language_code,
        IFNULL(ip_adress.adress, '-') AS adress,
        IFNULL(GROUP_CONCAT(category.name
                SEPARATOR ', '),
            '-') AS categories,
        IFNULL(GROUP_CONCAT(category.id
                SEPARATOR ', '),
            '-') AS categories_id,
        (SELECT
                IFNULL(GROUP_CONCAT(DISTINCT client.name
                        SEPARATOR ', '),
                    '-')
            FROM
                link
                    LEFT JOIN
                client_site ON link.client_site = client_site.id
                    LEFT JOIN
                client ON client.id = client_site.client
            WHERE
                link.from_domain = domain.id) AS clients,
        IFNULL(domain_host.name, '-') AS domain_host_account,
        IFNULL(content_host.name, '-') AS content_host,
        status.id AS status,
        status.name AS status_name
    FROM
        domain
    LEFT …

Is it possible to remove requests from Scrapy's scheduler queue? I have a working routine that limits how long a given domain is crawled. It works in the sense that it stops yielding any new links once the time limit is reached, but since the queue already holds thousands of requests for that domain, I would like to remove those from the scheduler queue as well once the limit is hit.
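Scrapy has no public API for deleting entries that already sit in the scheduler, so one common workaround, sketched below under the assumption that the time-limit routine records the expired domains in a spider attribute, is a downloader middleware that discards those requests with IgnoreRequest as they are dequeued. The middleware name and the expired_domains attribute are illustrative and not part of the original setup.

from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest

class DropExpiredDomainsMiddleware:
    # Downloader middleware: throws away queued requests for domains whose
    # crawl-time budget has run out (hypothetical bookkeeping on the spider).
    def process_request(self, request, spider):
        # `expired_domains` is assumed to be a set maintained by the spider's
        # own time-limit routine; it is not a built-in Scrapy attribute.
        expired = getattr(spider, 'expired_domains', set())
        if urlparse(request.url).netloc in expired:
            # IgnoreRequest stops the download, so the backlog drains without
            # any of those pages actually being fetched.
            raise IgnoreRequest('time limit reached for %s' % request.url)
        return None

The middleware would be enabled via DOWNLOADER_MIDDLEWARES in settings.py; the pending requests still pass through the scheduler, but each one is dropped the moment it is handed to the downloader, which costs far less than fetching it.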