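I have a spider that starts with a small allowed_domains list. I need to add more domains to this whitelist dynamically as the spider proceeds, from within a parser, but the following piece of code does not accomplish that because subsequent requests are still being filtered. Is there another way to update allowed_domains from within the parser?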
class APSpider(BaseSpider):
    name = "APSpider"
    allowed_domains = ["www.somedomain.com"]
    start_urls = [
        "http://www.somedomain.com/list-of-websites",
    ]
    ...
    def parse(self, response):
        soup = BeautifulSoup(response.body)
        for link_tag in soup.findAll('td', {'class': 'half-width'}):
            _website = link_tag.find('a')['href']
            u = urlparse.urlparse(_website)
            self.allowed_domains.append(u.netloc)
            yield Request(url=_website, callback=self.parse_secondary_site)
    ...
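One possible reason the appended domains have no effect can be sketched without Scrapy at all. The snippet below assumes the offsite filter compiles its host pattern once, when the spider opens, and never looks at the list again. `CachedOffsiteFilter`, `build_host_regex`, `should_follow`, and `refresh` are hypothetical names used only for illustration and are not Scrapy's API; `www.example.org` is a made-up domain.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Compile one pattern matching the allowed domains and their subdomains."""
    if not allowed_domains:
        return re.compile("")  # empty whitelist: every host matches
    escaped = "|".join(re.escape(domain) for domain in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % escaped)


class CachedOffsiteFilter:
    """Hypothetical filter that snapshots the whitelist when it is created."""

    def __init__(self, allowed_domains):
        self.allowed_domains = allowed_domains
        # Assumption: the pattern is compiled once and then reused.
        self.host_regex = build_host_regex(allowed_domains)

    def should_follow(self, url):
        host = urlparse(url).hostname or ""
        return bool(self.host_regex.search(host))

    def refresh(self):
        # Rebuild the pattern so later additions to the list take effect.
        self.host_regex = build_host_regex(self.allowed_domains)


allowed = ["www.somedomain.com"]
offsite = CachedOffsiteFilter(allowed)

allowed.append("www.example.org")  # same list object, but the regex is stale
before = offsite.should_follow("http://www.example.org/page")

offsite.refresh()
after = offsite.should_follow("http://www.example.org/page")

print(before, after)
```

If that assumption holds for the Scrapy version in use, appending to the list is not enough on its own and the cached pattern has to be rebuilt. Separately, constructing a request with `dont_filter=True` is a documented way to let an individual request skip the offsite check.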
In Java, constructors cannot be recursive. Compile-time error: "recursive constructor invocation". Let's assume we did not have this restriction.
Things to keep in mind:
Would there be any benefit to allowing recursive constructors?
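The following script returns only 1000 messages from the Sent folder, even though more than 3000 messages have actually been sent.
How can I get the rest of the messages?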
require 'net/imap'
require 'highline/import'  # provides the top-level ask() helper

username = ask("Enter your username: ") { |q| q.echo = true }
password = ask("Enter your password: ") { |q| q.echo = "*" }
look_in_folder = "[Gmail]/Sent Mail"
save_to_folder = "/Users/penang/Desktop"
puts 'Starting...'
imap = Net::IMAP.new('imap.gmail.com', '993', true)
puts "Logging in as #{username} ..."
imap.login(username, password)
imap.examine(look_in_folder)
mails = imap.uid_search(["FROM", "me"])
puts "Found #{mails.count} mail(s) in folder '#{look_in_folder}'"