I wrote a Scrapy spider that has many start_urls and extracts email addresses from those URLs. The script takes a long time to execute, so I want to tell Scrapy to stop crawling a particular site once it finds an email there and move on to the next site.
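One way to get the behaviour described above is per-domain bookkeeping: once a domain yields an email, every later request to that domain is dropped, so the crawl effectively moves on to the next site. (Note that Scrapy's `CloseSpider` exception stops the whole spider, not a single site.) The helper below is a minimal sketch with hypothetical names, not part of the original code; in Scrapy this check would typically live in a spider middleware or at the top of the parse callback.

```python
# Hypothetical helper (not from the original post): remember which domains
# already produced an email, and refuse to crawl them again.
class DomainStopFilter(object):
    def __init__(self):
        self.done = set()  # domains where an email was already found

    def mark_found(self, domain):
        # call this when an email address is extracted from a page on `domain`
        self.done.add(domain)

    def should_crawl(self, domain):
        # drop any request whose domain is already "done"
        return domain not in self.done
```

In a parse callback this would mean checking `should_crawl(urlparse(response.url).hostname)` before yielding further requests, and calling `mark_found` as soon as an email item is produced.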
Edit: added the code
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
import csv
from urlparse import urlparse

from entreprise.items import MailItem

class MailSpider(CrawlSpider):
    name = "mail"
    start_urls = []
    allowed_domains = []
    with open('scraped_data.csv', 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        next(reader)  # skip the header row
        for row in reader:
            url = row[5].strip()
            if url != "":
                start_urls.append(url)
                fragments = urlparse(url).hostname.split(".")
                hostname = ".".join(len(fragments[-2]) < 4 …

I am writing a script that generates a list of millions of items and then builds another list from the first one. It fills up memory very quickly and the script cannot continue. I think it would be a good idea to store the list directly in a file and then loop over the file's lines instead. What is the most efficient way to do this?
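The pattern asked about here can be sketched as two streaming passes: write the first "list" to a file as it is generated, then build the second one by reading that file line by line, so neither list is ever held in memory at once. The file names and the squaring step below are illustrative assumptions, not from the original script.

```python
import tempfile, os

def write_numbers(path, n):
    # first "list": one integer per line, written as generated,
    # never held in memory
    with open(path, "w") as f:
        for i in range(n):
            f.write("%d\n" % i)

def derive_squares(src, dst):
    # second "list": built by streaming the first file line by line
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write("%d\n" % (int(line) ** 2))

tmp = tempfile.mkdtemp()
first = os.path.join(tmp, "first.txt")
second = os.path.join(tmp, "second.txt")
write_numbers(first, 5)
derive_squares(first, second)
with open(second) as f:
    result = [int(line) for line in f]
```

Memory use stays constant regardless of how many lines the files hold, because only one line of each file is in memory at any moment.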
Edit:
I am trying to generate a tree row by row. row5_nodes can reach a million items, and I cannot delete it because I use it to generate row6_nodes.
import random

class Node:
    def __init__(self, id, name, parent=None):
        self.id = id
        self.name = name
        self.parent = parent

def write_roots(root_nodes, roots):
    # create the root row and append each node to data.csv
    global index
    index = 0
    for x in xrange(0, roots):
        node = Node(index, "root" + str(x))
        root_nodes.append(node)
        f.write(str(node.id) + "," + str(node.name) + "," + str(node.parent) + "\n")
        index += 1
    return

def write_row(parent_nodes, new_nodes, children):
    # create `children` child nodes under every parent node
    global index
    for parent_node in parent_nodes:
        for x in xrange(0, children):
            node = Node(index, "cat" + str(parent_node.id) + "-" + str(x), parent_node.id)
            new_nodes.append(node)
            f.write(str(node.id) + "," + str(node.name) + "," + str(node.parent) + "\n")
            index += 1
    return

f = open("data.csv", "wb")
roots = 1000
root_nodes = []
row1_nodes = []
row2_nodes = []
row3_nodes …
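Since each row is already written to data.csv, the previous row does not have to stay in a `row*_nodes` list at all: it can be streamed back from the file with a generator. The sketch below is an assumed approach, not part of the original script; it relies only on the "id,name,parent" line format the code above writes (with root parents serialized as the string "None").

```python
import tempfile, os

def iter_nodes(path):
    # Stream nodes back from the data.csv format written above
    # ("id,name,parent" per line) instead of keeping a row*_nodes
    # list alive; yields one (id, name, parent) tuple at a time.
    with open(path) as f:
        for line in f:
            id_, name, parent = line.rstrip("\n").split(",")
            yield int(id_), name, (None if parent == "None" else int(parent))

# tiny demonstration file in the same format
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w") as f:
    f.write("0,root0,None\n")
    f.write("1000,cat0-0,0\n")
nodes = list(iter_nodes(path))
```

To generate row 6 from row 5, `write_row` would then take `iter_nodes("data.csv")` (filtered to row-5 ids) instead of the in-memory `row5_nodes` list, keeping at most one parent node in memory at a time.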