mar*_*ryo 8 python mysql web-crawler scrapy
I am trying to populate start_urls with a SELECT from a MySQL table in spider.py. When I run "scrapy runspider spider.py" I get no output; it just finishes without errors.
I have already tested the SELECT query in a standalone Python script, and it does fill start_urls with the entries from the MySQL table.
spider.py
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
import MySQLdb


class ProductsSpider(BaseSpider):
    name = "Products"
    allowed_domains = ["test.com"]
    start_urls = []

    def parse(self, response):
        print self.start_urls

    def populate_start_urls(self, url):
        conn = MySQLdb.connect(
            user='user',
            passwd='password',
            db='scrapy',
            host='localhost',
            charset="utf8",
            use_unicode=True
        )
        cursor = conn.cursor()
        cursor.execute('SELECT url FROM links;')
        rows = cursor.fetchall()
        for row in rows:
            start_urls.append(row[0])
        conn.close()
Sha*_*ans 13
A better approach is to override the start_requests method.
It can query your database, just like populate_start_urls does, and return a sequence of Request objects.
You only need to rename the populate_start_urls method to start_requests and modify the following lines:
for row in rows:
    yield self.make_requests_from_url(row[0])
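The database-to-URLs part of that pattern can be sketched end to end. The sketch below uses the standard-library sqlite3 module as a stand-in for MySQLdb so it runs without a MySQL server, and yields plain URL strings where a real start_requests would wrap each one in a Request; the links table and url column are taken from the question:

```python
import sqlite3

def start_urls_from_db(conn):
    # Runs the same SELECT as the question and yields one URL per
    # row. In a real spider, start_requests would yield a Request
    # object for each URL instead of the bare string.
    cursor = conn.cursor()
    cursor.execute('SELECT url FROM links')
    for row in cursor.fetchall():
        yield row[0]

# In-memory table standing in for the MySQL 'links' table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE links (url TEXT)')
conn.executemany('INSERT INTO links (url) VALUES (?)',
                 [('http://test.com/a',), ('http://test.com/b',)])

urls = list(start_urls_from_db(conn))
conn.close()
```

Because start_requests is a generator, the spider begins crawling as URLs stream out of the cursor rather than waiting for the whole result set.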
Or populate start_urls in __init__:
def __init__(self):
    super(ProductsSpider, self).__init__()
    self.start_urls = get_start_urls()
This assumes get_start_urls() returns the URLs.
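As a minimal plain-Python sketch of that constructor pattern (object stands in for BaseSpider so it runs standalone, and get_start_urls is the hypothetical helper from the answer, which a real spider would back with the MySQL SELECT):

```python
def get_start_urls():
    # Hypothetical helper: a real implementation would run
    # 'SELECT url FROM links' against MySQL, as in the question.
    return ['http://test.com/page1', 'http://test.com/page2']

class ProductsSpider(object):
    # Stands in for BaseSpider; only the constructor pattern matters here.
    name = "Products"
    allowed_domains = ["test.com"]

    def __init__(self):
        super(ProductsSpider, self).__init__()
        # Fill start_urls once, at construction time, so the
        # framework's default start_requests can iterate over it.
        self.start_urls = get_start_urls()

spider = ProductsSpider()
```

Note that self.start_urls assigns an instance attribute, so the class-level start_urls = [] in the question is no longer shared between spider instances.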
Views: 3937