Tags: python, mysql, screen-scraping, scrapy
I'm currently working with Scrapy.
I have a list of URLs stored in a MySQL database. The spider visits each of these URLs and captures two pieces of target information (a score and a count). My goal is for Scrapy to fill in the corresponding columns for that URL once it finishes scraping the page, before moving on to the next URL.
I'm new to this, and I can't seem to get the saving part working. The score and count are successfully passed to the database, but they are saved as a new row instead of being associated with the source URL.
Here is my code:
amazon_spider.py
import scrapy
from whatoplaybot.items import crawledScore
import MySQLdb


class amazonSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ["amazon.com"]
    start_urls = []

    def start_requests(self):
        # Read the URLs to crawl from the MySQL database.
        conn = MySQLdb.connect(
            user='root',
            passwd='',
            db='scraper',
            host='127.0.0.1',
            charset="utf8",
            use_unicode=True
        )
        cursor = conn.cursor()
        cursor.execute('SELECT url FROM scraped;')
        rows = cursor.fetchall()
        for row in rows:
            yield self.make_requests_from_url(row[0])
        conn.close()

    def parse(self, response):
        # Extract the review score and count from the page.
        item = crawledScore()
        item['reviewScore'] = response.xpath('//*[@id="avgRating"]/span/a/span/text()').re("[0-9,.]+")[0]
        item['reviewCount'] = response.xpath('//*[@id="summaryStars"]/a/text()').re("[0-9,]+")
        yield item
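As a side note, make_requests_from_url has been deprecated in newer Scrapy releases; yielding scrapy.Request directly is the equivalent and works across versions. A minimal sketch of the loop with that substitution:

        for row in rows:
            # Equivalent to make_requests_from_url, without the deprecated
            # helper; parse is Scrapy's default callback, named for clarity.
            yield scrapy.Request(url=row[0], callback=self.parse)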
pipelines.py
import sys
import MySQLdb


class storeScore(object):
    def __init__(self):
        # Open a connection to the same database the spider reads from.
        self.conn = MySQLdb.connect(
            user='root',
            passwd='',
            db='scraper',
            host='127.0.0.1',
            charset="utf8",
            use_unicode=True
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            self.cursor.execute(
                """INSERT INTO scraped(score, count) VALUES (%s, %s)""",
                (item['reviewScore'], item['reviewCount'])
            )
            self.conn.commit()
        except MySQLdb.Error as e:
            print("Error %d: %s" % (e.args[0], e.args[1]))
        return item
Any help and guidance would be greatly appreciated.
Thank you.
Follow these steps:
Add a reviewURL field to your crawledScore item:
class crawledScore(scrapy.Item):
    reviewScore = scrapy.Field()
    reviewCount = scrapy.Field()
    reviewURL = scrapy.Field()
Save the response URL into item['reviewURL']:
def parse(self, response):
    item = crawledScore()
    item['reviewScore'] = response.xpath('//*[@id="avgRating"]/span/a/span/text()').re("[0-9,.]+")[0]
    item['reviewCount'] = response.xpath('//*[@id="summaryStars"]/a/text()').re("[0-9,]+")
    item['reviewURL'] = response.url
    yield item
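One caveat: response.url reflects any redirects, so it may not string-match the URL row it came from. A way around this (a sketch, not part of the original code) is to carry the database URL in the request's meta dict and read it back in parse:

# In start_requests: attach the exact database URL to the request.
yield scrapy.Request(url=row[0], callback=self.parse,
                     meta={'db_url': row[0]})

# In parse: key the item on the stored URL rather than response.url.
item['reviewURL'] = response.meta['db_url']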
Finally, in your pipelines file, insert or update according to your logic:
Insert:
def process_item(self, item, spider):
    try:
        self.cursor.execute(
            """INSERT INTO scraped(score, count, url) VALUES (%s, %s, %s)""",
            (item['reviewScore'], item['reviewCount'], item['reviewURL'])
        )
        self.conn.commit()
    except MySQLdb.Error as e:
        print("Error %d: %s" % (e.args[0], e.args[1]))
    return item
Update:
def process_item(self, item, spider):
    try:
        self.cursor.execute(
            """UPDATE scraped SET score=%s, count=%s WHERE url=%s""",
            (item['reviewScore'], item['reviewCount'], item['reviewURL'])
        )
        self.conn.commit()
    except MySQLdb.Error as e:
        print("Error %d: %s" % (e.args[0], e.args[1]))
    return item
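If the url column carries a UNIQUE index (an assumption about your schema, not something shown above), MySQL's INSERT ... ON DUPLICATE KEY UPDATE can fold both paths into one statement:

def process_item(self, item, spider):
    try:
        # Inserts a new row for an unseen url, or refreshes score/count
        # for an existing one. Requires a UNIQUE index on scraped.url.
        self.cursor.execute(
            """INSERT INTO scraped(score, count, url) VALUES (%s, %s, %s)
               ON DUPLICATE KEY UPDATE score=VALUES(score), count=VALUES(count)""",
            (item['reviewScore'], item['reviewCount'], item['reviewURL'])
        )
        self.conn.commit()
    except MySQLdb.Error as e:
        print("Error %d: %s" % (e.args[0], e.args[1]))
    return item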
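Whichever variant you choose, the pipeline only runs if it is enabled in the project's settings.py; the module path below is inferred from the whatoplaybot project name and may differ in your layout:

ITEM_PIPELINES = {
    'whatoplaybot.pipelines.storeScore': 300,
}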