I'm writing a crawler to scrape YouTube videos and capture the name, subscriber count, link, and so on. I copied this SQLAlchemy code from a tutorial and got it working, but every time I run the spider I end up with duplicate rows in the database.
How can I check whether the scraped data is already in the database and, if it is, skip inserting it?
Here is my pipeline.py code:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from sqlalchemy.orm import sessionmaker
from models import Channels, db_connect, create_channel_table


class YtscraperPipeline(object):
    """YTscraper pipeline for storing scraped items in the database."""

    def __init__(self):
        # Initializes database connection and sessionmaker.
        # Creates the channels table.
        engine = db_connect()
        create_channel_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """Save youtube channel …