Ali*_*led 2 python scrapy web-scraping
I am trying to scrape a dynamic web page with scrapy and playwright. Both packages are installed, but when I run my spider I get this error:
ImportError: cannot import name 'PageCoroutine' from 'scrapy_playwright.page' (C:\Ali\DataCamp\Web Scraping in Python\Scrapy\venv\lib\site-packages\scrapy_playwright\page.py)
Here is my code (it is only test code):
import scrapy
from scrapy_playwright.page import PageCoroutine


class PwspiderSpider(scrapy.Spider):
    name = 'pwspider'

    def start_requests(self):
        yield scrapy.Request(
            "https://shoppable-campaign-demo.netlify.app/#/",
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_coroutine=[
                    PageCoroutine('wait_for_selector', 'div#productListing')
                ],
            ),
        )

    async def parse(self, response):
        yield {'text': response.text}
I have also added DOWNLOAD_HANDLERS and TWISTED_REACTOR to the settings file.
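For reference, the two settings mentioned above usually look like the following in settings.py. This is a sketch based on the scrapy-playwright README, not the asker's actual file, which was not posted.

```python
# settings.py (sketch; the asker's real settings file was not shown)

# Route both HTTP and HTTPS requests through the Playwright download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

These settings only enable the Playwright handler. They do not affect the ImportError, which is raised when the spider module is imported.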
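PageCoroutine was deprecated and has since been removed from scrapy-playwright, which is why the import fails. Use PageMethod instead, and pass it through the playwright_page_methods meta key (which replaces playwright_page_coroutines).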
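Because this error comes from a mismatch between the tutorial and the installed release, it can help to check which scrapy-playwright version is in the virtual environment. A minimal sketch using only the standard library; installed_version is a hypothetical helper name, and this does not assume which release dropped PageCoroutine.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version found in the active environment, or None if not installed.
print(installed_version("scrapy-playwright"))
```

Comparing the printed version against the scrapy-playwright changelog shows whether the installed release still ships PageCoroutine.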
A working example:
import scrapy
from scrapy_playwright.page import PageMethod


class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        yield scrapy.Request(
            url="https://shoppable-campaign-demo.netlify.app/#/",
            callback=self.parse,
            meta={
                # Render the page with Playwright instead of the default downloader.
                "playwright": True,
                # Wait until the product cards exist before returning the response.
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", '.card-body'),
                ],
            },
        )

    def parse(self, response):
        # Each product card holds its title in an element with class "card-title".
        products = response.xpath('//*[@class="card-body"]')
        for product in products:
            yield {
                'title': product.xpath('.//*[@class="card-title"]/text()').get()
            }
Output:
{'title': 'Oxford Loafers'}
2022-11-05 20:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shoppable-campaign-demo.netlify.app/#/>
{'title': 'Ankle-length Slack'}
2022-11-05 20:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shoppable-campaign-demo.netlify.app/#/>
{'title': 'White Baseball Cap'}
2022-11-05 20:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shoppable-campaign-demo.netlify.app/#/>
{'title': 'Triangle Bikini Top'}
2022-11-05 20:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shoppable-campaign-demo.netlify.app/#/>
{'title': 'Short Blazer'}
2022-11-05 20:40:40 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-05 20:40:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 235,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 39851,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 41.370211,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 5, 14, 40, 40, 261151),
'item_scraped_count': 5,