Is there a Scrapy-like tool for Node.js?

Question

use*_*940 8 javascript scrapy node.js web-scraping cheerio

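I'd like to know whether there is something like Scrapy for Node.js. If not, what do you think of simply downloading the page and parsing it with cheerio? Is there a better way?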

Answer 1

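Scrapy is a Python crawling framework, and a large part of what it adds to Python is asynchronous IO (it is built on Twisted). The reason we don't have something quite like it for node is that all IO in node is already asynchronous (unless you need it not to be).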
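
To make the "IO is already asynchronous" point concrete, here is a minimal sketch that uses only Node's built-in timers. `fakeRequest` and `runConcurrently` are hypothetical names invented for this illustration, and the 200 ms delay stands in for a network round trip. Three simulated requests are started together, so the total wait should be roughly one delay rather than three.

```javascript
// Simulated IO: resolves with `label` after `ms` milliseconds.
// This is a stand-in for a network request, not a real one.
const fakeRequest = (label, ms) =>
  new Promise(resolve => setTimeout(() => resolve(label), ms));

// Start all three "requests" before awaiting any of them,
// then wait for the whole batch with Promise.all.
const runConcurrently = async () => {
  const started = Date.now();
  const results = await Promise.all([
    fakeRequest('a', 200),
    fakeRequest('b', 200),
    fakeRequest('c', 200),
  ]);
  return { results, elapsedMs: Date.now() - started };
};

runConcurrently().then(({ results, elapsedMs }) => {
  // The waits overlap, so elapsedMs should be near 200 rather than 600.
  console.log(results, elapsedMs);
});
```

Awaiting each call one after another would instead serialize the waits, which is the behaviour a synchronous HTTP client gives you by default.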

Here is what a Scrapy script might look like in node. Note that the URLs are processed concurrently.

const cheerio = require('cheerio');
const axios = require('axios');

const startUrls = ['http://www.google.com/', 'http://www.amazon.com/', 'http://www.wikipedia.com/']

// this might be called a "middleware" in scrapy.
const get = async url => {
  const response = await axios.get(url)
  return cheerio.load(response.data)
}

// this too.
const output = item => {
  console.log(item)
}

// here is parse which is the initial scrapy callback
const parse = async url => {
  const $ = await get(url)
  output({url, title: $('title').text()})
}

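// and here is the main execution: every request starts at once,
// and a failed request is reported instead of becoming an unhandled rejection
Promise.all(startUrls.map(url => parse(url)))
  .catch(err => console.error(err.message))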
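
The script above starts every request at once, which is fine for three URLs but not for thousands. Scrapy caps the number of in-flight requests (its `CONCURRENT_REQUESTS` setting), and a rough equivalent in node could look like the sketch below. `mapWithLimit` and `fakeParse` are hypothetical helpers written for this illustration, not part of axios, cheerio, or Scrapy; `fakeParse` does no network IO and only stands in for `parse`.

```javascript
// Run `worker` over `items`, keeping at most `limit` calls in flight.
// Resolves with results in input order; a failed item yields { error }
// instead of rejecting the whole batch.
const mapWithLimit = async (items, limit, worker) => {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  const runner = async () => {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = await worker(items[index]);
      } catch (err) {
        results[index] = { error: err.message };
      }
    }
  };

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
};

// Hypothetical stand-in for `parse`: fails for URLs containing "bad".
const fakeParse = async url => {
  if (url.includes('bad')) throw new Error('request failed');
  return { url, title: `title of ${url}` };
};

const urls = ['http://a.example/', 'http://bad.example/', 'http://c.example/'];
mapWithLimit(urls, 2, fakeParse).then(items => console.log(items));
```

Because node runs this code on a single thread, `next++` needs no locking: a runner can only be interrupted at an `await`.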

Answer 2

Sta*_*tan 3

I haven't seen a solution as strong as Python's Scrapy for crawling/indexing whole websites, so I personally use Python Scrapy for crawling websites.

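For scraping data from individual pages, though, there is CasperJS, which can be installed through npm (it runs on PhantomJS or SlimerJS rather than inside Node.js itself). It is a very nice solution. It also works on ajax sites such as AngularJS pages, because it drives a headless browser that executes the page's JavaScript. Scrapy on its own does not execute JavaScript, so it cannot parse ajax-rendered pages. So for scraping data from one page or a few pages, I prefer CasperJS.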

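Cheerio is indeed faster than CasperJS, but it does not work with ajax pages, and it does not give you as good a code structure as CasperJS does. So I prefer CasperJS, even though you can use the cheerio package.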

CoffeeScript example:

casper.start 'https://reports.something.com/login', ->
  this.fill 'form',
    username: params.username
    password: params.password
  , true

casper.thenOpen queryUrl, {method:'POST', data:queryData}, ->
  this.click 'input'

casper.then ->
  get = (number) =>
    value = this.fetchText("tr[bgcolor= '#AFC5E4'] >  td:nth-of-type(#{number})").trim()

Asked:	11 years, 2 months ago
Viewed:	4961 times
Last active:	6 years, 4 months ago