NodeJS HTTP request queue

Question

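I created a scraper using puppeteer and Node.js (express). The idea is that when the server receives an HTTP request, my app starts scraping the page.

The problem appears when my app receives several HTTP requests at once: the scraping process starts over and over again until no more requests come in. How can I start the scrape for only one request and queue the others until the first scraping process has finished?

So far I have tried node-request-queue with the code below, but with no luck.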

var express = require("express");
var app = express();
var reload = require("express-reload");
var bodyParser = require("body-parser");
const router = require("./routes");
const RequestQueue = require("node-request-queue");

app.use(bodyParser.urlencoded({ extended: true }));
app.use(bodyParser.json());

var port = process.env.PORT || 8080;

app.use(express.static("public")); // static assets eg css, images, js

let rq = new RequestQueue(1);

rq.on("resolved", res => {})
  .on("rejected", err => {})
  .on("completed", () => {});

rq.push(app.use("/wa", router));

app.listen(port);
console.log("Magic happens on port " + port);


Answer 1

Md.*_*her 7

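node-request-queue was created for the request package, which is a different thing from express.

You can build the queue with p-queue, a very simple promise queue library. It supports concurrency and reads more clearly than the other libraries. You can easily move from promises to a more robust queue such as bull later.

Here is how you create a queue: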

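// Works with the p-queue releases that were current when this was written.
// Newer releases differ: v6 exports the class as `.default`, and v7+ is ESM-only.
const PQueue = require("p-queue");
const queue = new PQueue({ concurrency: 1 });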


Here is how you add an async function to the queue. It returns a promise, and if you await it you get the resolved data:

queue.add(() => scrape(url));


So instead of adding the route to the queue, just remove the extra lines around it and keep the router as it is.

// here goes one route
app.use('/wa', router);


In one of your router files:

const routes = require("express").Router();

const PQueue = require("p-queue");
// create a new queue, and pass how many you want to scrape at once
const queue = new PQueue({ concurrency: 1 });

// our scraper function lives outside route to keep things clean
// the dummy function returns the title of provided url
const scrape = require('../scraper');

async function queueScraper(url) {
  return queue.add(() => scrape(url));
}

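routes.post("/", async (req, res) => {
  try {
    const result = await queueScraper(req.body.url);
    res.status(200).json(result);
  } catch (err) {
    // report scraper failures instead of leaving the request hanging
    res.status(500).json({ error: err.message });
  }
});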

module.exports = routes;

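
Make sure the queue is used inside the route, not the other way around. Create only one queue, in your routes file or wherever you run the scraper.

Here is the content of the scraper file. You can put whatever you want in it; this is just a working dummy: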
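
To see why one shared queue with a concurrency of 1 makes the scrapes run one at a time, here is a rough, dependency-free sketch. SerialQueue and fakeScrape are made-up names for illustration only; they stand in for `new PQueue({ concurrency: 1 })` and the real scraper, and this is not how p-queue is implemented internally.

```javascript
// Hypothetical stand-in for a concurrency-1 promise queue:
// each task starts only after the previous one has settled.
class SerialQueue {
  constructor() {
    this.tail = Promise.resolve(); // end of the chain of scheduled tasks
  }

  // Schedule an async task; the returned promise settles with the task's outcome.
  add(task) {
    const result = this.tail.then(() => task());
    // Swallow the error on the chain only, so one failure does not block later tasks.
    this.tail = result.catch(() => {});
    return result;
  }
}

// Hypothetical stand-in for scrape(url): waits a little, then returns a fake title.
async function fakeScrape(url, ms, log) {
  log.push(`start ${url}`);
  await new Promise((resolve) => setTimeout(resolve, ms));
  log.push(`end ${url}`);
  return `title of ${url}`;
}

// Two "requests" arrive at the same time but share one queue.
async function demo() {
  const queue = new SerialQueue();
  const log = [];
  const titles = await Promise.all([
    queue.add(() => fakeScrape("a", 30, log)),
    queue.add(() => fakeScrape("b", 10, log)),
  ]);
  return { titles, log };
}

demo().then(({ titles, log }) => console.log(titles, log));
```

Even though "b" is the faster job, it only starts after "a" has ended, which is the behaviour the route gets from sharing a single queue. If every request created its own queue, nothing would be serialized.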

const puppeteer = require('puppeteer');

// a dummy scraper function
// launches a browser and gets title
async function scrape(url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await page.title();
  } finally {
    // always close the browser, even if navigation fails
    await browser.close();
  }
}

module.exports = scrape;

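
Result using curl:

Here is my git repo, which contains working code with a sample queue.

Warning

If you use any queue like this, you will notice problems when handling 100 requests at once: requests to your API will keep timing out because 99 other URLs are still waiting in the queue. That is why you will have to learn more about real queues and concurrency later on.

Once you understand how queues work, the other answers about puppeteer-cluster, RabbitMQ, bull queue and so on will help you :).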
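
One possible way to soften the timeout problem, sketched below without any dependencies, is to cap how many jobs may wait and fail fast when the cap is reached. BoundedSerialQueue, QueueFullError, handle and the 503 response are all assumptions made for this illustration; none of them come from p-queue or express.

```javascript
// Hypothetical error type used to signal "too many jobs waiting".
class QueueFullError extends Error {}

// Hypothetical serial queue that refuses new work once maxPending jobs
// are queued or running, so callers can fail fast instead of timing out.
class BoundedSerialQueue {
  constructor(maxPending) {
    this.maxPending = maxPending;
    this.pending = 0; // jobs queued or currently running
    this.tail = Promise.resolve();
  }

  add(task) {
    if (this.pending >= this.maxPending) {
      return Promise.reject(new QueueFullError("queue is full"));
    }
    this.pending += 1;
    const result = this.tail.then(() => task());
    const done = () => { this.pending -= 1; };
    // Free the slot whether the task succeeded or failed.
    this.tail = result.then(done, done);
    return result;
  }
}

// Maps a queue outcome to an HTTP-style response object. In an Express
// handler the same branches would call res.status(...).json(...).
async function handle(queue, task) {
  try {
    return { status: 200, body: await queue.add(task) };
  } catch (err) {
    if (err instanceof QueueFullError) {
      return { status: 503, body: "busy, retry later" };
    }
    return { status: 500, body: err.message };
  }
}
```

With a cap of 2, a third simultaneous request gets an immediate 503 instead of waiting behind the others. Whether rejecting is acceptable depends on the client; a persistent job queue is the sturdier option when every URL must eventually be processed.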

Answer 2

Tho*_*orf 5

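You can use puppeteer-cluster for this (disclaimer: I am the author). You can set up a cluster with a pool of only one worker, so the jobs given to the cluster are executed one after another.

Since you did not say what your puppeteer script should do, this code example extracts the page title (requested via /wa?url=...) and sends the result in the response.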

const express = require('express');
const { Cluster } = require('puppeteer-cluster');

const app = express();

(async () => {
    // set up the cluster with only one worker in the pool
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 1,
    });

    // define your task (in this example we extract the title of the given page)
    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url);
        return await page.evaluate(() => document.title);
    });

    // listen for the request
    app.get('/wa', async function (req, res) {
        // cluster.execute runs the job with the workers in the pool. As there is
        // only one worker in the pool, the jobs run sequentially
        const result = await cluster.execute(req.query.url);
        res.end(result);
    });

    // 8080 is the default port from the question's code
    app.listen(8080);
})();


This is a minimal example. You might want to catch any errors in your listener. For more information, check out the more complex example of a screenshot server using express in the repository.
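
As one way to catch those errors, here is a small sketch of a wrapper that forwards a rejected async handler to Express's `next` callback. wrapAsync is a hypothetical helper name, not part of express or puppeteer-cluster, and the fake `req`/`res` objects exist only so the snippet runs on its own.

```javascript
// Hypothetical helper: wraps an async route handler so that a rejected
// promise is passed to next(err) instead of becoming an unhandled rejection.
function wrapAsync(handler) {
  return (req, res, next) => {
    Promise.resolve()
      .then(() => handler(req, res, next))
      .catch(next);
  };
}

// Stand-alone demo with fake req/res objects. The thrown error imitates
// a cluster.execute(...) call that failed.
const failingRoute = wrapAsync(async (req, res) => {
  throw new Error(`could not load ${req.query.url}`);
});

failingRoute(
  { query: { url: "http://example.invalid" } },
  { end: (body) => console.log("sent:", body) },
  (err) => console.log("forwarded to error handler:", err.message)
);
```

In the real app this would look like `app.get('/wa', wrapAsync(async (req, res) => { ... }))`, combined with an error-handling middleware of the form `app.use((err, req, res, next) => ...)` that sends a 500 response. As far as I know, Express 5 forwards rejections from async handlers on its own, so the wrapper mainly matters on Express 4.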
