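I'm trying to scrape different websites with puppeteer. Since I already use puppeteer-extra (for its Stealth plugin), I decided to use its anonymize-ua plugin to randomize the default user agent and further reduce detection.
I tried to follow their instructions, but when I log the browser's actual user agent, the change doesn't seem to take effect.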
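One way to narrow this down is to read `navigator.userAgent` from inside a page after setting the user agent directly, without going through the plugin. The sketch below is a diagnostic only, not a fix for the plugin. `PageLike` and `applyAndVerifyUserAgent` are hypothetical names introduced here; the interface lists just the two methods the check needs (`setUserAgent` and `evaluate`, both of which exist on Puppeteer's `Page`).

```typescript
// Hypothetical minimal shape of a page, introduced for this sketch.
// Puppeteer's Page has methods with these names.
interface PageLike {
  setUserAgent(userAgent: string): Promise<void>;
  evaluate(expression: string): Promise<unknown>;
}

// Sets the user agent on one page, then reads back what the page itself reports.
// Returns true when the reported value matches the expected one.
async function applyAndVerifyUserAgent(page: PageLike, expected: string): Promise<boolean> {
  await page.setUserAgent(expected);
  const actual = await page.evaluate('navigator.userAgent');
  if (actual !== expected) {
    console.warn(`User agent mismatch: expected "${expected}", got "${String(actual)}"`);
  }
  return actual === expected;
}
```

If this check passes while the plugin-based setup does not, the difference probably lies in how or when the plugin applies its override, which the excerpt does not show.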
Below is an example of what I'm doing:
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import UserAgent from 'user-agents';

const scrape = async (url: string) => {
  // Set stealth plugin
  const stealthPlugin = StealthPlugin();
  puppeteer.use(stealthPlugin);

  // Create random user-agent to be set through plugin
  const userAgent = new UserAgent({ platform: 'MacIntel', deviceCategory: 'desktop' });
  const userAgentStr = userAgent.toString();
  console.log(`User Agent: ${userAgentStr}`);

  const anonymizeUserAgentPlugin = require('puppeteer-extra-plugin-anonymize-ua')({
    customFn: () => userAgentStr
  });
  puppeteer.use(anonymizeUserAgentPlugin);

  puppeteer
    .launch({ headless: …

I'm using AWS SDK (v3) in my NodeJS/Typescript application, specifically their DynamoDBDocumentClient to easily marshall/unmarshall my entities to reduce the amount of code needed to query the database.
My entities are complex objects: an instance may hold, for example, another class type or an array of them. I couldn't find any tutorials online that explain what I'm missing (maybe I'm not missing anything and this is how it has to be done), as the document-client makes me marshall them …
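To illustrate the nesting problem described above, here is a minimal sketch that turns an entity holding other class instances into plain objects and arrays before it is handed to a client. It uses no SDK calls. `Address`, `Customer`, and `toPlain` are hypothetical names made up for this example. The SDK's marshalling options also include `convertClassInstanceToMap`, which may make a manual step like this unnecessary; whether it covers the case in the question cannot be told from the excerpt.

```typescript
// Hypothetical entity classes, used only for illustration.
class Address {
  constructor(public city: string, public zip: string) {}
}

class Customer {
  constructor(public id: string, public addresses: Address[], public primary: Address) {}
}

type Plain = string | number | boolean | null | Plain[] | { [key: string]: Plain };

// Recursively copies own enumerable properties so that nested class instances
// and arrays of them become plain objects and arrays. Undefined properties are skipped.
function toPlain(value: unknown): Plain {
  if (value === null) {
    return null;
  }
  if (typeof value === 'string' || typeof value === 'number' || typeof value === 'boolean') {
    return value;
  }
  if (Array.isArray(value)) {
    return value.map((inner) => toPlain(inner));
  }
  if (typeof value === 'object' && value !== null) {
    const out: { [key: string]: Plain } = {};
    for (const [key, inner] of Object.entries(value)) {
      if (inner !== undefined) {
        out[key] = toPlain(inner);
      }
    }
    return out;
  }
  throw new TypeError(`Unsupported value of type ${typeof value}`);
}

const customer = new Customer('c-1', [new Address('Lisbon', '1000')], new Address('Porto', '4000'));
const item = toPlain(customer);
```

This sketch does not treat `Date`, `Set`, `Map`, or binary values specially, so entities that contain them would need extra cases.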
node.js amazon-dynamodb typescript dynamodb-queries aws-sdk-js-v3
As you can see in the sample code below, I'm using Puppeteer with a cluster of workers in Node to run multiple website screenshot requests for given URLs:
const cluster = require('cluster');
const express = require('express');
const bodyParser = require('body-parser');
const puppeteer = require('puppeteer');

async function getScreenshot(domain) {
  let screenshot;
  const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'] });
  const page = await browser.newPage();
  try {
    await page.goto('http://' + domain + '/', { timeout: 60000, waitUntil: 'networkidle2' });
  } catch (error) {
    try {
      await page.goto('http://' + domain + '/', { timeout: 120000, waitUntil: 'networkidle2' });
      screenshot = await page.screenshot({ type: 'png', encoding: 'base64' });
    } …

node.js web-scraping node-cluster google-chrome-headless puppeteer
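The nested try/catch in the excerpt retries navigation with a longer timeout. That pattern can be written as a small helper, sketched below under the assumption that each attempt is independent and safe to repeat. `withEscalatingTimeouts` is a hypothetical name, not a Puppeteer API, and the excerpt is truncated, so this is not necessarily what the original code goes on to do.

```typescript
// Runs `attempt` once per timeout value, in order, and returns the first
// successful result. Rethrows the last error if every attempt fails.
async function withEscalatingTimeouts<T>(
  timeoutsMs: number[],
  attempt: (timeoutMs: number) => Promise<T>,
): Promise<T> {
  let lastError: unknown = new Error('No timeouts were provided');
  for (const timeoutMs of timeoutsMs) {
    try {
      return await attempt(timeoutMs);
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}

// Possible use with the calls from the excerpt (not executed here):
// const screenshot = await withEscalatingTimeouts([60000, 120000], async (timeout) => {
//   await page.goto('http://' + domain + '/', { timeout, waitUntil: 'networkidle2' });
//   return page.screenshot({ type: 'png', encoding: 'base64' });
// });
```

With this shape the screenshot is taken after whichever navigation attempt succeeds, and the caller can still close the browser in a `finally` block around the whole call.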