Suppose I am given a page like this:
<div id="details-container" class="style-scope ytd-channel-about-metadata-renderer">
<yt-formatted-string class="subheadline style-scope ytd-channel-about-metadata-renderer">Details</yt-formatted-string>
<table class="style-scope ytd-channel-about-metadata-renderer">
<tbody class="style-scope ytd-channel-about-metadata-renderer"><tr class="style-scope ytd-channel-about-metadata-renderer">
<td class="label style-scope ytd-channel-about-metadata-renderer">
<yt-formatted-string class="style-scope ytd-channel-about-metadata-renderer"></yt-formatted-string>
</td>
<td class="style-scope ytd-channel-about-metadata-renderer">
<ytd-button-renderer align-by-text="" class="style-scope ytd-channel-about-metadata-renderer" button-renderer=""></ytd-button-renderer>
<div id="captcha-container" class="style-scope ytd-channel-about-metadata-renderer"></div>
<div id="email-container" class="style-scope ytd-channel-about-metadata-renderer"></div>
<a id="email" target="_blank" class="style-scope ytd-channel-about-metadata-renderer" href="mailto:undefined" hidden=""></a>
</td>
</tr>
<tr class="style-scope ytd-channel-about-metadata-renderer">
<td class="label style-scope ytd-channel-about-metadata-renderer">
<yt-formatted-string class="style-scope ytd-channel-about-metadata-renderer"><span class="deemphasize style-scope yt-formatted-string"> Location: </span></yt-formatted-string>
</td>
<td class="style-scope ytd-channel-about-metadata-renderer">
<yt-formatted-string class="style-scope ytd-channel-about-metadata-renderer">YourCountry</yt-formatted-string>
</td>
</tr>
</tbody></table>
</div>
Suppose I need to get "YourCountry"; how do I actually get this element?
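One possible way to read such a value is to look up the row by its label rather than by position. The following is only a sketch: `pickValue` is a hypothetical helper name, and it assumes the rows are reachable with an ordinary selector once the page has rendered.

```javascript
// Hypothetical helper (not part of Puppeteer or Apify): given the <tr> elements
// of the details table, return the trimmed text of the cell that follows the
// label cell containing `label`, or null if no row matches.
function pickValue(rows, label) {
    for (const row of rows) {
        const cells = row.querySelectorAll('td');
        // Skip rows without a label/value pair of cells.
        if (cells.length >= 2 && cells[0].textContent.includes(label)) {
            return cells[1].textContent.trim();
        }
    }
    return null;
}

// Assumed usage inside a Puppeteer page handler (selector taken from the markup above):
// const location = await page.$$eval('#details-container tr', pickValue, 'Location');
```

Matching on the label text keeps the lookup working if other rows are added or reordered, at the cost of depending on the label's language.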
So far I have tried:
const location …

Apify can crawl links from sitemap.xml:
const Apify = require('apify');
Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ requestsFromUrl: 'https://edition.cnn.com/sitemaps/cnn/news.xml' }],
    });
    await requestList.initialize();
    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction: async ({ page, request }) => {
            console.log(`Processing ${request.url}...`);
            await Apify.pushData({
                url: request.url,
                title: await page.title(),
                html: await page.content(),
            });
        },
    });
    await crawler.run();
    console.log('Done.');
});
https://sdk.apify.com/docs/examples/puppeteersitemap#docsNav
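The example above leaves the sitemap parsing to `requestsFromUrl`. As a rough sketch of doing that extraction by hand, the hypothetical helper `extractSitemapUrls` below pulls the `<loc>` entries out of sitemap XML that has already been downloaded. It assumes plain `<loc>` elements without CDATA wrappers and does not decode XML entities such as `&amp;`.

```javascript
// Hypothetical helper: collect every <loc> URL from a sitemap XML string.
function extractSitemapUrls(xml) {
    const urls = [];
    // Capture the non-whitespace text between <loc> and </loc>.
    const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
    let match;
    while ((match = locPattern.exec(xml)) !== null) {
        urls.push(match[1]);
    }
    return urls;
}

// Assumed usage: `xml` holds the downloaded sitemap body (fetching it is left out here).
// for (const url of extractSitemapUrls(xml)) {
//     await requestQueue.addRequest({ url });
// }
```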
However, if I use a requestQueue, I am not sure how to crawl the links from sitemap.xml. For example:
const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest({ url: "https://google.com" });
//this is not working. Apify is simply crawling sitemap.xml
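//and not adding …

As far as I understand from various blogs, sites like 2captcha are human-powered image and captcha recognition services. Their main purpose is to solve your captchas quickly and accurately, using employees who are always online to receive a captcha and eventually solve it.
Now let's take https://www.google.com/recaptcha/api2/demo as an example. Suppose a captcha is generated; services like 2captcha ask for the data-sitekey that is generated for each captcha.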
data-sitekey="6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-"
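What I don't understand is how a captcha solver can replicate/reproduce the captcha on its end using only the data-sitekey. Does Google provide any service for replicating it?
How does the person on the other end receive the same captcha on their side, solve it, and send it back?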
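For context, the data-sitekey mentioned above is an attribute in the page's public markup, so anyone who loads the page can read it. A minimal sketch of reading it from an HTML string follows; `extractSiteKey` is a hypothetical helper, and it assumes the attribute is written with quotes as in the snippet above. It only shows where the value comes from, not how a solving service uses it.

```javascript
// Hypothetical helper: return the value of the first data-sitekey attribute
// found in an HTML string, or null if there is none.
function extractSiteKey(html) {
    const match = /data-sitekey=["']([^"']+)["']/.exec(html);
    return match ? match[1] : null;
}

// Assumed usage with a Puppeteer page:
// const siteKey = extractSiteKey(await page.content());
```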
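
I am using Apify's puppeteer to log in to this website. I did look into similar questions, but to no avail.
I cannot find a clickable ID/element for the main login button seen on the linked login page. Currently, my code looks like this: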
const Apify = require('apify');
Apify.main(async () => {
    const input = await Apify.getValue('INPUT');
    const browser = await Apify.launchPuppeteer();
    const page = await browser.newPage();
    await page.goto('https://www.sunpass.com/vector/account/home/accountLogin.do');
    // Login
    await page.type('#tt_username1', input.username);
    await page.type('#tt_loginPassword1', input.password);
    await page.waitFor(2000);
    await page.click('#entryform input');
    await page.waitForNavigation();
    // Get cookies
    const cookies = await page.cookies();
    // Use cookies in other tab or browser
    const page2 = await browser.newPage();
    await page2.setCookie(...cookies);
    await page2.goto('https://www.sunpass.com/vector/account/transactions/webtransactionSearch.do'); // Opens page as …
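
Separately from finding the right selector, the click-then-wait sequence in the code above can miss a navigation that starts before `waitForNavigation()` is called. A common pattern is to start waiting before clicking. The sketch below uses a hypothetical helper name, `clickAndWaitForNavigation`, and the selector in the usage comment is a placeholder, not the site's real login button.

```javascript
// Hypothetical helper: begin waiting for the navigation first, then click,
// so a fast navigation cannot finish before the wait is registered.
async function clickAndWaitForNavigation(page, selector) {
    const [response] = await Promise.all([
        page.waitForNavigation(), // registered before the click is issued
        page.click(selector),
    ]);
    return response;
}

// Assumed usage (placeholder selector; replace with the actual login button):
// await clickAndWaitForNavigation(page, 'button[name="btnLogin"]');
```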