Selenium HtmlUnitDriver web scrape gets a CAPTCHA page from an EC2 server

Question

use*_*934 7 selenium htmlunit web-scraping selenium-webdriver htmlunit-driver

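I wrote a simple web scraper for expedia.com. It uses Java Selenium HtmlUnitDriver, and when I run it locally I can scrape data from the site successfully.

However, when I deploy it to an EC2 server, it always returns a page where Expedia has detected it as a bot and shows a CAPTCHA to prove that a human is visiting.

I think it might have something to do with the IP addresses of EC2 servers, which expedia.com has somehow blacklisted?

I have tried scraping other websites, and they do not care or do not run a human test.

Any idea how to solve this?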

Things I tried that were still detected as a bot:

- Changing the user agent to the one my local browser uses
- Setting a proxy
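
One way to separate these two attempts from the scraper itself is to send a plain HTTP request from the EC2 instance with the same user agent and, optionally, the same proxy. The sketch below is a diagnostic built on the JDK's `java.net.http` client (Java 11 or later), not on HtmlUnitDriver, so it does not reproduce the original scraper. The class name `EgressCheck`, the `USER_AGENT` string, and the proxy host and port are placeholders.

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class EgressCheck {
    // Placeholder: replace with the exact User-Agent your local browser sends.
    static final String USER_AGENT =
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36";

    // Builds a client that goes direct when proxyHost is null, or through the given proxy otherwise.
    static HttpClient buildClient(String proxyHost, int proxyPort) {
        HttpClient.Builder builder = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(15))
                .followRedirects(HttpClient.Redirect.NORMAL);
        if (proxyHost != null) {
            builder.proxy(ProxySelector.of(new InetSocketAddress(proxyHost, proxyPort)));
        }
        return builder.build();
    }

    // Builds a GET request that carries the supplied User-Agent header.
    static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1 && args.length != 3) {
            System.err.println("usage: EgressCheck <url> [proxyHost proxyPort]");
            return;
        }
        String proxyHost = args.length == 3 ? args[1] : null;
        int proxyPort = args.length == 3 ? Integer.parseInt(args[2]) : 0;
        HttpResponse<String> response = buildClient(proxyHost, proxyPort)
                .send(buildRequest(args[0], USER_AGENT), HttpResponse.BodyHandlers.ofString());
        String body = response.body();
        // Print the status and the start of the body so the two environments can be compared.
        System.out.println("status: " + response.statusCode());
        System.out.println(body.substring(0, Math.min(300, body.length())));
    }
}
```

Running it once locally and once on the EC2 instance with the same URL gives two responses to compare. If only the EC2 run is challenged, that points toward the IP address rather than the user agent, although it does not prove it.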

Update: setting a proxy server actually gives me a different error.

The current URL is https://www.expedia.com/things-to-do/search?location=Paris&pageNumber=1

htmlString:

<!--?xml version="1.0" encoding="ISO-8859-1"?-->
<html>
 <head> 
  <title>
      500 Internal Server Error
    </title> 
 </head> 
 <body> 
  <h1> Internal Server Error </h1> 
  <p> The server encountered an internal error or misconfiguration and was unable to complete your request. </p> 
  <p> Please contact the server administrator at [no address given] to inform them of the time this error occurred, and the actions you performed just before this error. </p> 
  <p> More information about this error may be available in the server error log. </p> 
  <hr> 
  <address> Apache/2.4.18 (Ubuntu) Server at www.expedia.com Port 443 </address>   
 </body>
</html>
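
Because the scraper can now fail in two different ways (a CAPTCHA page or the 500 page shown here), it may help to label each response before parsing it. The following is a minimal sketch in plain Java. The class name `PageClassifier` and the markers it checks (a title starting with a 5xx status code, and the word "captcha" anywhere in the markup) are assumptions for illustration and are not taken from Expedia's actual pages.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PageClassifier {
    enum Kind { CAPTCHA, SERVER_ERROR, CONTENT }

    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern STATUS_5XX = Pattern.compile("^5\\d\\d\\b");

    // Returns the trimmed text of the first <title> element, or "" if there is none.
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Heuristic only: the markers below are assumed, so adjust them to the pages you actually receive.
    static Kind classify(String html) {
        if (STATUS_5XX.matcher(title(html)).find()) {
            return Kind.SERVER_ERROR;
        }
        if (html.toLowerCase(Locale.ROOT).contains("captcha")) {
            return Kind.CAPTCHA;
        }
        return Kind.CONTENT;
    }
}
```

With Selenium, the input would typically be `driver.getPageSource()`. Logging the resulting `Kind` per run makes it easier to see whether a change such as a new proxy swapped one failure for another.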


Answer 1

bke*_*mer 2

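Have you covered the following points?

- Which user agent are you using? Make sure it is the same user agent you use when browsing manually. See this link for more details.

- Are you inserting waits into your navigation? If you click or navigate the instant a page finishes loading, you are not simulating regular navigation. More details.

- Which driver are you using? chromedriver has a trick: rename the internal variable "cdc_" to something else, such as "aaa_", so that any JavaScript from the server that tries to detect this variable (cdc_) fails. More details.

- If you really need to avoid detection by the server, there is more to investigate: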

  - Is there a honeypot in place?
  - Is your IP (the EC2 IP) already blocked? You could route the traffic through a VPN tunnel.
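
The point about waits in the list above can be sketched as a small helper that sleeps for a random interval between navigation steps. This is an illustration in plain Java. The class name `HumanPause` and the delay bounds in the usage note are hypothetical, and nothing here guarantees that a given site will treat the traffic as human.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

final class HumanPause {
    // Picks a delay uniformly at random between min and max, both inclusive (millisecond resolution).
    static Duration randomDelay(Duration min, Duration max) {
        if (min.isNegative() || max.compareTo(min) < 0) {
            throw new IllegalArgumentException("need 0 <= min <= max");
        }
        long millis = ThreadLocalRandom.current().nextLong(min.toMillis(), max.toMillis() + 1);
        return Duration.ofMillis(millis);
    }

    // Blocks the current thread for a random delay in [min, max].
    static void pause(Duration min, Duration max) throws InterruptedException {
        Thread.sleep(randomDelay(min, max).toMillis());
    }
}
```

A call such as `HumanPause.pause(Duration.ofSeconds(2), Duration.ofSeconds(6))` would go between loading a page and the next click or navigation. Varying the interval avoids the fixed rhythm that a constant `Thread.sleep` produces.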

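The "cdc_" rename mentioned for chromedriver is usually done as a same-length byte replacement on a copy of the driver binary. The sketch below shows that idea in plain Java (Java 11 or later for `Path.of`). The class name `CdcPatcher` and the command-line arguments are hypothetical, and whether a given chromedriver build still contains this marker, or still runs after patching, is an assumption to verify yourself. It does not apply to HtmlUnitDriver.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class CdcPatcher {
    // Replaces every occurrence of 'from' with 'to' in place and returns the number of replacements.
    // The two patterns must have the same length so that offsets inside the binary do not shift.
    static int replaceAll(byte[] data, byte[] from, byte[] to) {
        if (from.length == 0 || from.length != to.length) {
            throw new IllegalArgumentException("patterns must be non-empty and of equal length");
        }
        int count = 0;
        int i = 0;
        while (i + from.length <= data.length) {
            if (matchesAt(data, i, from)) {
                System.arraycopy(to, 0, data, i, to.length);
                count++;
                i += from.length;
            } else {
                i++;
            }
        }
        return count;
    }

    private static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        for (int j = 0; j < pattern.length; j++) {
            if (data[offset + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: CdcPatcher <input-binary> <output-binary>");
            return;
        }
        byte[] data = Files.readAllBytes(Path.of(args[0]));
        int n = replaceAll(data,
                "cdc_".getBytes(StandardCharsets.US_ASCII),
                "aaa_".getBytes(StandardCharsets.US_ASCII));
        // Write to a separate file so the original driver stays untouched.
        Files.write(Path.of(args[1]), data);
        System.out.println("replaced " + n + " occurrence(s)");
    }
}
```

The output file will likely need its executable permission set again before Selenium can launch it.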

Interesting articles:

https://www.kdnuggets.com/2018/02/web-scraping-tutorial-python.html

https://antoinevastel.com/bot%20detection/2017/08/05/detect-chrome-headless.html

https://intoli.com/blog/making-chrome-headless-unDetectable/

Archived: 7 years, 6 months ago
Views: 343
Last recorded: 7 years, 5 months ago