PHP curl: accessing a website behind Cloudflare (2021)

Ham*_*y75 6 php curl cloudflare

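I have used curl to parse websites for years, but one site does something I don't understand. Inspecting what it returns, I found that it uses Cloudflare, and after investigating I found that it uses some mechanism to reject bots while allowing real users.

I don't understand how it does this, because it returns a 403 code before sending any content, yet the same request from Chrome works fine.

I also tested the "Copy as cURL (bash)" command from Chrome's inspector on the command line, with the same result.

This is the code I am using: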

$headers=array(
    'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language: es-ES,es;q=0.9',
    'upgrade-insecure-requests: 1',
    //'Referrer Policy: strict-origin-when-cross-origin',
    //'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
);

$agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36";


$url="https://www.pccomponentes.com/";

//$agent= 'Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$agent = 'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)';

$ch = curl_init();
//curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//curl_setopt($ch, CURLOPT_HEADER, 0);
//curl_setopt($ch, CURLOPT_POST, 0);
//curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
//curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
//curl_setopt($ch, CURLOPT_MAXREDIRS, 20);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
//curl_setopt($ch, CURLOPT_LOW_SPEED_LIMIT, 1); 
//curl_setopt($ch, CURLOPT_LOW_SPEED_TIME, 360); 
//curl_setopt($ch, CURLOPT_IGNORE_CONTENT_LENGTH, 1); 
//curl_setopt($ch, CURLOPT_TCP_NODELAY, 1); 
curl_setopt($ch, CURLOPT_HTTPHEADER,$headers);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
echo "code: ".curl_getinfo($ch,CURLINFO_HTTP_CODE ).PHP_EOL;
//echo $result;

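As you can see from the commented-out lines, I have tried many different solutions, different user agents, and different curl options, but I always get a 403 code.

The equivalent curl command line is: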

curl -I -vvv 'https://www.pccomponentes.com/' \
  -H 'authority: www.pccomponentes.com' \
  -H 'sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'sec-fetch-site: none' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-user: ?1' \
  -H 'sec-fetch-dest: document' \
  -H 'accept-language: es-ES,es;q=0.9' \
  --compressed

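To check in Google Chrome, I open an incognito window with no cookies at all, then open the inspector and enter the URL.

The output of the script (the same as command-line curl) is: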

*   Trying 104.16.162.71:443...
* TCP_NODELAY set
* Connected to www.pccomponentes.com (104.16.162.71) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=CA; L=San Francisco; O=Cloudflare, Inc.; CN=sni.cloudflaressl.com
*  start date: Aug 11 00:00:00 2020 GMT
*  expire date: Aug 11 12:00:00 2021 GMT
*  subjectAltName: host "www.pccomponentes.com" matched cert's "*.pccomponentes.com"
*  issuer: C=US; O=Cloudflare, Inc.; CN=Cloudflare Inc ECC CA-3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0xaaab008552b0)
> GET /listado/ajax?idShops%5B%5D=0&page=0&order=price-desc&gtmTitle=Tarjetas%20Gr%C3%A1ficas&idFamilies%5B%5D=6 HTTP/2
Host: www.pccomponentes.com
user-agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-language: es-ES,es;q=0.9
upgrade-insecure-requests: 1

* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
< HTTP/2 403 
< date: Sat, 01 May 2021 09:28:32 GMT
< content-type: text/html; charset=UTF-8
< cf-chl-bypass: 1
< set-cookie: __cfduid=db6d6b293bbc3a77fe7f7b90ec55cebc31619861312; expires=Mon, 31-May-21 09:28:32 GMT; path=/; domain=.pccomponentes.com; HttpOnly; SameSite=Lax
< cache-control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< expires: Thu, 01 Jan 1970 00:00:01 GMT
< x-frame-options: SAMEORIGIN
< cf-request-id: 09c8db2a8c0000611f910c2000000001
< expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
< server: cloudflare
< cf-ray: 6487faf0d82d611f-BCN
< 
* Connection #0 to host www.pccomponentes.com left intact
code: 403

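I have been searching for information about:

  • old SSL session ID is stale, removing

but with no luck.

What kind of protection does it use? I have seen something about JS, but no JS is even loaded when the 403 code has already been returned. I have seen some comments about CAPTCHA, but that is impossible before anything is sent. Chrome gets code 200 and curl gets 403.

I have also tried HTTP/1.1, different encodings, gzip, etc. No luck at all.

It seems they changed the system recently.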

erf*_*fan 6

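Cloudflare inspects the headers and the request it receives to decide whether the sender is a bot. You can send a request without any extra headers, and if the server does not check them there is no problem. When it does check, you should try to make your request resemble the one a browser would send.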
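
As a rough illustration of making the request resemble a browser's, the sketch below turns a name/value map into the `name: value` strings that `CURLOPT_HTTPHEADER` expects. The helper name `buildHeaderLines` is hypothetical, and the header values are copied from the Chrome request in the question; sending them is not guaranteed to get past the check.

```php
<?php
// Hypothetical helper: converts ['name' => 'value'] into the list of
// "name: value" strings that CURLOPT_HTTPHEADER expects.
function buildHeaderLines(array $headers): array
{
    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return $lines;
}

// Header values taken from the Chrome request shown in the question.
$browserHeaders = [
    'accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language' => 'es-ES,es;q=0.9',
    'upgrade-insecure-requests' => '1',
    'sec-fetch-site' => 'none',
    'sec-fetch-mode' => 'navigate',
    'sec-fetch-user' => '?1',
    'sec-fetch-dest' => 'document',
];

$headerLines = buildHeaderLines($browserHeaders);
// Pass the result to curl: curl_setopt($ch, CURLOPT_HTTPHEADER, $headerLines);
```

Keeping the headers in a map makes it easier to compare them one by one against what the browser's inspector shows.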

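This is the default response the first time you open the page. In a browser the first visit also gets a 403, but later visits do not, because of the cookie. You can use the same cookie in your request.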

To test this, delete the cookies and reload the page: on the first load without a cookie you will get the 403 and the CAPTCHA again.
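
To tell this kind of 403 apart from one produced by the site itself, one option is to look at the response headers. The sketch below is a heuristic only, not an official Cloudflare rule; the helper names `parseHeaderBlock` and `looksLikeCloudflareBlock` are hypothetical, and the sample headers are taken from the verbose output in the question.

```php
<?php
// Hypothetical helper: parses a raw header block (as returned when
// CURLOPT_HEADER is enabled) into a lowercase name => value map.
function parseHeaderBlock(string $rawHeaders): array
{
    $parsed = [];
    foreach (preg_split('/\r\n|\n/', trim($rawHeaders)) as $line) {
        $pos = strpos($line, ':');
        if ($pos === false) {
            continue; // status line or blank line
        }
        $name = strtolower(trim(substr($line, 0, $pos)));
        $parsed[$name] = trim(substr($line, $pos + 1));
    }
    return $parsed;
}

// Heuristic: a 403 whose "server" header names Cloudflare was most likely
// generated by Cloudflare rather than by the origin site.
function looksLikeCloudflareBlock(int $status, array $headers): bool
{
    return $status === 403
        && strtolower($headers['server'] ?? '') === 'cloudflare';
}

// Sample headers copied from the output in the question.
$raw = "HTTP/2 403\r\nserver: cloudflare\r\ncf-ray: 6487faf0d82d611f-BCN\r\n\r\n";
$blocked = looksLikeCloudflareBlock(403, parseHeaderBlock($raw));
```

With a real request, the header block can be obtained by enabling `CURLOPT_HEADER` and cutting the response at `curl_getinfo($ch, CURLINFO_HEADER_SIZE)`.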

Example:

$options = [
    CURLOPT_URL => "https://www.pccomponentes.com/",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_SSL_VERIFYHOST => false,
    CURLOPT_SSL_VERIFYPEER => false,
    CURLOPT_HTTPHEADER => [
        'accept: application/json, text/plain, */*',
        'Accept-Language: en-US,en;q=0.5',
        'x-application-type: WebClient',
        'x-client-version: 2.10.4',
        'Origin: https://www.googe.com',
        'user-agent: Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0',
    ]
];

$ch = curl_init();
curl_setopt_array($ch, $options);
$result = curl_exec($ch);
curl_close($ch);
print_r($result);

Result: shown as a screenshot in the original post.

The request you send from PHP carries no cookie, so you will always get a 403. You can use CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE to work with cookies in PHP curl.

https://curl.se/docs/http-cookies.html
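
A minimal sketch of the cookie options, assuming a writable temporary file as the cookie store. The helper name `cookieJarOptions` and the file name `cf_cookies.txt` are hypothetical. This only persists cookies the server sets between requests; it does not by itself solve a challenge.

```php
<?php
// Hypothetical helper: curl options that read cookies from, and write them
// back to, the same file so they persist across requests.
function cookieJarOptions(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies sent with the request
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies saved when the handle is closed or freed
    ];
}

// Assumed location for the cookie file; any writable path works.
$cookieFile = sys_get_temp_dir() . '/cf_cookies.txt';
$options = cookieJarOptions('https://www.pccomponentes.com/', $cookieFile);

// Usage (performs a network request, so it is left commented out):
// $ch = curl_init();
// curl_setopt_array($ch, $options);
// $result = curl_exec($ch);
// curl_close($ch);
```

Pointing both options at the same file means a second run of the script sends back whatever cookies the first run received.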

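  • ```Cloudflare inspects the headers and the request it receives to decide whether the sender is a bot``` - that is not the whole story. Cloudflare now also checks subtle differences in the client's TLS implementation to detect whether it is libcurl. I don't remember the exact details, but even if the headers and the request are 100% identical, Cloudflare can still tell from the TLS negotiation whether the client is based on libcurl. (3 upvotes)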
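
The TLS handshake comes from the TLS library that libcurl is linked against, not from the headers you set. As a small sketch, `curl_version()` shows which library and version your PHP build uses; the variable names are illustrative, and this only reports the backend without changing how the handshake looks.

```php
<?php
// curl_version() reports how the libcurl behind PHP was built, including
// the TLS library it uses.
$info = curl_version();
$tlsBackend = $info['ssl_version']; // TLS library name and version string
$curlVersion = $info['version'];    // libcurl version string

echo "libcurl {$curlVersion}, TLS backend: {$tlsBackend}" . PHP_EOL;
```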