Unable to scrape different company names from a static webpage using the requests module

MIT*_*THU -2 python beautifulsoup web-scraping python-3.x python-requests

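I've created a script that uses the requests module to collect the different company names from this website, but when I run it, it returns nothing. I looked for the company names in the page source and found them there, so they appear to be static.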

import requests
from bs4 import BeautifulSoup

link = 'https://clutch.co/agencies/digital-marketing'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("h3.company_info > a"):
        print(item.text)

bad*_*ker 8

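Based on the output of the code below, the site returns a status code of 403, which means the client is forbidden from accessing a valid URL.

The headers of this response indicate that the site is protected by Cloudflare:

"Server": "cloudflare", "CF-RAY": "78d95f0bafebad68-ATL"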

import requests

link = 'https://clutch.co/agencies/digital-marketing'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    print(res.status_code)
    # 403
    print('\n')
    print(res.headers)
    # {'Date': 'Sun, 22 Jan 2023 15:37:30 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'close', 'Permissions-Policy': 'accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()', 'Referrer-Policy': 'same-origin', 'X-Frame-Options': 'SAMEORIGIN', 'Cache-Control': 'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Set-Cookie': '__cf_bm=SR3boTo67liuRCP9u9YJcmvRZKWm5jFrnJcxtKXB42c-1674401850-0-AWOak5THdaypQLptfJnLhSTY5z2JO5+6rWurdKJQLQBPXB5tYhE0Z4NYGvJ3mjcG89KTFEkgKruhJ8XN/kTnfpo=; path=/; expires=Sun, 22-Jan-23 16:07:30 GMT; domain=.clutch.co; HttpOnly; Secure; SameSite=None', 'Vary': 'Accept-Encoding', 'Strict-Transport-Security': 'max-age=2592000', 'Server': 'cloudflare', 'CF-RAY': '78d95f0bafebad68-ATL', 'Content-Encoding': 'gzip', 'alt-svc': 'h3=":443"; ma=86400, h3-29=":443"; ma=86400'}
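
The manual inspection above (status code plus the `Server` and `CF-RAY` headers) could also be wrapped in a small check. This is only a sketch: the function name `is_cloudflare_block` is made up for illustration, and treating "403 plus Cloudflare headers" as a block is an assumption drawn from the response shown above, not an official Cloudflare rule.

```python
def is_cloudflare_block(status_code, headers):
    """Guess whether a response was refused by Cloudflare.

    Hypothetical helper: it only looks at the two signals seen in the
    response above (a 403 status and Cloudflare-specific headers).
    """
    # Header names are case-insensitive, so normalize them first.
    normalized = {str(key).lower(): str(value) for key, value in headers.items()}
    served_by_cloudflare = (
        normalized.get("server", "").lower() == "cloudflare"
        or "cf-ray" in normalized
    )
    return status_code == 403 and served_by_cloudflare
```

With a `requests` response it would be called as `is_cloudflare_block(res.status_code, res.headers)`.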


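Since the site is protected by Cloudflare, you can use a Python module named cloudscraper, which tries to bypass Cloudflare's anti-bot page.

With that module, you can get the data you need.

For example: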
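
A minimal sketch of that approach follows; it is not a definitive implementation. It assumes `cloudscraper` and `beautifulsoup4` (with `lxml`) are installed, that `cloudscraper.create_scraper()` gets past the challenge from your network, and that the `h3.company_info > a` selector from the question still matches the page markup. The helper names (`fetch_html`, `to_absolute`, `extract_companies`, `format_table`, `main`) are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://clutch.co"
LISTING_URL = "https://clutch.co/agencies/digital-marketing"


def fetch_html(url):
    """Fetch a page through cloudscraper instead of plain requests."""
    import cloudscraper  # third-party: pip install cloudscraper

    scraper = cloudscraper.create_scraper()  # behaves like a requests.Session
    response = scraper.get(url)
    response.raise_for_status()  # stop on a 403 instead of parsing a block page
    return response.text


def to_absolute(href):
    """Turn a site-relative link such as /profile/webfx into a full URL."""
    return urljoin(BASE_URL, href)


def extract_companies(html):
    """Return (company name, profile URL) pairs found in the listing HTML."""
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4 lxml

    soup = BeautifulSoup(html, "lxml")
    return [
        (anchor.get_text(strip=True), to_absolute(anchor.get("href", "")))
        for anchor in soup.select("h3.company_info > a")  # selector from the question
    ]


def format_table(rows):
    """Render the pairs as a pipe-delimited table with padded columns."""
    headers = ("Company", "URL")
    table = [headers, *rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]

    def line(cells):
        padded = (cell.ljust(width) for cell, width in zip(cells, widths))
        return "| " + " | ".join(padded) + " |"

    separator = "|" + "|".join("-" * (width + 2) for width in widths) + "|"
    return "\n".join([line(headers), separator, *(line(row) for row in rows)])


def main():
    """Fetch the listing page and print the company table (needs network access)."""
    print(format_table(extract_companies(fetch_html(LISTING_URL))))
```

Calling `main()` performs the actual request. Fetching, parsing, and formatting are kept in separate functions so that the parsing and formatting can be tried on saved HTML without hitting the site.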

This should print:

| Company                          | URL                                                        |
|----------------------------------|------------------------------------------------------------|
| WebFX                            | https://clutch.co/profile/webfx                            |
| Ignite Visibility                | https://clutch.co/profile/ignite-visibility                |
| SocialSEO                        | https://clutch.co/profile/socialseo                        |
| Lilo Social                      | https://clutch.co/profile/lilo-social                      |
| Favoured                         | https://clutch.co/profile/favoured                         |
| Power Digital                    | https://clutch.co/profile/power-digital                    |
| Belkins                          | https://clutch.co/profile/belkins                          |
| SmartSites                       | https://clutch.co/profile/smartsites                       |
| Straight North                   | https://clutch.co/profile/straight-north                   |
| Victorious                       | https://clutch.co/profile/victorious                       |
| Uplers                           | https://clutch.co/profile/uplers                           |
| Daniel Brian Advertising         | https://clutch.co/profile/daniel-brian-advertising         |
| Thrive Internet Marketing Agency | https://clutch.co/profile/thrive-internet-marketing-agency |
| Big Leap                         | https://clutch.co/profile/big-leap                         |
| Mad Fish Digital                 | https://clutch.co/profile/mad-fish-digital                 |
| Razor Rank                       | https://clutch.co/profile/razor-rank                       |
| Brolik                           | https://clutch.co/profile/brolik                           |
| Search Berg                      | https://clutch.co/profile/search-berg                      |
| Socialfix Media                  | https://clutch.co/profile/socialfix-media                  |
| Kanbar Digital, LLC              | https://clutch.co/profile/kanbar-digital                   |
| NextLeft                         | https://clutch.co/profile/nextleft                         |
| Fruition                         | https://clutch.co/profile/fruition                         |
| Impactable                       | https://clutch.co/profile/impactable                       |
| Lets Tok                         | https://clutch.co/profile/lets-tok                         |
| Pyxl                             | https://clutch.co/profile/pyxl                             |
| Sagefrog Marketing Group         | https://clutch.co/profile/sagefrog-marketing-group         |
| Foreignerds INC.                 | https://clutch.co/profile/foreignerds                      |
| Social Driver                    | https://clutch.co/profile/social-driver                    |
| 3 Media Web                      | https://clutch.co/profile/3-media-web                      |
| Brand Vision                     | https://clutch.co/profile/brand-vision-1                   |

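  • This might be related to your location and the VPN server you're using. My guess is that you've been scraping a lot and your IP may already be blacklisted. I've tried the script from several VPN locations (Germany, France, the UK, and Sweden) and it worked every time. Try some proxies or use a decent VPN service. (5 upvotes)
  • This works. It works from Japan. From the Netherlands. From the US. It works through several TOR exit nodes. From good and bad VPNs. By day and by night, with eyes closed and open. This answer should get the bounty. (2 upvotes)
  • @MITHU Did you manage to retest the code? (2 upvotes)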