urllib.error.HTTPError: HTTP Error 429: Too Many Requests in search() of googlesearch package of Python

shu*_*ham 6 google-search python-3.x selenium-chromedriver

I am running a query through search() from Python's googlesearch package, which returns links from a Google search as a list:

search(query, tld='com', lang='en', num=20, start=0, stop=None, pause=2.0):

I do get results, but after a while it raises this error:

for i in search(query, tld='com', lang='en', num=20, start=0, stop=None, pause=2.0):
  File "E:\crawling\venv\lib\site-packages\googlesearch\__init__.py", line 312, in search
    html = get_page(url, user_agent)
  File "E:\crawling\venv\lib\site-packages\googlesearch\__init__.py", line 176, in get_page
    response = urlopen(request)
  File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 563, in error
    result = self._call_chain(*args)
  File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 755, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 429: Too Many Requests

I have also increased the pause parameter in search(), but that did not help.
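One common workaround (before switching to a different API) is to catch the 429 and retry with exponential backoff instead of a fixed pause. The sketch below is a generic, hypothetical helper — it is not part of the googlesearch package — that wraps any callable and backs off when Google rate-limits:

```python
import random
import time
from urllib.error import HTTPError

def with_backoff(fn, max_retries=5, base_delay=10.0):
    """Call fn(); on HTTP 429, sleep with exponential backoff and retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except HTTPError as e:
            if e.code != 429:          # only retry on rate limiting
                raise
            # wait base_delay, 2*base_delay, 4*base_delay, ... plus jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_retries} retries")
```

Usage would be something like `links = with_backoff(lambda: list(search(query, tld='com', lang='en', num=20, stop=20, pause=10.0)))`. Note that if Google has already flagged your IP, backing off alone may not be enough — the block can persist for hours.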

小智 0

The googlesearch package scrapes Google's result pages rather than using an official API, and Google rate-limits such automated requests — that is where the HTTP 429 comes from. (Google's official Web Search API was deprecated long ago, so there is no supported first-party alternative.)


However, the third-party service SerpApi offers a Google Search Engine Results API as an alternative. It is a paid API with a free plan.


It handles blocks from Google and other search engines (including CAPTCHAs) for you, so there is no need to build and maintain your own parser.


It supports pagination, so data can be extracted from all Google search result pages. This is done in a while True loop, where the pagination parameters are updated on each iteration until an exit condition breaks the loop:

while True:
    results = search.get_dict()     # JSON -> Python dictionary

    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "page_num": page_num,
            "title": result.get("title"),
            "link": result.get("link"),
            "displayed_link": result.get("displayed_link"),
        })

    if "next_link" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

The pagination condition works like this: as long as a next page exists, the search parameters are updated to point at it and the next request fetches that page. As soon as there is no next page, the loop stops:

if "next_link" in results.get("serpapi_pagination", {}):
    # update the search parameters for the next page
    search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
else:
    break

Full pagination code, with an example in an online IDE:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
    "api_key": os.getenv("API_KEY"),  # serpapi key
    "engine": "google",               # serpapi parser engine
    "q": "google",                    # search query
    "num": "100"                      # number of results per page (100 in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)         # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()       # JSON -> Python dictionary

    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "page_num": page_num,
            "title": result.get("title"),
            "link": result.get("link"),
            "displayed_link": result.get("displayed_link"),
        })

    if "next_link" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "page_num": 1,
    "title": "Google",
    "link": "https://www.google.com/",
    "displayed_link": "https://www.google.com"
  },
  {
    "page_num": 1,
    "title": "Google - About Google, Our Culture & Company News",
    "link": "https://about.google/",
    "displayed_link": "https://about.google"
  },
  {
    "page_num": 1,
    "title": "The Keyword | Google",
    "link": "https://blog.google/",
    "displayed_link": "https://blog.google"
  },
  {
    "page_num": 1,
    "title": "Google - Twitter",
    "link": "https://twitter.com/google",
    "displayed_link": "https://twitter.com › google"
  },
  {
    "page_num": 1,
    "title": "Google - Home | Facebook",
    "link": "https://www.facebook.com/Google/",
    "displayed_link": "https://www.facebook.com › Google"
  },
  ...
]

Disclaimer: I work for SerpApi.
