shu*_*ham 6 google-search python-3.x selenium-chromedriver
Actually, I am running a query with search() from the GoogleSearch package in Python, which returns multiple links from a Google search as a list:
search(query, tld='com', lang='en', num=20, start=0, stop=None, pause=2.0):
I do get results, but after a while it throws this error:
for i in search(query, tld='com', lang='en', num=20, start=0, stop=None, pause=2.0):
File "E:\crawling\venv\lib\site-packages\googlesearch\__init__.py", line 312, in search
html = get_page(url, user_agent)
File "E:\crawling\venv\lib\site-packages\googlesearch\__init__.py", line 176, in get_page
response = urlopen(request)
File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 563, in error
result = self._call_chain(*args)
File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 755, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\shubh\Anaconda3\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 429: Too Many Requests
I also increased the pause parameter in the search call, but that did not help.
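Aside from raising pause, a common mitigation for HTTP 429 is to back off and retry with increasing delays rather than pausing a fixed amount. Below is a minimal, self-contained sketch of exponential backoff; the with_backoff helper and the fake_fetch stub are hypothetical (not part of the googlesearch package), and in practice you would pass a closure that calls search():

```python
import time
import urllib.error

def with_backoff(fetch, max_retries=5, base_delay=2.0):
    """Call fetch(); on HTTP 429, sleep and retry with doubling delays."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except urllib.error.HTTPError as e:
            # Re-raise anything that is not 429, or a 429 on the final attempt.
            if e.code != 429 or attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demo stub: fails with 429 twice, then succeeds
# (stands in for a real call such as lambda: list(search(query, ...))).
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.HTTPError("http://example.com", 429,
                                     "Too Many Requests", {}, None)
    return ["result1", "result2"]

results = with_backoff(fake_fetch, base_delay=0.01)
print(results)  # ['result1', 'result2']
```

Note that even aggressive backoff cannot guarantee results: Google may keep rate-limiting or block the IP entirely, which is the problem the answer below addresses.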
小智 0
It looks like you are using the official Google Search API, which has been officially deprecated.
However, the third-party API SerpApi offers a Google Search Engine Results API as an alternative. It is a paid API with a free plan.
It bypasses blocks from Google and other search engines (including CAPTCHAs), so there is no need to build and maintain your own parser.
It has a pagination feature that extracts data from every Google search results page. It is used in a while True loop, where the pagination parameters determine the condition for exiting the loop:
while True:
    results = search.get_dict()  # JSON -> Python dictionary

    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "page_num": page_num,
            "title": result.get("title"),
            "link": result.get("link"),
            "displayed_link": result.get("displayed_link"),
        })

    if "next_link" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

The pagination condition works as follows: if a next page exists, the search parameters are updated to point to it and the next request fetches that page, so the loop continues for as long as there is a next page. Once there is no next page, the loop stops:
    if "next_link" in results.get("serpapi_pagination", []):
        # update the search parameters for the next page
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

Full pagination code and an example (also runnable in an online IDE):
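For reference, the next-page update relies only on the standard library: urlsplit() isolates the query string of next_link, and parse_qsl() turns it into key/value pairs that can overwrite the previous request's parameters. A small sketch with a made-up next_link URL (the parameter values here are illustrative, not real SerpApi output):

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical next_link, shaped like what serpapi_pagination.next_link returns.
next_link = "https://serpapi.com/search.json?engine=google&q=google&num=100&start=100"

# Split off the query string and parse it into a dict of parameters,
# ready to be merged into the next request's params.
params = dict(parse_qsl(urlsplit(next_link).query))
print(params)  # {'engine': 'google', 'q': 'google', 'num': '100', 'start': '100'}
```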
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
    "api_key": os.getenv("API_KEY"),  # serpapi key
    "engine": "google",               # serpapi parser engine
    "q": "google",                    # search query
    "num": "100"                      # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary

    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "page_num": page_num,
            "title": result.get("title"),
            "link": result.get("link"),
            "displayed_link": result.get("displayed_link"),
        })

    if "next_link" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

Example output:
[
  {
    "page_num": 1,
    "title": "Google",
    "link": "https://www.google.com/",
    "displayed_link": "https://www.google.com"
  },
  {
    "page_num": 1,
    "title": "Google - About Google, Our Culture & Company News",
    "link": "https://about.google/",
    "displayed_link": "https://about.google"
  },
  {
    "page_num": 1,
    "title": "The Keyword | Google",
    "link": "https://blog.google/",
    "displayed_link": "https://blog.google"
  },
  {
    "page_num": 1,
    "title": "Google - Twitter",
    "link": "https://twitter.com/google",
    "displayed_link": "https://twitter.com › google"
  },
  {
    "page_num": 1,
    "title": "Google - Home | Facebook",
    "link": "https://www.facebook.com/Google/",
    "displayed_link": "https://www.facebook.com › Google"
  },
  ...
]

Disclaimer: I work for SerpApi.
Viewed: 517 times