Extracting data/links from a Google search using Beautiful Soup

2 html javascript python beautifulsoup

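Good evening, everyone,

I am trying to ask Google a question and pull all the relevant links from its search results (i.e. I search "site:wikipedia.com Thomas Jefferson" and it gives me wiki.com/jeff, wiki.com/tom, etc.).

Here is my code: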

from bs4 import BeautifulSoup
from urllib2 import urlopen

query = 'Thomas Jefferson'

query.replace (" ", "+")
#replaces whitespace with a plus sign for Google compatibility purposes

soup = BeautifulSoup(urlopen("https://www.google.com/?gws_rd=ssl#q=site:wikipedia.com+" + query), "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only wikipedia
#links show up. Uses html parser.

for item in soup.find_all('h3', attrs={'class' : 'r'}):
    print item.string
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results

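The goal here is for me to set the query variable, have Python query Google, and have Beautiful Soup pull out all the "green" links, if you will.

Here is a picture of a Google results page

I only want to pull the green links, in their entirety. What is odd is that Google's source code is "hidden" (a symptom of its search architecture), so Beautiful Soup cannot simply pull an href from an h3 tag. I can see the h3 hrefs when I inspect the element, but not when I view the page source.

Here is a picture of the inspected element

My question is: how do I pull the top 5 most relevant green links from Google via BeautifulSoup if I cannot access the page source, only Inspect Element?

PS: To give an idea of what I am trying to accomplish, I found two Stack Overflow questions that are relatively close to mine:

Beautiful Soup extract href from Google search

How to collect data from a Google search with Beautiful Soup using Python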

wpe*_*rcy 5

When I tried searching with JavaScript disabled, I got a different URL from the one Rob M. got:

https://www.google.com/search?q=site:wikipedia.com+Thomas+Jefferson&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw
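As a side note, the difference between this URL and the one in the question can be seen by splitting both into their components. The sketch below uses Python 3's `urllib.parse` (the code in this thread is Python 2, so the import path is an assumption about your environment). It shows that in the question's URL the search terms sit after the `#`, in the fragment, which is normally handled by the browser and not sent to the server, whereas here `q` is a real query parameter on the `/search` path.

```python
from urllib.parse import urlsplit, parse_qs

# URL used in the question: the search terms come after '#'.
question_url = "https://www.google.com/?gws_rd=ssl#q=site:wikipedia.com+Thomas+Jefferson"
# URL from this answer: the search terms are in the query string.
answer_url = ("https://www.google.com/search?q=site:wikipedia.com+Thomas+Jefferson"
              "&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw")

q_parts = urlsplit(question_url)
a_parts = urlsplit(answer_url)

# The question's URL carries the search only in its fragment.
print("question query:   ", q_parts.query)
print("question fragment:", q_parts.fragment)

# The answer's URL carries the search as the 'q' query parameter.
answer_params = parse_qs(a_parts.query)
print("answer path:      ", a_parts.path)
print("answer q:         ", answer_params["q"][0])
```

This is only an illustration of the URL structure; it does not contact Google.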

To make this work for any query, you should first make sure your query has no spaces in it (that is why you get a 400: Bad Request). You can do this with urllib.quote_plus():

import urllib

query = "Thomas Jefferson"
query = urllib.quote_plus(query)  # Python 2; encodes spaces as '+'

This encodes all the spaces as plus signs, creating a valid URL.
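For readers on Python 3, where `quote_plus` lives in `urllib.parse` rather than directly on `urllib`, a minimal sketch of the same encoding step might look like the following. The `urlencode` variant is an alternative not used elsewhere in this answer, and the `sei` parameter from the URL above is left out here on the assumption that it is not required.

```python
from urllib.parse import quote_plus, urlencode

query = "Thomas Jefferson"

# quote_plus replaces spaces with '+' and percent-encodes reserved characters.
encoded = quote_plus(query)

# Build the search URL in the same shape as the one shown above.
url = "https://www.google.com/search?q=site:wikipedia.com+{}&gbv=1".format(encoded)

# Alternative: let urlencode build the whole query string from a dict.
# It applies quote_plus to every value, so ':' is escaped as well.
alt_query = urlencode({"q": "site:wikipedia.com " + query, "gbv": "1"})

print(encoded)
print(url)
print(alt_query)
```

Either form avoids raw spaces in the request URL; which one reads better is a matter of taste.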

However, this does not work with urllib; you get a 403: Forbidden. I got it to work by using the python-requests module, like this:

import requests
import urllib
from bs4 import BeautifulSoup

query = 'Thomas Jefferson'
query = urllib.quote_plus(query)

r = requests.get('https://www.google.com/search?q=site:wikipedia.com+{}&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw'.format(query))
soup = BeautifulSoup(r.text, "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only wikipedia
#links show up. Uses html parser.

links = []
for item in soup.find_all('h3', attrs={'class' : 'r'}):
    links.append(item.a['href'][7:]) # [7:] strips the /url?q= prefix
#Finds each h3 with class "r" (the result titles) and collects the href of the link inside it
#To keep only the top 5 results, slice the list: links[:5]

Printing links gives:

print links
#  [u'http://en.wikipedia.com/wiki/Thomas_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggUMAA&usg=AFQjCNG6INz_xj_-p7mpoirb4UqyfGxdWA',
#   u'http://www.wikipedia.com/wiki/Jefferson%25E2%2580%2593Hemings_controversy&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggeMAE&usg=AFQjCNEjCPY-HCdfHoIa60s2DwBU1ffSPg',
#   u'http://en.wikipedia.com/wiki/Sally_Hemings&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggjMAI&usg=AFQjCNGxy4i7AFsup0yPzw9xQq-wD9mtCw',
#   u'http://en.wikipedia.com/wiki/Monticello&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggoMAM&usg=AFQjCNE4YlDpcIUqJRGghuSC43TkG-917g',
#   u'http://en.wikipedia.com/wiki/Thomas_Jefferson_University&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggtMAQ&usg=AFQjCNEDuLjZwImk1G1OnNEnRhtJMvr44g',
#   u'http://www.wikipedia.com/wiki/Jane_Randolph_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggyMAU&usg=AFQjCNHmXJMI0k4Bf6j3b7QdJffKk97tAw',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1800&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg3MAY&usg=AFQjCNEqsc9jDsDetf0reFep9L9CnlorBA',
#   u'http://en.wikipedia.com/wiki/Isaac_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg8MAc&usg=AFQjCNHKAAgylhRjxbxEva5IvDA_UnVrTQ',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1796&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghBMAg&usg=AFQjCNHviErFQEKbDlcnDZrqmxGuiBG9XA',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1804&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghGMAk&usg=AFQjCNEJZSxCuXE_Dzm_kw3U7hYkH7OtlQ']
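The links above still carry Google's trailing `&sa=...&ved=...&usg=...` parameters, and the question asks for only the top 5. One possible way to handle both, sketched here in Python 3 with the standard library, is to parse each href as a URL and read its `q` parameter instead of slicing with `[7:]`. The helper name `clean_google_href` is hypothetical, and the sample hrefs are two of the results above with the `/url?q=` prefix restored.

```python
from urllib.parse import urlsplit, parse_qs

def clean_google_href(href):
    """Return the target URL from a '/url?q=...' href, or None if there is no q."""
    # parse_qs splits the query string on '&' and percent-decodes each value,
    # so the 'q' value comes back without the sa/ved/usg parameters.
    values = parse_qs(urlsplit(href).query).get("q")
    return values[0] if values else None

# Two hrefs in the shape implied by the printed output above.
hrefs = [
    "/url?q=http://en.wikipedia.com/wiki/Thomas_Jefferson&sa=U"
    "&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggUMAA"
    "&usg=AFQjCNG6INz_xj_-p7mpoirb4UqyfGxdWA",
    "/url?q=http://www.wikipedia.com/wiki/Jefferson%25E2%2580%2593Hemings_controversy&sa=U"
    "&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggeMAE"
    "&usg=AFQjCNEjCPY-HCdfHoIa60s2DwBU1ffSPg",
]

# Clean every href, drop any without a target, and keep at most five.
cleaned = [clean_google_href(h) for h in hrefs]
top_five = [u for u in cleaned if u][:5]

for link in top_five:
    print(link)
```

In the loop from the answer, this would amount to calling the helper on `item.a['href']` for each result. Google's markup changes over time, so the `h3` class and the href shape should be treated as assumptions to re-check.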