Tags: python, beautifulsoup, google-scholar
After running a typical keyword search in Google Scholar (see screenshot), I would like to end up with a dictionary containing the title and URL of each publication that appears on the page, e.g. results = {'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells', 'url': 'https://www.nature.com/articles/338427a0'}.
To retrieve the results page from Google Scholar, I use the following code:
from urllib import FancyURLopener, quote_plus
from bs4 import BeautifulSoup
class AppURLOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'
openurl = AppURLOpener().open
query = "Vicia faba"
url = 'https://scholar.google.com/scholar?q=' + quote_plus(query) + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
#print url
content = openurl(url).read()
page = BeautifulSoup(content, 'lxml')
print page
This code correctly returns the results page as (very ugly) HTML. However, I can't get any further, because I can't figure out how to use BeautifulSoup (which I am not very familiar with) to parse the results page and extract the data.
Note that the problem lies in parsing and extracting data from the results page, not in Google Scholar itself, since the code above retrieves the results page correctly.
Can anyone give me some hints? Thanks in advance!
Inspecting the page content shows that the search results are wrapped in h3 tags with the attribute class="gs_rt". You can use BeautifulSoup to extract just those tags, then get the title and URL from the <a> tag inside each entry. Write each title/URL pair into a dictionary and collect them in a list of dictionaries:
import requests
from bs4 import BeautifulSoup

query = "Vicia%20faba"
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'

content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')

# Each result's title and link live in the <a> tag inside <h3 class="gs_rt">
results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})
Output:
[{'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells',
'url': 'https://www.nature.com/articles/338427a0'},
{'title': 'Hydrogen peroxide is involved in abscisic acid-induced stomatal closure in Vicia faba',
'url': 'http://www.plantphysiol.org/content/126/4/1438.short'},
...]
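Note that, depending on the query, some results (for example [CITATION]-style entries) may not contain a link inside the h3 tag; in that case entry.a is None and the loop above would raise an AttributeError. A small guard skips such entries (a minimal sketch, assuming you only want results that actually have a URL):

results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    link = entry.find("a")  # may be None for results without a link
    if link is not None and link.has_attr("href"):
        results.append({"title": link.text, "url": link["href"]})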
Note: I used requests instead of urllib, because my urllib wouldn't load FancyURLopener. However, the BeautifulSoup syntax should be the same regardless of how you obtain the page content.
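If you would rather stay with the standard library, the same parsing should also work on top of Python 3's urllib.request (an untested sketch, assuming the deprecated FancyURLopener subclass is replaced by a Request carrying a browser-like User-Agent header):

from urllib.request import Request, urlopen
from urllib.parse import quote_plus
from bs4 import BeautifulSoup

query = "Vicia faba"
url = ('https://scholar.google.com/scholar?q=' + quote_plus(query)
       + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search')

# Mimic a regular browser, as the FancyURLopener subclass in the question did
req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) '
                                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                                          'Chrome/33.0.1750.152 Safari/537.36'})
content = urlopen(req).read()
page = BeautifulSoup(content, 'lxml')

results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})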