Tags: python, beautifulsoup, google-scholar
After running a typical keyword search in Google Scholar (see screenshot), I would like to end up with a dictionary containing the title and URL of each publication that appears on the page, e.g. results = {'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells', 'url': 'https://www.nature.com/articles/338427a0'}.
To retrieve the results page from Google Scholar, I use the following code:
from urllib import FancyURLopener, quote_plus
from bs4 import BeautifulSoup
class AppURLOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'
openurl = AppURLOpener().open
query = "Vicia faba"
url = 'https://scholar.google.com/scholar?q=' + quote_plus(query) + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
#print url
content = openurl(url).read()
page = BeautifulSoup(content, 'lxml')
print page
This code correctly returns the results page as (very ugly) HTML. However, I can't get any further, because I can't figure out how to use BeautifulSoup (which I am not very familiar with) to parse the results page and extract the data.
Note that the problem lies in parsing and extracting data from the results page, not in Google Scholar itself, since the code above retrieves the results page correctly.
Can anyone give me some hints? Thanks in advance!
Inspecting the page content shows that the search results are wrapped in h3 tags with the attribute class="gs_rt". You can use BeautifulSoup to extract just those tags, then get the title and URL from the <a> tag inside each entry. Write each title/URL pair into a dictionary and collect them in a list of dictionaries:
import requests
from bs4 import BeautifulSoup

query = "Vicia%20faba"
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'

content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')

# Each result's title and link live in the <a> tag inside <h3 class="gs_rt">
results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})
Output:
[{'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells',
'url': 'https://www.nature.com/articles/338427a0'},
{'title': 'Hydrogen peroxide is involved in abscisic acid-induced stomatal closure in Vicia faba',
'url': 'http://www.plantphysiol.org/content/126/4/1438.short'},
...]
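Note that, depending on the query, some results (for example [CITATION]-style entries) may not contain a link inside the h3 tag; in that case entry.a is None and the loop above would raise an AttributeError. A small guard skips such entries (a minimal sketch, assuming you only want results that actually have a URL):

results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    link = entry.find("a")  # may be None for results without a link
    if link is not None and link.has_attr("href"):
        results.append({"title": link.text, "url": link["href"]})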
Note: I used requests instead of urllib, because my urllib wouldn't load FancyURLopener. However, the BeautifulSoup syntax should be the same regardless of how you obtain the page content.
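If you would rather stay with the standard library, the same parsing should also work on top of Python 3's urllib.request (an untested sketch, assuming the deprecated FancyURLopener subclass is replaced by a Request carrying a browser-like User-Agent header):

from urllib.request import Request, urlopen
from urllib.parse import quote_plus
from bs4 import BeautifulSoup

query = "Vicia faba"
url = ('https://scholar.google.com/scholar?q=' + quote_plus(query)
       + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search')

# Mimic a regular browser, as the FancyURLopener subclass in the question did
req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) '
                                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                                          'Chrome/33.0.1750.152 Safari/537.36'})
content = urlopen(req).read()
page = BeautifulSoup(content, 'lxml')

results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})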