Flo*_*low 11 python r google-scholar
I'd like to use python to scrape google scholar search results. I found two different scripts to do that, one is gscholar.py and the other is scholar.py (can that one be used as a python library?).
Now, I should maybe say that I'm totally new to python, so sorry if I miss the obvious!
The problem is when I use gscholar.py as explained in the README file, I get as a result
query() takes at least 2 arguments (1 given).
Even when I specify another argument (e.g. gscholar.query("my query", allresults=True)), I get
query() takes at least 2 arguments (2 given).
This puzzles me. I also tried to specify the third possible argument (outformat=4, which is the BibTeX format), but this gives me a traceback of function errors. A colleague advised me to import BeautifulSoup before running the query, but that doesn't change the problem either. Any suggestions how to solve the problem?
I found R code (see the link) as a solution, but it quickly got blocked by Google. Maybe someone can suggest how to improve that code to avoid being blocked? Any help would be appreciated! Thanks!
Jul*_*lia 13
I suggest you not use specific libraries for scraping specific websites, but instead use a general-purpose HTML library that is well tested and has well-written documentation, such as BeautifulSoup.
To access the website with browser-like request headers, you can use a URL opener class with a custom user agent:
# Python 2; on Python 3 the class lives in urllib.request instead
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    # Report a desktop Chrome user agent instead of the default Python one
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

openurl = MyOpener().open
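(FancyURLopener is a legacy Python 2 API and is deprecated in Python 3. A roughly equivalent sketch of the same idea with the modern urllib.request API, assuming you only need GET requests:)

```python
from urllib.request import Request, urlopen

# Browser-like user agent string, as in the FancyURLopener example above
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/33.0.1750.152 Safari/537.36')

def openurl(url):
    # Attach the User-Agent header to each request; returns a file-like
    # response object, like FancyURLopener's open() does.
    return urlopen(Request(url, headers={'User-Agent': USER_AGENT}))
```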
Then download the desired URLs as follows:
openurl(url).read()
To retrieve scholar results, just use the URL http://scholar.google.se/scholar?hl=en&q=${query}.
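When substituting your search terms into the ${query} slot, remember to URL-encode them first; a minimal sketch (the helper name is mine, not from any library):

```python
from urllib.parse import quote_plus

def scholar_url(query):
    # Fill the ${query} slot of the search URL, URL-encoding the terms
    # (spaces become '+', special characters are percent-escaped).
    return 'http://scholar.google.se/scholar?hl=en&q=' + quote_plus(query)
```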
To extract information from the retrieved HTML file, you can use code like this:
from bs4 import SoupStrainer, BeautifulSoup

# Parse only the div with id 'gs_ab_md' rather than the whole page
page = BeautifulSoup(openurl(url).read(), 'html.parser',
                     parse_only=SoupStrainer('div', id='gs_ab_md'))
This code extracts the specific div element that contains the number of results shown on a Google Scholar search results page.
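The text inside that div typically reads something like "About 1,230 results (0.04 sec)", though the exact wording is an assumption here and may vary by locale or change over time. A hedged sketch of pulling the count out of such text:

```python
import re

def result_count(text):
    # Find the first (possibly comma-grouped) number followed by
    # "result"/"results" in text like "About 1,230 results (0.04 sec)".
    # Returns None if no such phrase is present.
    m = re.search(r'([\d,]+)\s+results?', text)
    return int(m.group(1).replace(',', '')) if m else None
```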
Google will block you ... because it is obvious you are not a browser. That is, they detect the same request signature occurring too frequently to be human activity ....
What you can do: