Flo*_*low 11 python r google-scholar
I'd like to use python to scrape google scholar search results. I found two different scripts to do that, one is gscholar.py and the other is scholar.py (can that one be used as a python library?).
Now, I should maybe say that I'm totally new to python, so sorry if I miss the obvious!
The problem is when I use gscholar.py as explained in the README file, I get as a result
query() takes at least 2 arguments (1 given).
Even when I specify another argument (e.g. gscholar.query("my query", allresults=True)), I get
query() takes at least 2 arguments (2 given).
This puzzles me. I also tried to specify the third possible argument (outformat=4, which is the BibTeX format), but this gives me a traceback of function errors. A colleague advised me to import BeautifulSoup before running the query, but that doesn't change the problem either. Any suggestions how to solve the problem?
I found R code (see the link) as a solution, but it quickly got blocked by Google. Maybe someone can suggest how to improve that code to avoid being blocked? Any help would be appreciated! Thanks!
Jul*_*lia 13
I suggest you not use specific libraries for scraping specific websites, but instead use a general-purpose HTML library that is well tested and has well-written documentation, such as BeautifulSoup.
To access the website with browser-like request headers, you can use a URL opener class with a custom user agent:
# Python 2; on Python 3 the class lives in urllib.request instead
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    # Report a desktop Chrome user agent instead of the default Python one
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

openurl = MyOpener().open
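(FancyURLopener is a legacy Python 2 API and is deprecated in Python 3. A roughly equivalent sketch of the same idea with the modern urllib.request API, assuming you only need GET requests:)

```python
from urllib.request import Request, urlopen

# Browser-like user agent string, as in the FancyURLopener example above
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/33.0.1750.152 Safari/537.36')

def openurl(url):
    # Attach the User-Agent header to each request; returns a file-like
    # response object, like FancyURLopener's open() does.
    return urlopen(Request(url, headers={'User-Agent': USER_AGENT}))
```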
Then download the desired URLs as follows:
openurl(url).read()
To retrieve scholar results, just use the URL http://scholar.google.se/scholar?hl=en&q=${query}.
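When substituting your search terms into the ${query} slot, remember to URL-encode them first; a minimal sketch (the helper name is mine, not from any library):

```python
from urllib.parse import quote_plus

def scholar_url(query):
    # Fill the ${query} slot of the search URL, URL-encoding the terms
    # (spaces become '+', special characters are percent-escaped).
    return 'http://scholar.google.se/scholar?hl=en&q=' + quote_plus(query)
```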
To extract information from the retrieved HTML file, you can use code like this:
from bs4 import SoupStrainer, BeautifulSoup

# Parse only the div with id 'gs_ab_md' rather than the whole page
page = BeautifulSoup(openurl(url).read(), 'html.parser',
                     parse_only=SoupStrainer('div', id='gs_ab_md'))
This code extracts the specific div element that contains the number of results shown on a Google Scholar search results page.
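The text inside that div typically reads something like "About 1,230 results (0.04 sec)", though the exact wording is an assumption here and may vary by locale or change over time. A hedged sketch of pulling the count out of such text:

```python
import re

def result_count(text):
    # Find the first (possibly comma-grouped) number followed by
    # "result"/"results" in text like "About 1,230 results (0.04 sec)".
    # Returns None if no such phrase is present.
    m = re.search(r'([\d,]+)\s+results?', text)
    return int(m.group(1).replace(',', '')) if m else None
```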
Google will block you ... because it is obvious you are not a browser. That is, they detect the same request signature occurring too frequently to be human activity ....
What you can do: