hat*_*rix 6 forms search screen-scraping doi
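I would like to know whether it is possible to "automate" the task of typing entries into search forms and extracting matches from the results. For instance, I have a list of journal articles for which I would like to get DOIs (digital object identifiers); manually, I would go to the journal's article search page (e.g., http://pubs.acs.org/search/advanced), type in the authors/title/volume (etc.), find the article in the returned list of results, and then pick out the DOI and paste it into my reference list. I use R and Python for data analysis regularly (I was inspired by a post on RCurl) but don't know much about web protocols... Is such a thing possible (for instance, using something like Python's BeautifulSoup)? Are there any good references for doing anything remotely similar to this task? I'm just as interested in learning about web scraping and web-scraping tools in general as in getting this particular task done... Thanks for your time!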
Beautiful Soup is great for parsing web pages; that's half of what you want to do. Python, Perl, and Ruby all have a version of Mechanize, and that's the other half:
http://wwwsearch.sourceforge.net/mechanize/
Mechanize lets you control a browser:
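import mechanize

browser = mechanize.Browser()
browser.open("http://pubs.acs.org/search/advanced")  # the search page from the question

# Follow a link (link_node is a link object, e.g. one yielded by browser.links())
browser.follow_link(link_node)

# Submit a form (the form name and field names below are placeholders)
browser.select_form(name="search")
browser["authors"] = ["author #1", "author #2"]
browser["volume"] = "any"
search_response = browser.submit()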
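For readers who, like the asker, are new to web protocols: submitting a search form usually amounts to sending the field names and values to the server as an HTTP request. The sketch below shows how a GET form would encode them into a URL, using only the standard library. The field names `authors` and `volume` are the same placeholders as above, and whether the ACS search page actually accepts a GET request with these names is an assumption; inspect the real form before relying on it.

```python
from urllib.parse import urlencode


def build_search_url(base_url, fields):
    """Encode form fields as a query string, as a browser does for a GET form."""
    # doseq=True turns a list value into one key=value pair per item.
    return base_url + "?" + urlencode(fields, doseq=True)


# Hypothetical field names; the real form may use different ones.
url = build_search_url(
    "http://pubs.acs.org/search/advanced",
    {"authors": ["author #1", "author #2"], "volume": "any"},
)
print(url)
```

Mechanize does this encoding for you (and handles POST forms, cookies, and redirects as well), which is why it is the more convenient choice for real sites.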
With Mechanize and Beautiful Soup you'll be off to a good start. One other tool I'd consider is Firebug, as used in this quick Ruby scraping guide:
http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/
Firebug can speed up building the XPath expressions you need to parse a document, saving you a lot of time.
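As a rough illustration of the parsing half, here is a minimal sketch that applies XPath-style expressions to a results page and pulls out DOIs with a regular expression. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and requires well-formed markup; real search results usually need a tolerant parser such as Beautiful Soup. The markup, the class names, and the DOIs are all made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed results fragment with made-up titles and DOIs.
RESULTS = """
<html><body>
  <div class="result">
    <span class="title">An example article</span>
    <a href="https://doi.org/10.1021/ex0000001">DOI</a>
  </div>
  <div class="result">
    <span class="title">Another example article</span>
    <a href="https://doi.org/10.1021/ex0000002">DOI</a>
  </div>
</body></html>
"""

# A DOI starts with "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def extract_dois(markup):
    """Return (title, doi) pairs for each result block in the markup."""
    root = ET.fromstring(markup.strip())
    pairs = []
    for block in root.findall(".//div[@class='result']"):
        title = block.findtext("span[@class='title']", default="").strip()
        link = block.find("a[@href]")
        match = DOI_PATTERN.search(link.get("href")) if link is not None else None
        if match:
            pairs.append((title, match.group(0)))
    return pairs


for title, doi in extract_dois(RESULTS):
    print(title, doi)
```

On a real page, the expressions passed to `findall` and `find` are what a tool like Firebug helps you work out.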
Good luck!
Anonymous 5
Python code for a search form:
# Imports
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait  # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC  # available since 2.26.0

# Create a new instance of the Firefox driver
driver = webdriver.Firefox()

# Go to the Google home page
driver.get("http://www.google.com")

# The page is ajaxy, so the title is originally this:
print(driver.title)

# Find the element whose name attribute is q (the Google search box).
# find_element(By.NAME, ...) replaces find_element_by_name, which Selenium 4 removed.
inputElement = driver.find_element(By.NAME, "q")

# Type in the search
inputElement.send_keys("cheese!")

# Submit the form (although Google now searches automatically without submitting)
inputElement.submit()

try:
    # We have to wait for the page to refresh; the last thing that seems
    # to be updated is the title. A TimeoutException is raised after 10 s.
    WebDriverWait(driver, 10).until(EC.title_contains("cheese!"))

    # You should see "cheese! - Google Search"
    print(driver.title)
finally:
    # Always close the browser, even if the wait timed out
    driver.quit()
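The explicit wait is the part of this example that tends to confuse newcomers, so here is a simplified sketch of the idea behind it: poll a condition until it holds or a deadline passes. This is not Selenium's implementation; `wait_until` and the fake page titles are hypothetical stand-ins that let the sketch run without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Stand-in for a page whose title only changes after a couple of polls.
titles = iter(["Google", "Google", "cheese! - Google Search"])
state = {"title": ""}


def title_contains_cheese():
    state["title"] = next(titles, state["title"])
    return "cheese!" in state["title"]


wait_until(title_contains_cheese, timeout=2.0, poll_interval=0.01)
print(state["title"])
```

Polling like this is generally preferable to a fixed `time.sleep`, because the script continues as soon as the page is ready and fails loudly if it never is.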
Source: https://www.seleniumhq.org/docs/03_webdriver.jsp