Web scraping to fill out (and retrieve results from) search forms?

hat*_*rix 6 forms search screen-scraping doi

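I would like to know whether it is possible to "automate" the task of typing entries into search forms and extracting matches from the results. For instance, I have a list of journal articles for which I would like to get DOIs (digital object identifiers). Manually, I would go to the journal's article search page (e.g., http://pubs.acs.org/search/advanced), type in the authors/title/volume (etc.), find the article in the list of returned results, then pick out the DOI and paste it into my reference list. I use R and Python for data analysis regularly (I was inspired by a post on RCurl) but don't know much about web protocols... Is such a thing possible (for instance, using something like Python's BeautifulSoup)? Are there any good references for doing anything remotely similar to this task? I'm just as interested in learning about web scraping and scraping tools in general as I am in getting this particular task done... Thanks for your time!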

mix*_*nic 9

Beautiful Soup is great for parsing web pages - that's half of what you want to do. Python, Perl, and Ruby all have a version of Mechanize, and that's the other half:

http://wwwsearch.sourceforge.net/mechanize/

Mechanize lets you control a browser from a script:

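import mechanize

# Create a scriptable browser; open a page with browser.open(url) before using it
browser = mechanize.Browser()

# Follow a link (link_node is a link object found on the current page)
browser.follow_link(link_node)

# Submit a form
browser.select_form(name="search")
browser["authors"] = ["author #1", "author #2"]
browser["volume"] = "any"
search_response = browser.submit()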
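
For readers who, like the asker, are new to web protocols: submitting a search form boils down to sending the form's field names and values to the server in an HTTP request. Here is a minimal sketch using only the Python 3 standard library. The URL `https://example.org/search` and the field names `author`, `title`, and `volume` are hypothetical placeholders, not the real fields of any journal site; the actual ones have to be read from the form's HTML. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical search endpoint and field names; a real form's action URL
# and input names must be read from the page's HTML.
SEARCH_URL = "https://example.org/search"


def build_search_request(author, title, volume, method="GET"):
    """Build (but do not send) the HTTP request a search form would submit."""
    fields = {"author": author, "title": title, "volume": volume}
    encoded = urlencode(fields)  # percent-encodes values and joins them with "&"
    if method.upper() == "GET":
        # GET forms append the encoded fields to the URL as a query string.
        return Request(SEARCH_URL + "?" + encoded, method="GET")
    # POST forms send the same encoded fields in the request body.
    return Request(SEARCH_URL, data=encoded.encode("ascii"), method="POST")


request = build_search_request("Smith J", "Catalysis & kinetics", "12")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it and return the results page; a library such as Mechanize handles this step, plus cookies and form discovery, for you.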

With Mechanize and Beautiful Soup you will be off to a good start. One more tool I would consider is Firebug, as used in this quick Ruby scraping guide:

http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/

Firebug can speed up building the XPath expressions you need to parse a document, saving you a lot of time.
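
To give a feel for what such an XPath expression does once you have it, here is a small sketch using the limited XPath subset in Python's standard `xml.etree.ElementTree`. The page snippet, its class names, and the DOIs are made up for illustration. ElementTree only accepts well-formed XML, so for real-world HTML you would more likely hand the same kind of expression to a forgiving parser such as lxml.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed result page; real HTML is rarely valid XML and
# usually needs a forgiving parser (e.g. lxml or Beautiful Soup).
PAGE = """
<html>
  <body>
    <div class="result">
      <span class="title">First article</span>
      <a class="doi" href="https://doi.org/10.1000/example.1">DOI</a>
    </div>
    <div class="result">
      <span class="title">Second article</span>
      <a class="doi" href="https://doi.org/10.1000/example.2">DOI</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# XPath: every <a class="doi"> directly inside a <div class="result">,
# where the <div> may sit at any depth below the root.
doi_links = root.findall(".//div[@class='result']/a[@class='doi']")
hrefs = [link.get("href") for link in doi_links]
print(hrefs)
```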

Good luck!


Anonymous 5

Python code for driving a search form:

# Imports
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait  # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC  # available since 2.26.0

# Create a new instance of the Firefox driver
driver = webdriver.Firefox()

# Go to the Google home page
driver.get("http://www.google.com")

# The page is ajaxy, so the title is originally this:
print(driver.title)

# Find the element whose name attribute is q (the Google search box).
# find_element_by_name was removed in Selenium 4; use By.NAME instead.
input_element = driver.find_element(By.NAME, "q")

# Type in the search
input_element.send_keys("cheese!")

# Submit the form (although Google automatically searches now without submitting)
input_element.submit()

try:
    # We have to wait for the page to refresh; the last thing that seems
    # to be updated is the title
    WebDriverWait(driver, 10).until(EC.title_contains("cheese!"))

    # You should see "cheese! - Google Search"
    print(driver.title)

finally:
    driver.quit()

Source: https://www.seleniumhq.org/docs/03_webdriver.jsp
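
Once a results page has loaded, the remaining step in the original question is to pull the DOI out of it. One possible sketch, assuming the DOI appears somewhere in the page text or in a link: run a regular expression over the page source (with Selenium, the string in `driver.page_source`). The pattern below is a common heuristic rather than a complete DOI grammar, and the sample fragment and its DOIs are made up.

```python
import re

# Heuristic pattern for modern DOIs: "10.", a 4-9 digit registrant prefix,
# "/", then a suffix. It is an approximation, not a full DOI grammar.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_dois(html):
    """Return the unique DOI-like strings in html, in order of appearance."""
    seen = []
    for match in DOI_PATTERN.findall(html):
        doi = match.rstrip(".;")  # drop trailing sentence punctuation
        if doi not in seen:
            seen.append(doi)
    return seen


# Made-up page fragment standing in for driver.page_source.
sample = (
    '<a href="https://doi.org/10.1000/example.1">10.1000/example.1</a>'
    "<p>See also doi:10.1000/example.2.</p>"
)
print(extract_dois(sample))
```

If a page lists several articles, it is safer to narrow the search to the matching result's element first and only then look for the DOI inside it.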