gal*_*gal 0 ruby screen-scraping hpricot ruby-on-rails mechanize
My goal is to find the first result in Google's search results and collect that site's link, so I wrote this script:
require 'hpricot'
require 'open-uri'
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
search_results = search_results.body
doc = Hpricot(search_results)
site = doc.search("a")[16,1]
url = site.to_s
puts url
I get a string like this:
url = <a href="http://en.wikipedia.org/wiki/Gallon" dir="ltr" class="l"><em>Gallon</em> - Wikipedia, the free encyclopedia</a>
But I only need the link (http://en.wikipedia.org/wiki/Gallon), not all the HTML code. How can I do that? I am using these gems:
require 'hpricot'
require 'open-uri'
require 'mechanize'
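Mechanize already includes Nokogiri, so you can, and should, skip Hpricot entirely. It slows your code down unnecessarily, because you are effectively parsing the same page twice.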
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
puts search_results.links[16].href
You can get the attribute's value like this:
(doc/"a")[16].attributes['href']
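But I have to say, the magic number 16 seems brittle: it depends on how many other links happen to precede the first result on the page.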
You also shouldn't be scraping search results; you should consider using the Custom Search API instead.
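As a rough illustration of the API route, the sketch below builds a request URI for the Custom Search JSON API and pulls the first result's link out of a response body. The helper names `custom_search_uri` and `first_result_link` are hypothetical, `YOUR_API_KEY` and `YOUR_ENGINE_ID` are placeholders for credentials you would obtain from Google, and the sample response is a trimmed, made-up body that only mimics the `items`/`link` shape of a real one.

```ruby
require 'json'
require 'uri'

# Build the request URI for the Custom Search JSON API.
# api_key and engine_id are placeholders for your own credentials.
def custom_search_uri(query, api_key:, engine_id:)
  uri = URI('https://www.googleapis.com/customsearch/v1')
  uri.query = URI.encode_www_form(key: api_key, cx: engine_id, q: query)
  uri
end

# Return the link of the first result, or nil when the response has no items.
def first_result_link(json_body)
  items = JSON.parse(json_body).fetch('items', [])
  items.empty? ? nil : items.first['link']
end

# Made-up response body, trimmed to the fields used here.
sample_body = '{"items":[{"title":"Gallon - Wikipedia","link":"http://en.wikipedia.org/wiki/Gallon"}]}'

puts custom_search_uri('gallon', api_key: 'YOUR_API_KEY', engine_id: 'YOUR_ENGINE_ID')
puts first_result_link(sample_body)
```

To run a real query you would fetch the URI, for instance with `Net::HTTP.get(uri)` after `require 'net/http'`, and pass the returned body to `first_result_link`.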