How to find the href attribute value in an "<a>" tag using Ruby

gal*_*gal 0 ruby screen-scraping hpricot ruby-on-rails mechanize

My goal is to find the first result in a Google search and collect the site's link, so I built this script:

require 'hpricot'
require 'open-uri'
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
search_results = search_results.body
doc = Hpricot(search_results)
site = doc.search("a")[16,1]
url = site.to_s
puts url

I get a string like this:

url = <a href="http://en.wikipedia.org/wiki/Gallon" dir="ltr" class="l"><em>Gallon</em> - Wikipedia, the free encyclopedia</a>

But I only need the link (http://en.wikipedia.org/wiki/Gallon), not all of the HTML code. How can I do that? I am using these gems:

require 'hpricot'
require 'open-uri'
require 'mechanize'

Jak*_*mpl 6

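Mechanize includes Nokogiri, so you can, and should, skip Hpricot entirely. It slows your code down unnecessarily, because you are effectively parsing the same page twice.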

require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)

puts search_results.links[16].href


Jon*_*röm 6

You can get the attribute value like this:

(doc/"a")[16].attributes['href']

But I have to say that the magic number 16 seems fragile.

You also shouldn't be scraping search results; you should consider using the Custom Search API instead.
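A rough sketch of what the API route could look like, using only the standard library. It assumes the Custom Search JSON API endpoint and its key, cx and q parameters; the helper names, the placeholder credentials and the trimmed sample response are all hypothetical, so check the current API documentation before relying on any of it.

```ruby
require 'json'
require 'uri'

# Builds the request URI. api_key and engine_id are placeholders that you
# would have to obtain from Google yourself.
def search_uri(query, api_key:, engine_id:)
  uri = URI('https://www.googleapis.com/customsearch/v1')
  uri.query = URI.encode_www_form(key: api_key, cx: engine_id, q: query)
  uri
end

# Pulls the first result's link out of a JSON response body,
# or returns nil when the response contains no items.
def first_link(json_body)
  items = JSON.parse(json_body).fetch('items', [])
  items.first && items.first['link']
end

# Hypothetical, heavily trimmed response used only to exercise first_link.
sample = '{"items":[{"title":"Gallon - Wikipedia","link":"http://en.wikipedia.org/wiki/Gallon"}]}'

puts search_uri('gallon', api_key: 'YOUR_API_KEY', engine_id: 'YOUR_ENGINE_ID')
puts first_link(sample)
```

To make a real request you could pass the URI to Net::HTTP.get (after require 'net/http') and feed the returned body to first_link.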