Parsing an HTTP response with Nokogiri
Hi, I'm having trouble parsing an HTTP response object with Nokogiri.
I use this function to fetch a website:
def fetch(uri_str, limit = 10)
  # You should choose better exception.
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0
  url = URI.parse(URI.encode(uri_str.strip))
  puts url
  #get path
  req = Net::HTTP::Get.new(url.path,headers)
  #start TCP/IP
  response = Net::HTTP.start(url.host,url.port) { |http|
    http.request(req)
  }
  case response
  when Net::HTTPSuccess
  then #print final redirect to a file
    puts "this is location" + uri_str
    puts "this is the host #{url.host}"
    puts "this is the path #{url.path}"
    return response
  # if you get a 302 response
  when Net::HTTPRedirection
  then
    puts "this is redirect" + response['location']
    return fetch(response['location'],aFile, limit - 1)
  else
    response.error!
  end
end
html = fetch("http://www.somewebsite.com/hahaha/")
puts html
noko = Nokogiri::HTML(html)
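When I do this, html prints out a bunch of garbage, and Nokogiri complains that "node_set must be a Nokogiri::XML::NodeSet".
If anyone could help, it would be much appreciated.
First things first: your fetch method returns a Net::HTTPResponse object, not just the body. You should give Nokogiri the body.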
response = fetch("http://www.somewebsite.com/hahaha/")
puts response.body
noko = Nokogiri::HTML(response.body)
I've updated your script so that it runs (below). A few things were undefined.
require 'nokogiri'
require 'net/http'
require 'uri'

def fetch(uri_str, limit = 10)
  # You should choose better exception.
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  # URI.encode was removed in Ruby 3.0, so parse the stripped string directly
  url = URI.parse(uri_str.strip)
  puts url
  # build the GET request; headers was undefined in the original script
  headers = {}
  req = Net::HTTP::Get.new(url.request_uri, headers)
  # start TCP/IP (use TLS when a redirect points at an https URL)
  response = Net::HTTP.start(url.host, url.port, use_ssl: url.scheme == 'https') { |http|
    http.request(req)
  }
  case response
  when Net::HTTPSuccess
    # print the final location
    puts "this is location " + uri_str
    puts "this is the host #{url.host}"
    puts "this is the path #{url.path}"
    return response
  # if you get a 302 (or other redirect) response
  when Net::HTTPRedirection
    puts "this is redirect " + response['location']
    # aFile was undefined in the original call, so it is dropped here
    return fetch(response['location'], limit - 1)
  else
    response.error!
  end
end

response = fetch("http://www.google.com/")
puts response
noko = Nokogiri::HTML(response.body)
puts noko
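The script runs without errors and prints the content. You may be getting the Nokogiri error because of the content you are receiving. One common problem I've run into with Nokogiri is character encoding. Without the exact error, it's impossible to tell what is going on.
I'd suggest looking at the following StackOverflow questions:
ruby 1.9: invalid byte sequence in UTF-8 (specifically this answer)
How to convert a Net::HTTP response to a certain encoding in Ruby 1.9.1?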
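If character encoding does turn out to be the culprit, something along these lines might help. This is only a rough sketch: to_utf8 is a hypothetical helper (it is not part of Net::HTTP or Nokogiri), and it assumes the body arrives as raw bytes whose charset you either know (for example from the Content-Type header) or are willing to guess.

```ruby
# Hypothetical helper, not a library method: returns a copy of +body+
# that is valid UTF-8.
#   body     - raw bytes, e.g. what response.body hands back
#   declared - the charset you believe the bytes are in (an assumption
#              supplied by the caller, e.g. taken from Content-Type)
def to_utf8(body, declared = 'UTF-8')
  # Reinterpret the bytes in the declared charset without changing them.
  text = body.dup.force_encoding(declared)

  # If the bytes are valid in that charset, transcode them to UTF-8.
  return text.encode('UTF-8') if text.valid_encoding?

  # Otherwise treat the bytes as UTF-8 and replace each invalid
  # sequence with '?' so later parsing does not choke on them.
  body.dup.force_encoding('UTF-8').scrub('?')
end

latin1_bytes = "caf\xE9".b                 # "café" encoded as ISO-8859-1
puts to_utf8(latin1_bytes, 'ISO-8859-1')   # transcoded to UTF-8
puts to_utf8(latin1_bytes)                 # wrong guess: invalid byte replaced
```

The cleaned string can then go to Nokogiri in place of the raw body, e.g. Nokogiri::HTML(to_utf8(response.body), nil, 'UTF-8'). Whether this fixes your error depends on what the server actually sends, so treat it as a starting point rather than a diagnosis.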