Incorrectly encoded HTML extracted by Nokogiri

Question

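I am using Nokogiri to parse an HTML page. I need both the text and the image tags in the page, so I use inner_html rather than the content method. However, the value returned by content is correctly encoded, while the value returned by inner_html is not. Note that the page is in Chinese and does not use UTF-8 encoding.

Here is my code: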

# encoding: utf-8
require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Parse the page, telling Nokogiri that the source is GB18030-encoded
doc = Nokogiri::HTML.parse(open("http://www.sfzt.org/advise/view.asp?id=536"), nil, 'gb18030')

doc.css('td.font_info').each do |link|
  # Output: correctly encoded, but not what I need (text only): ????????
  puts link.content

  # Output: wrongly encoded: <img ....></img>???????????????????
  # Expected: <img ....></img>????????
  puts link.inner_html
end


Answer 1

小智 5

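This is covered in the "Encoding" section of the README: http://nokogiri.org/

Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return XML (like to_xml, to_html and inner_html) will return a string encoded like the source document.

So if you want the result of inner_html as a UTF-8 string, you have to convert it manually: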

puts link.inner_html.encode('utf-8') # for 1.9.x
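
To see the conversion without fetching the page, here is a minimal sketch that uses only Ruby's built-in String#encode and String#force_encoding. The gb_fragment string is a made-up stand-in for what inner_html returns on a GB18030 page, and to_utf8 is a hypothetical helper, not part of Nokogiri. The optional second argument covers the assumed case where the bytes are right but the string carries the wrong encoding label.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of Nokogiri): return a UTF-8 copy of a
# serialized fragment. If source_encoding is given, the bytes are first
# relabeled with that encoding (force_encoding changes only the label,
# not the bytes), which helps when the string is tagged incorrectly.
def to_utf8(fragment, source_encoding = nil)
  fragment = fragment.dup.force_encoding(source_encoding) if source_encoding
  fragment.encode('UTF-8')
end

# Stand-in for the string inner_html would return on a GB18030 page.
gb_fragment = '<img src="a.gif">中文'.encode('GB18030')

utf8_fragment = to_utf8(gb_fragment)
puts utf8_fragment.encoding
puts utf8_fragment
```

With a real document, the equivalent call would be to_utf8(link.inner_html).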

