zor*_*ras 5 ruby csv encoding parsing
I am using Ruby 1.9 to parse the following CSV file, which contains MacRoman characters:
# encoding: ISO-8859-1
#csv_parse.csv
Name, main-dialogue
"Marceu", "Give it to him ó he, his wife."
This is what I did to parse it:
require 'csv'
input_string = File.read("../csv_parse.rb").force_encoding("ISO-8859-1").encode("UTF-8")
#=> "Name, main-dialogue\r\n\"Marceu\", \"Give it to him \x97 he, his wife.\"\r\n"
data = CSV.parse(input_string, :quote_char => "'", :col_sep => "/\",/")
#=> [["Name, main-dialogue"], ["\"Marceu", " \"Give it to him \x97 he, his wife.\""]]
The problem is that the second array in data is not split into the 2 strings I expect, like:
["\"Marceu\"", " \"Give it to him \x97 he, his wife.\""]]
I tried :col_sep => "," (which is the default behavior), but it gave me 3 splits.
header = CSV.parse(input_string, :quote_char => "'")[0].map{|a| a.strip.downcase unless a.nil? }
#=> ["name", "main-dialogue"]
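I have to parse the header separately, because it has no double quotes.
The output is meant to be displayed in a browser, so the character ó should be displayed normally rather than as \x97 or something similar.
Is there any way to solve the problem above?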
I think you do have MacRoman-encoded data; if you run this in irb:
>> "\x97".force_encoding('MacRoman').encode('UTF-8')
you get this:
=> "ó"
That looks like the character you are expecting. So you want this:
input_string = File.read("../csv_parse.rb").force_encoding('MacRoman').encode('UTF-8')
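The conversion step can be tried in isolation. The following is a minimal sketch, assuming an in-memory sample string (hypothetical, standing in for the File.read call) that contains the single byte 0x97:

```ruby
# Hypothetical in-memory sample standing in for File.read on the real file.
# "\x97" is the single MacRoman byte for "ó"; .b tags the string as raw bytes.
raw = "Give it to him \x97 he, his wife.".b

# force_encoding only relabels the bytes (the data itself is unchanged);
# encode then transcodes them from MacRoman to UTF-8.
utf8 = raw.force_encoding("MacRoman").encode("UTF-8")

puts utf8
puts utf8.encoding
puts utf8.valid_encoding?
```

Labeling the bytes as ISO-8859-1 instead would still transcode without raising, but 0x97 maps to an unprintable control character in that encoding, which is likely why the browser output looked wrong.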
Your CSV then has two columns, the columns are quoted with double quotes (so you don't need :quote_char), and the delimiter is ', ', so this should work:
data = CSV.parse(input_string, :col_sep => ", ")
and data will look like this:
[
["Name", "main-dialogue"],
["Marceu", "Give it to him ó he, his wife."]
]
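To check the parsing step on its own, here is a small sketch that assumes the file contents are already available as a UTF-8 string (the sample literal below is hypothetical and mirrors the file from the question):

```ruby
require 'csv'

# Hypothetical sample mirroring the file from the question, already in UTF-8.
input_string = "Name, main-dialogue\r\n\"Marceu\", \"Give it to him ó he, his wife.\"\r\n"

# Fields are separated by a comma followed by a space, so pass that as :col_sep.
# The default quote character (") is kept, which protects the ", " that
# appears inside the second quoted field.
data = CSV.parse(input_string, :col_sep => ", ")

data.each { |row| p row }
```

With :quote_char => "'" the double quotes are treated as ordinary characters, so the comma inside the dialogue is no longer protected and the row splits into extra pieces.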