Parsing CSV with commas, double quotes, and encoding

zor*_*ras 5 ruby csv encoding parsing

I am using Ruby 1.9 to parse the following CSV file, which contains MacRoman characters:

# encoding: ISO-8859-1
#csv_parse.csv
Name, main-dialogue
"Marceu", "Give it to him ó he, his wife."

I did the following to parse it:

require 'csv'
input_string = File.read("../csv_parse.rb").force_encoding("ISO-8859-1").encode("UTF-8")
 #=> "Name, main-dialogue\r\n\"Marceu\", \"Give it to him  \x97 he, his wife.\"\r\n"

data = CSV.parse(input_string, :quote_char => "'", :col_sep => "/\",/")
 #=> [["Name, main-dialogue"], ["\"Marceu", " \"Give it to him  \x97 he, his wife.\""]]

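So the problem is that the second array in data is a single string rather than 2 strings, like: ["\"Marceu\"", " \"Give it to him \x97 he, his wife.\""]] I tried :col_sep => "," (which is the default behavior), but it gave me 3 splits.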

header = CSV.parse(input_string, :quote_char => "'")[0].map{|a| a.strip.downcase unless a.nil? }
 #=> ["Name", "main-dialogue"]

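I have to parse the header separately, since it has no double quotes.

The output is meant to be displayed in a browser, so the character ó should show up as usual rather than as \x97 or something else.

Is there any way to solve the problem above?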

mu *_*ort 8

I think you really do have MacRoman-encoded data; if you do this in irb:

>> "\x97".force_encoding('MacRoman').encode('UTF-8')

you get this:

=> "ó"

That seems to be the character you are expecting. So you want this:

input_string = File.read("../csv_parse.rb").force_encoding('MacRoman').encode('UTF-8')
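To see why the encoding label matters, here is a minimal sketch that decodes the same byte under both encodings, using only core String methods. The byte 0x97 comes from the question; the variable names are illustrative and not part of the original code.

```ruby
# The same raw byte, interpreted under two different source encodings.
raw = "\x97".b # a single byte, 0x97, tagged as binary

as_latin1   = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
as_macroman = raw.dup.force_encoding("MacRoman").encode("UTF-8")

# ISO-8859-1 maps 0x97 to the control character U+0097, which a browser will not render as a letter.
puts as_latin1.codepoints.map { |c| format("U+%04X", c) }.join(" ")

# MacRoman maps 0x97 to U+00F3, which is the letter "ó".
puts as_macroman.codepoints.map { |c| format("U+%04X", c) }.join(" ")
puts as_macroman
```

Note that force_encoding only relabels the bytes; the actual conversion happens in encode, so a wrong label silently produces a wrong character rather than an error.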

Then your CSV has two columns, the columns are quoted with double quotes (so you don't need :quote_char), and the delimiter is ', ', so this should work:

data = CSV.parse(input_string, :col_sep => ", ")

And data looks like this:

[
    ["Name", "main-dialogue"],
    ["Marceu", "Give it to him  ó he, his wife."]
]
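Putting the two steps together, the following is a self-contained sketch that uses an in-memory byte string in place of the file, so it can be tried without the original data. The raw_bytes literal is an assumption reconstructed from the question's output (with 0x97 as the MacRoman byte for "ó"), and the header normalization at the end mirrors the strip/downcase step from the question.

```ruby
require 'csv'

# Stand-in for the file contents: the question's data as raw MacRoman bytes.
raw_bytes = "Name, main-dialogue\r\n\"Marceu\", \"Give it to him  \x97 he, his wife.\"\r\n".b

# Label the bytes as MacRoman, then transcode to UTF-8 for display in a browser.
text = raw_bytes.force_encoding("MacRoman").encode("UTF-8")

# Fields are separated by a comma plus a space, and double quotes are the
# default quote character, so only the column separator needs to change.
data = CSV.parse(text, col_sep: ", ")

# The header row comes out of the same parse, so no second pass is needed.
header = data.first.map { |h| h.strip.downcase }

data.each { |row| p row }
p header
```

Because the comma inside "he, his wife." sits between double quotes, it stays part of the field and the second row still has exactly two columns.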