Importing CSV in Rails - invalid byte sequence in UTF-8 for non-English characters

bru*_*077 3 ruby csv import rubygems ruby-on-rails

I am using the CSVMapper gem to import some records from a CSV file into a Rails 3 model. (I use this gem because it is the simplest approach I have found.)

Anyway, the code I use to import the records is the following:

r = import('doc/socios_full.csv') do
    map_to Associate
    after_row lambda{|row, associate| associate.save }
    start_at_row 1
    [group,member,family_relationship_code,family_relationship_description,last_name,names,...]
#The previous line is actually longer, with more atts, but it's been cut to explain the example
end

It works very well, except when the parser encounters non-English characters such as ó, é, ñ, í, °.... At that point I get the following error:

ArgumentError: invalid byte sequence in UTF-8
    from /home/bcb/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/csv.rb:1831:in `sub!'
    from /home/bcb/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/csv.rb:1831:in `block in shift'
    from /home/bcb/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/csv.rb:1825:in `loop'
    from /home/bcb/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/csv.rb:1825:in `shift'
    from /home/bcb/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/csv.rb:1767:in `each'
    from /home/bcb/.rvm/gems/ruby-1.9.2-p136/gems/csv-mapper-0.5.1/lib/csv-mapper.rb:106:in `each_with_index'
    from /home/bcb/.rvm/gems/ruby-1.9.2-p136/gems/csv-mapper-0.5.1/lib/csv-mapper.rb:106:in `import'
    from (irb):63
    from /home/bcb/.rvm/gems/ruby-1.9.2-p136/gems/railties-3.0.9/lib/rails/commands/console.rb:44:in `start'
    from /home/bcb/.rvm/gems/ruby-1.9.2-p136/gems/railties-3.0.9/lib/rails/commands/console.rb:8:in `start'
    from /home/bcb/.rvm/gems/ruby-1.9.2-p136/gems/railties-3.0.9/lib/rails/commands.rb:23:in `<top (required)>'
    from script/rails:6:in `require'
    from script/rails:6:in `<main>'

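I know these characters are the cause because if I replace all of them, the problem goes away until the parser finds another non-English character. The problem is that I have a file with 50k records, so hunting for every character I can think of and retrying the whole import each time is very time-consuming.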

Is there a way to ignore these errors and let the parser continue? Or is there a simpler way to import this CSV file?

小智 14

Do it like this:

CSV.foreach(filename, :headers => true, :encoding => 'ISO-8859-1') do |row|

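I ran into the same problem when trying to read a CSV file that had been saved from MS Excel. You can specify the encoding as an option. I guess it assumes UTF-8 by default.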
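Here is a minimal sketch of the same idea that can be run on its own. It assumes the file really is ISO-8859-1 (Latin-1); that is a guess based on the characters mentioned in the question, so adjust it if the file was saved differently. The helper name `read_csv_as_utf8` and the sample data are made up for illustration. Passing the encoding as `"ISO-8859-1:UTF-8"` asks Ruby to read the bytes as Latin-1 and hand back UTF-8 strings, which is usually what a Rails model expects.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: read a CSV stored in `source_encoding` and return
# its rows as hashes whose values are UTF-8 strings.
# The default source encoding is an assumption, not something the file declares.
def read_csv_as_utf8(path, source_encoding = 'ISO-8859-1')
  rows = []
  # "EXTERNAL:INTERNAL" means: decode the file as EXTERNAL, transcode to INTERNAL.
  CSV.foreach(path, headers: true, encoding: "#{source_encoding}:UTF-8") do |row|
    rows << row.to_hash
  end
  rows
end

# Build a small sample file containing Latin-1 bytes (made-up data).
# "\u00F1" is ñ and "\u00E9" is é.
sample = Tempfile.new(['sample', '.csv'])
sample.binmode
sample.write("last_name,names\nMu\u00F1oz,Jos\u00E9\n".encode('ISO-8859-1'))
sample.close

rows = read_csv_as_utf8(sample.path)
sample.unlink

puts rows.first.inspect
```

If the import should go through a gem that does not expose an encoding option, one alternative is to convert the file to UTF-8 once and then import the converted copy.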