RegEx match open tags except XHTML self-contained tags

#!/usr/bin/env ruby

class NewsParser

  def initialize
      Dir.glob("./**/index.htm") do |file|
        @file = IO.read file 
        parsed = @file.scan(/<h1(.*)>(.*?)<\/h1>(.*)<!-- InstanceEndEditable -->/im)
        self.write(parsed)
      end
  end

  def write output
    @contents = output
    open('output.txt', 'a') do |f| 
      f << @contents[0][0]+"\n\n"+@contents[0][1]+"\n\n\n\n" 
    end
  end

end

p = NewsParser.new

Edit: here is the error message:

news_parser.rb:10:in 'scan': invalid byte sequence in UTF-8 (ArgumentError)

Solved: the combination of @file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil) and encoding: UTF-8 fixed the problem.

Thanks!
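
The fix described above can be sketched as follows. This is a minimal, self-contained illustration that assumes the page really is ISO-8859-1 encoded. The sample markup, the temp file, and the simplified pattern are made up for the example and are not taken from the original script.

```ruby
require 'tempfile'

# Hypothetical sample page stored as ISO-8859-1, where e-acute is the single byte 0xE9.
html = "<h1 class=\"title\">Caf\u00E9 news</h1>".encode('ISO-8859-1')
page = Tempfile.new(['index', '.htm'])
page.binmode
page.write(html)
page.close

raw = File.binread(page.path)

# Tagged as UTF-8, the lone 0xE9 byte is invalid; this is the state in which scan raises.
looks_valid = raw.dup.force_encoding('UTF-8').valid_encoding?

# Relabel the bytes with their (assumed) real encoding, then transcode to UTF-8.
text = raw.force_encoding('ISO-8859-1').encode('UTF-8')
titles = text.scan(/<h1[^>]*>(.*?)<\/h1>/im).flatten

puts looks_valid
puts titles.inspect
page.unlink
```

Passing encoding: 'ISO-8859-1:UTF-8' to File.read should perform the same relabel-and-transcode step at read time, if a one-liner is preferred.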

ruby

red*_*gem

2012 03-08

21
votes

1
answer

20k
views

Rails 4.2 - How to fix ascii codes in CSV export without the gem 'iconv'?

When exporting CSV in a Rails 4.2 app, the CSV output contains ascii codes in place of the Chinese characters (UTF8):

ä¸åˆåŒç†Šå·¥ç‰ç”¨é¤

We tried these send_data options without luck:

send_data @payment_requests.to_csv, :type => 'text/csv; charset=utf-8; header=present'

and:

send_data @payment_requests.to_csv.force_encoding("UTF-8")

In the model, utf8 encoding is forced:

# encoding: utf-8

But it doesn't work. There are posts online that talk about using the gem iconv. However, iconv depends on the platform's Ruby version. Is there a cleaner solution to fix the ascii codes in a Rails 4.2 CSV export?
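
One possible cause, not confirmed by the post, is that the exported bytes are already valid UTF-8 but the spreadsheet program opens the file as a single-byte Western encoding, which produces output like the sample above. A commonly used workaround is to prepend a UTF-8 byte order mark so the program detects the encoding. The sketch below shows that idea without iconv; the helper name csv_with_bom and the sample rows are hypothetical.

```ruby
require 'csv'

# Hypothetical helper: builds CSV text from rows and prepends a UTF-8 byte order mark.
# The BOM is the code point U+FEFF, which encodes to the bytes EF BB BF in UTF-8.
def csv_with_bom(rows)
  body = CSV.generate(encoding: 'UTF-8') { |csv| rows.each { |row| csv << row } }
  "\uFEFF" + body
end

# Sample rows with Chinese characters written as escapes to keep the source ASCII-only.
data = csv_with_bom([%w[name note], ["\u718A", "\u7528\u9910"]])
puts data.bytes.first(3).map { |b| format('%02X', b) }.join(' ')

# In a Rails controller this string would be passed on unchanged, for example:
#   send_data data, type: 'text/csv; charset=utf-8', filename: 'export.csv'
```

Whether this helps depends on the program used to open the file, so it is worth checking the raw bytes of the download before changing the export.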

csv ruby-on-rails ruby-on-rails-4

use*_*363

2015 04-23

9
votes

2
answers

2327
views

Why am I getting invalid byte sequence in UTF-8

Why do I get this error?

invalid byte sequence in UTF-8

when loading an image:

= image_tag 'features_home/show1.png'

Edit

I noticed that this problem started after I ran bundle update, and the error occurs for every image. I'll try to add details here:

Stack trace:

  Rendered home/home.html.haml within layouts/application (229.9ms)
Completed 500 Internal Server Error in 1047ms
invalid byte sequence in UTF-8 excluded from capture: DSN not set

ActionView::Template::Error (invalid byte sequence in UTF-8):
    81:           / Carousel items
    82:           .carousel-inner
    83:             .active.item
    84:               = image_tag 'features_home/show1.png'
    85:               -#.carousel-caption
    86:               -#  %h4
    87:               -#  %p
  app/views/home/home.html.haml:84:in `block in _app_views_home_home_html_haml__623651309533727079_70331260863620'
  app/views/home/home.html.haml:33:in `_app_views_home_home_html_haml__623651309533727079_70331260863620'
  lib/rack/seoredirect.rb:20:in `call'


  Rendered /Users/Apple/.rvm/gems/ruby-2.2.2@myapp/gems/actionpack-4.2.0/lib/action_dispatch/middleware/templates/rescues/_source.erb (115.6ms)
  Rendered /Users/Apple/.rvm/gems/ruby-2.2.2@myapp/gems/actionpack-4.2.0/lib/action_dispatch/middleware/templates/rescues/_trace.html.erb (23.1ms)
  Rendered …


ruby-on-rails ruby-on-rails-4

sim*_*imo

2017 09-28

8
votes

1
answer

623
views

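Ruby: parsing an HTTPresponse with Nokogiri

Parsing an HTTP response with Nokogiri

Hi, I'm having trouble parsing an HTTPresponse object with Nokogiri.

I use this function to fetch a website:

Fetch a link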

def fetch(uri_str, limit = 10)


  # You should choose better exception.
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  url = URI.parse(URI.encode(uri_str.strip))
  puts url

  #get path
  req = Net::HTTP::Get.new(url.path,headers)
  #start TCP/IP
  response = Net::HTTP.start(url.host,url.port) { |http|
        http.request(req)
  }
  case response
  when Net::HTTPSuccess
    then #print final redirect to a file
    puts "this is location" + uri_str
    puts "this is the host #{url.host}"
    puts "this is the path #{url.path}"

    return response
    # if you get a …


ruby nokogiri

Max*_*Pie

lucky-day

6
votes

1
answer

6964
views

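How do I create a string with "bad encoding" in ruby?

I have a file in production that I cannot access, and when it is loaded by a ruby script, a regular expression run against its contents fails with ArgumentError => invalid byte sequence in UTF-8.

I believe I have a fix based on the answer with all the points here: ruby 1.9: invalid byte sequence in UTF-8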

# Remove all invalid and undefined characters in the given string
# (ruby 1.9.3)
def safe_str str

  # edited based on matt's comment (thanks matt)
  s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
  s.encode!('utf-8', 'utf-16')
end

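However, I now want to build my rspec to verify that the code works. I can't access the file that causes the problem, so I want to create a string with bad encoding programmatically.

I've tried variations of the following: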

bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
bad_str.length.should > safe_str(bad_str).length

or,

bad_str = (100..1000).to_a.pack('c*')
bad_str.length.should > safe_str(bad_str).length

But the length is always the same. I have also tried different character ranges, not always 100 to 1000.

Any suggestions on how to build a string with invalid encoding in a ruby 1.9.3 script?
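
One way to build such a string, offered as a sketch rather than as the accepted answer: pack raw byte values and relabel the result as UTF-8 with force_encoding. The bytes 0xFF and 0xFE never appear in well-formed UTF-8, so the string fails valid_encoding?. The first attempt above probably keeps its length because s << c with an Integer appends the code point c, which is a valid character. The byte values below are illustrative, and safe_str repeats the approach from the question.

```ruby
# Remove invalid and undefined characters (same approach as safe_str in the question).
def safe_str(str)
  s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
  s.encode!('utf-8', 'utf-16')
end

# "a", two bytes that are never legal in UTF-8, then "b" (illustrative values).
bad_str = [0x61, 0xFF, 0xFE, 0x62].pack('C*').force_encoding('UTF-8')
cleaned = safe_str(bad_str)

puts bad_str.valid_encoding?
puts cleaned.valid_encoding?
puts "#{bad_str.length} -> #{cleaned.length}"
```

In a spec, the final check could stay as written in the question: bad_str.length.should > safe_str(bad_str).length.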

ruby character-encoding

GSP*_*GSP

2017 05-23

5
votes

2
answers

1149
views

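Parsing a CSV file with different encodings and libraries

Despite the many SO threads on the subject, I'm having trouble parsing a CSV. It is a .csv file downloaded from the Adwords Keyword Planner. Previously, Adwords had an option to export data as "plain CSV" (which could be parsed with the Ruby CSV library); now the options are either Adwords CSV or Excel CSV. Both of these formats cause this problem (illustrated by a terminal session):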

file = File.open('public/uploads/testfile.csv')
 => #<File:public/uploads/testfile.csv> 

file.read.encoding
 => #<Encoding:UTF-8> 

require 'csv'
 => true 

CSV.foreach(file) { |row| puts row }
ArgumentError: invalid byte sequence in UTF-8

Let's change the encoding and see if that helps:

file.close
 => nil 

file = File.open("public/uploads/testfile.csv", "r:ISO-8859-1")
 => #<File:public/uploads/testfile.csv> 

file.read.encoding 
=> #<Encoding:ISO-8859-1> 

CSV.foreach(file) { |row| puts row }
ArgumentError: invalid byte sequence in UTF-8

Let's try a different CSV library:

require 'smarter_csv'
 => true 

file.close
 => nil 

file = SmarterCSV.process('public/uploads/testfile.csv')
ArgumentError: invalid byte sequence in UTF-8

Is this a no-win situation? Do I have to roll my own CSV parser?

I'm using Ruby 1.9.3p374. Thanks!
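
Two things may be going on in the session above, neither confirmed by the post. First, CSV.foreach is documented as taking a path, and in that case it opens the file itself, so the encoding passed to File.open earlier is probably not the one it uses. Second, spreadsheet-oriented exports are sometimes UTF-16LE with tab separators rather than UTF-8 with commas. The sketch below assumes that second case: it reads the raw bytes, checks for a UTF-16LE byte order mark, transcodes to UTF-8, and parses with a tab separator. The sample content and the ISO-8859-1 fallback are assumptions for illustration.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical export: UTF-16LE with a BOM and tab-separated columns.
sample = "\uFEFFKeyword\tClicks\ncaf\u00E9\t3\n".encode('UTF-16LE')
export = Tempfile.new(['testfile', '.csv'])
export.binmode
export.write(sample)
export.close

raw = File.binread(export.path)

text =
  if raw.start_with?("\xFF\xFE".b)
    # Drop the two BOM bytes, label the rest as UTF-16LE, convert to UTF-8.
    raw.byteslice(2..-1).force_encoding('UTF-16LE').encode('UTF-8')
  else
    # Assumed fallback: treat the bytes as ISO-8859-1, where every byte decodes.
    raw.force_encoding('ISO-8859-1').encode('UTF-8')
  end

rows = CSV.parse(text, col_sep: "\t")
puts rows.length
export.unlink
```

Inspecting the first few bytes of the real file (for example with File.binread(path, 4).bytes) would show whether the UTF-16LE assumption holds before any parsing code is changed.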

Update 1:

Using the suggestions from the comments, this is the current version:

file_contents = File.open("public/uploads/new-format/testfile-adwords.csv", 'rb').read

require …


ruby csv parsing google-adwords

abb*_*jam

2013 12-22

5
votes

1
answer

4617
views

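Regex to capture colon-separated key-value pairs with multi-line values

I'm currently developing a project in Ruby on Rails (in Eclipse), and my task is to split a block of data into its relevant parts using regular expressions.

I decided to break the data up based on 3 parameters:

The line must start with a capital letter (regex equivalent - /^[A-Z]/)
It must end with a : (regex equivalent - /$":"/)

I would appreciate any help... the code I'm using in the controller is: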

@f = File.open("report.rtf")  
@fread = @f.read  
@chunk = @fread.split(/\n/)

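where @chunk is the array that will be created by the split, and @fread is the data to be split (on newlines).

Any help would be greatly appreciated, thank you very much!

I can't post the exact data, but it is basically like this (it is medical-related):

Exam 1: CBW 8080

RESULT:

This report is dictated with specific measurement. Please see the original report.

COMPARISON: 1/30/2012, 3/8/12, 4/9/12

RECIST 1.1: BLAH BLAH BLAH

The ideal output is an array containing: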

["Exam 1:", "CBW 8080", "RESULT", "This report is dictated with specific measurement. Please see the original report.", "COMPARISON:", "1/30/2012, 3/8/12, 4/9/12", "RECIST 1.1:", "BLAH BLAH BLAH"]

PS I'm just using \n as a placeholder until it works properly
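
A possible approach, sketched under assumptions: instead of splitting on newlines, scan for a key (a line that starts with a capital letter, up to its first colon) followed by a value that runs until the next key or the end of the input. The pattern and variable names are illustrative, the sample text is rebuilt from the desired output above, and the sketch treats the file as plain text.

```ruby
# Sample data reconstructed from the desired output in the question.
report = <<~TEXT
  Exam 1: CBW 8080

  RESULT:

  This report is dictated with specific measurement. Please see the original report.

  COMPARISON: 1/30/2012, 3/8/12, 4/9/12

  RECIST 1.1: BLAH BLAH BLAH
TEXT

# Key: line start, capital letter, everything up to the first colon on that line.
# Value: everything (newlines included, thanks to /m) up to the next key or end of input.
pair = /^([A-Z][^:\n]*:)\s*(.*?)(?=^[A-Z][^:\n]*:|\z)/m

chunks = report.scan(pair).flat_map { |key, value| [key, value.strip] }
puts chunks.inspect
```

One limitation of this design: a value line that itself starts with a capital letter and contains a colon (a time such as 10:30, for instance) would be treated as a new key, so real reports may need a stricter key pattern.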

ruby regex eclipse ruby-on-rails

Joh*_*ugh

2012 10-21

4
votes

1
answer

4528
views

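Ruby invalid byte sequence in UTF-8 (ArgumentError)

Possible duplicate:
ruby 1.9: invalid byte sequence in UTF-8

I'm building a filesystem crawler and get the following error when running my script: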

wordcrawler.rb:8:in `block in <main>': invalid byte sequence in UTF-8 (ArgumentError)
    from /Users/Anconia/.rvm/rubies/ruby-1.9.3-p327/lib/ruby/1.9.1/find.rb:41:in `block in find'
    from /Users/Anconia/.rvm/rubies/ruby-1.9.3-p327/lib/ruby/1.9.1/find.rb:40:in `catch'
    from /Users/Anconia/.rvm/rubies/ruby-1.9.3-p327/lib/ruby/1.9.1/find.rb:40:in `find'
    from wordcrawler.rb:5:in `<main>'

Here is my code:

require 'find'

count = 0

Find.find('/Users/Anconia/') do |file|                   # '/' for root directory on OS X
  if file =~ /\b(\.txt|\.doc|\.docx)\b/                # check if filename ends in desired format
    contents = File.read(file)
    if contents =~ /regex/
      puts file
      count += 1
    end
  end
end

puts "#{count} files were found"

In my development environment I use ruby 1.9.3; however, when I switch to ruby …

ruby encoding web-crawler

Anc*_*nia

2017 05-23

3
votes

1
answer

20k
views

File.readlines invalid byte sequence in UTF-8 (ArgumentError)

I'm processing a file that contains data from the web, and I hit an invalid byte sequence in UTF-8 (ArgumentError) error on some log files.

a = File.readlines('log.csv').grep(/watch\?v=/).map do |s|
s = s.parse_csv;
{ timestamp: s[0], url: s[1], ip: s[3] }
end
puts a

I'd like to get this solution working. I've seen people use

.encode!('UTF-8', 'UTF-8', :invalid => :replace)

but it doesn't seem to work with File.readlines.

File.readlines('log.csv').encode!('UTF-8', 'UTF-8', :invalid => :replace).grep(/watch\?v=/)

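': undefined method `encode!' for # (NoMethodError)

What is the most straightforward way to filter or convert invalid UTF-8 characters while reading the file?

~~Attempt 1~~

I tried this, but it failed with the same invalid byte sequence error.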

IO.foreach('test.csv', 'r:bom|UTF-8').grep(/watch\?v=/).map do |s|
  # extract three columns: time stamp, url, ip
  s = s.parse_csv;
  { timestamp: s[0], url: s[1], ip: s[3] }
end

Solution

This seems to work for me.

a = File.readlines('log.csv', :encoding => 'ISO-8859-1').grep(/watch\?v=/).map do |s|
s …

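
On the earlier attempt with encode!: File.readlines returns an Array, which is why encode! is undefined there, so any per-string cleanup has to be applied to each line. Below is a minimal sketch of that idea, assuming Ruby 2.1 or later for String#scrub (older versions do not have it). The sample log contents and the '?' replacement character are made up for the example.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical log: the second line contains a stray 0xE9 byte, which is invalid in UTF-8.
sample = "t1,http://example.com/watch?v=abc,x,1.2.3.4\n".b +
         "t2,http://example.com/watch?v=caf\xE9,x,5.6.7.8\n".b
log = Tempfile.new(['log', '.csv'])
log.binmode
log.write(sample)
log.close

rows = File.foreach(log.path, encoding: 'UTF-8')
           .map { |line| line.scrub('?') } # replace each invalid byte with "?"
           .grep(/watch\?v=/)
           .map do |line|
  fields = line.parse_csv
  { timestamp: fields[0], url: fields[1], ip: fields[3] }
end

puts rows.length
log.unlink
```

Reading with :encoding => 'ISO-8859-1', as in the solution above, avoids the error in a different way: every byte is a valid character in that encoding, so nothing is replaced, but non-ASCII text is only correct if the file really is ISO-8859-1.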

ruby

pab*_*808

2017 05-23

2
votes

1
answer

7529
views

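Ruby 1.9.3: explanation needed for invalid byte sequence in UTF-8

I installed RVM and Ruby through Cygwin on Windows 7. I'm now trying to install the Omega package by following this guide. The command is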

bundle install

This gives the error 'command not found'. The solution is to install bundler with

gem install bundler

But this produces an "invalid byte sequence in UTF-8" error. A solution to this problem is described in this post, but I don't understand where I should put this snippet.

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end

Please explain where this code should be placed.

Thanks!

ruby encode utf-8 rvm

ban*_*per

2017 05-23

1
vote

1
answer

2132
views