Regex error: too many multibyte code ranges are specified

And*_*own 3 ruby regex utf-8 character-encoding ruby-1.9

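I have a regular expression that needs to match a large set of characters. The code works fine in Ruby 1.8.7, but it fails in 1.9. I suspect it has something to do with encoding. I have done a lot of googling without success, so maybe someone can enlighten me.

Code: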

# encoding: utf-8
non_latin_hashtag_chars = [
  (0xA960..0xA97F).to_a, # Hangul Jamo Extended-A
  (0xAC00..0xD7AF).to_a, # Hangul Syllables
  (0xD7B0..0xD7FF).to_a  # Hangul Jamo Extended-B
].flatten.pack('U*').freeze

e = /[a-z_#{non_latin_hashtag_chars}]/io

Error:

~/Desktop: ruby regex_test.rb 
regex_test.rb:13:in `<main>': too many multibyte code ranges are specified: /[a-z_??????????????????????????????????????????????????????????????????????????????......

Mar*_*une 7

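As twehad pointed out, there is a 10K limit in regular expressions.

In any case, you should use Unicode ranges in the Regexp instead of interpolating every character: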
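
To see why the original pattern runs into that limit, here is a minimal sketch that counts how many individual characters the question's `pack('U*')` string puts into the character class. The names `ranges` and `total` are only illustrative, and it is an assumption that the engine counts each interpolated character as its own entry.

```ruby
# encoding: utf-8
# The same three blocks used in the question.
ranges = [
  (0xA960..0xA97F), # Hangul Jamo Extended-A
  (0xAC00..0xD7AF), # Hangul Syllables
  (0xD7B0..0xD7FF)  # Hangul Jamo Extended-B
]

# Total number of code points that end up inside the brackets
# when each one is interpolated as a separate character.
total = ranges.map { |r| r.to_a.size }.inject(:+)
puts total # 11296
```

That is more than 10,000 separate entries, whereas writing the blocks as ranges needs only three.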

/[a-z_\uA960-\uA97F\uAC00-\uD7AF\uD7B0-\uD7FF]/io
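
If the list of blocks has to stay in a data structure, as in the question, one possible approach is to build the `\uXXXX-\uXXXX` ranges as text and hand them to `Regexp.new`. This is only a sketch: `HANGUL_RANGES`, `char_class` and `hashtag_char` are hypothetical names, not part of the original code.

```ruby
# encoding: utf-8
# Hypothetical table of [first, last] code points, taken from the question.
HANGUL_RANGES = [
  [0xA960, 0xA97F], # Hangul Jamo Extended-A
  [0xAC00, 0xD7AF], # Hangul Syllables
  [0xD7B0, 0xD7FF]  # Hangul Jamo Extended-B
].freeze

# Emit each block as a literal "\uXXXX-\uXXXX" range instead of
# expanding it into thousands of individual characters.
char_class = HANGUL_RANGES.map { |lo, hi| format('\u%04X-\u%04X', lo, hi) }.join

hashtag_char = Regexp.new("[a-z_#{char_class}]", Regexp::IGNORECASE)

puts hashtag_char.source
puts "\uD55C" =~ hashtag_char ? "match" : "no match" # U+D55C is a Hangul syllable
```

The single quotes in `format` keep `\u` as a literal backslash sequence, so the regex engine, not the Ruby string parser, interprets the escapes.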

I'm no expert on Korean, so I don't know whether it is equivalent, but if you want to match all Korean characters, you should use this character property instead:

/[a-z_\p{Hangul}]/io
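
Whether the two patterns are equivalent can be checked directly. The sketch below compares them on two sample code points; `explicit` and `by_script` are illustrative names. It suggests that `\p{Hangul}` is broader, since it follows the Unicode Hangul script and therefore also covers the basic Hangul Jamo block (U+1100 to U+11FF), which the explicit ranges leave out.

```ruby
# encoding: utf-8
explicit  = /[\uA960-\uA97F\uAC00-\uD7AF\uD7B0-\uD7FF]/
by_script = /\p{Hangul}/

syllable = "\uD55C" # a Hangul syllable, inside the explicit ranges
jamo     = "\u1100" # a basic Hangul Jamo, outside the explicit ranges

puts [!!(syllable =~ explicit), !!(syllable =~ by_script)].inspect # [true, true]
puts [!!(jamo =~ explicit),     !!(jamo =~ by_script)].inspect     # [false, true]
```

So the property class is a reasonable choice when every Hangul character should count, while the explicit ranges are the safer choice when only those three blocks are wanted.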