Preserving umlauts when using split in Ruby

Question

Why does this code (whose input contains an umlaut):

text = "Some super text with a german umlaut Wirtschaftsprüfer"
words = text.split(/\W+/)
words.each do |w|
  puts w
end

return this result (the umlaut from the input is lost):

=> Some
=> super
=> text
=> with
=> a
=> german
=> umlaut
=> Wirtschaftspr
=> fer

Is there a way to preserve umlauts when splitting a string in Ruby 1.9+?

Edit: I am using ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-darwin11.4.2]

Answer 1

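\W matches only non-word characters, and in Ruby it is equivalent to [^a-zA-Z0-9_]. Umlauts and other non-ASCII letters therefore count as non-word characters, so split treats the ü as a separator. You can use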

words = text.split(/[^[:word:]]+/)

which splits on anything that is not a Unicode "word" character, or

words = text.split(/[^\p{Latin}]+/)

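which splits on anything that is not a character in the Unicode Latin script.
Note that both patterns also treat accented and special characters from other languages as part of a word, not just German ones.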
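To make the difference concrete, here is a minimal sketch that runs all three patterns on the text from the question. The variable names and the French sample sentence are my own additions for illustration, and the `+` quantifier is assumed so that runs of separators do not produce empty strings.

```ruby
# encoding: utf-8
text = "Some super text with a german umlaut Wirtschaftsprüfer"

# \W is ASCII-only, so "ü" is treated as a separator.
ascii_words = text.split(/\W+/)

# [[:word:]] is Unicode-aware, so "ü" stays inside the word.
word_words = text.split(/[^[:word:]]+/)

# \p{Latin} keeps every letter of the Latin script, including "ü".
latin_words = text.split(/[^\p{Latin}]+/)

puts ascii_words.last(2).inspect  # ["Wirtschaftspr", "fer"]
puts word_words.last
puts latin_words.last

# Not limited to German (hypothetical sample sentence):
french = "déjà vu à Genève"
french_words = french.split(/[^[:word:]]+/)
puts french_words.length          # 4
```

On Ruby 1.9 the `# encoding: utf-8` magic comment has to be the first line of the file; from Ruby 2.0 on, UTF-8 is the default source encoding.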

See http://www.ruby-doc.org/core-1.9.3/Regexp.html and look for (1) "Character Classes" and (2) "Character Properties".
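As a rough illustration of how the bracket class and the script property differ, the sketch below splits a made-up sample string (not from the question) that mixes letters, an underscore and a digit. The [[:word:]] class counts digits and the underscore as word characters, while \p{Latin} covers only Latin-script letters.

```ruby
# encoding: utf-8
sample = "Prüfer_2 Ärger"  # hypothetical input, chosen to include "_" and a digit

# Character class: letters, digits and "_" all count as word characters.
by_word_class = sample.split(/[^[:word:]]+/)

# Character property: only Latin-script letters survive; "_2 " acts as a separator.
by_latin_prop = sample.split(/[^\p{Latin}]+/)

puts by_word_class.length  # 2
puts by_latin_prop.length  # 2
puts by_word_class.first   # keeps the "_2" suffix
puts by_latin_prop.first   # drops the "_2" suffix
```

Which of the two fits better depends on whether digits and underscores should be kept as part of a word.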