Should we consider using the range [a-z] a bug?

w.k*_*w.k 22 regex unicode perl locale pcre

In my locale (et_EE), [a-z] means:

abcdefghijklmnopqrsšz

So 6 ASCII characters (tuvwxy) and one character from the Estonian alphabet (ž) are not included. I see a lot of modules that still use regexes like

/\A[0-9A-Z_a-z]+\z/

To me this seems the wrong way to define a range of ASCII alphanumeric characters, and I think it should be replaced with:

/\A\p{PosixAlnum}+\z/

Is the first one still considered the idiomatic way? Or an accepted solution? Or a bug?

Or does the last one have some caveats?

Dav*_* W. 8

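Back in the old Perl 3.0 days, everything was ASCII, and Perl reflected that. \w meant the same thing as [0-9A-Z_a-z]. And, we liked it!

However, Perl is no longer bound to ASCII. I stopped using [a-z] a while ago because I got yelled at when programs I wrote didn't work with languages that aren't English. You must imagine my surprise, as an American, to discover that there are at least several thousand people in this world who don't speak English.

Perl has better ways of handling [0-9A-Z_a-z] anyway. You can use the [[:alnum:]] set or simply use \w, which should do the right thing. If you must only have lowercase characters, you can use [[:lower:]] instead of [a-z] (which assumes an English type of language). (Perl goes to some lengths to have [a-z] mean just the 26 characters a, b, c, ... z, even on EBCDIC platforms.)

If you need to specify ASCII only, you can add the /a qualifier. If you mean locale-specific, you should compile the regex within the lexical scope of a 'use locale'. (Avoid the /l modifier, as it applies only to the regular expression pattern and nothing else. For example, in 's/[[:lower:]]/\U$&/lg', the pattern is compiled using locale, but the \U is not. This probably should be considered a bug in Perl, but it is the way things currently work. The /l modifier is really only intended for internal bookkeeping and shouldn't be typed in directly.) Actually, it is better to translate your locale data upon input into the program, and translate it back on output, while using Unicode internally. If your locale is one of the new-fangled UTF-8 ones, a feature new in 5.16, 'use locale ":not_characters"', is available to allow the other portions of your locale to work seamlessly in Perl.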

$word =~ /^[[:alnum:]]+$/;  # $word contains only POSIX alphanumeric characters.
$word =~ /^[[:alnum:]]+$/a; # $word contains only ASCII alphanumeric characters.
{ use locale;
  $word =~ /^[[:alnum:]]+$/;# $word contains only alphanum characters for your locale
}
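
The advice to convert at the boundaries and use Unicode internally is not specific to Perl. As a rough sketch in Python (picked only because it is one of the languages tested further down), assuming the locale data arrives as bytes in the Windows Baltic code page cp1257. The sample word and the helper name shout are made up for illustration:

```python
import re

LEGACY_ENCODING = "cp1257"  # assumption: the encoding your locale data arrives in

def shout(raw: bytes) -> bytes:
    """Decode at the boundary, work on Unicode text, encode again on output."""
    text = raw.decode(LEGACY_ENCODING)
    # For str patterns, \w is Unicode-aware by default in Python 3.
    if re.fullmatch(r"\w+", text) is None:
        raise ValueError("input is not a single word")
    return text.upper().encode(LEGACY_ENCODING)

raw_in = "žürii".encode(LEGACY_ENCODING)  # stands in for bytes read from a file
raw_out = shout(raw_in)
print(raw_out)  # upper-cased word, still in the legacy encoding
```

All matching and case conversion happen on decoded text, so the result does not depend on the process locale.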

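Now, is this a bug? If the program doesn't work as intended, it's a bug, plain and simple. If you really want the ASCII character sequence [a-z], then the programmer should have used [[:lower:]] with the /a qualifier. If you want all possible lowercase characters, including those in other languages, you should simply use [[:lower:]].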


eis*_*eis 6

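As this question goes beyond Perl, I was interested in finding out how it goes in general. Testing this on popular programming languages with native regex support (Perl, PHP, Python, Ruby, Java and Javascript), the conclusions are:

  • [a-z] will always match the ASCII-7 a-z range in each of those languages, and locale settings do not affect it in any way. Characters like ž and š are never matched.
  • \w might or might not match the characters ž and š, depending on the programming language and the parameters given when creating the regular expression. The variety is greatest for this expression: in some languages they are never matched irrespective of options, in others they are always matched, and in some it depends.
  • POSIX [[:alpha:]] and Unicode \p{Alpha} and \p{L}, if they are supported by the regular expression system of the programming language in question and appropriate configuration is used, will match characters like ž and š.

Note that "appropriate configuration" did not require a locale change: changing the locale had no impact on the results in any of the tested systems.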
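
As a quick illustration of the first two bullets, here is a minimal Python 3 sketch. The standard re module has no \p{L} or [[:alpha:]], so only [a-z] and \w are shown, and the helper name matches is made up for this example:

```python
import re

def matches(pattern: str, char: str, flags: int = 0) -> bool:
    """Return True if the whole of `char` matches `pattern`."""
    return re.fullmatch(pattern, char, flags) is not None

results = {}
for char in ("v", "š", "ž"):
    results[char] = (
        matches(r"[a-z]", char),         # plain ASCII range
        matches(r"\w", char),            # Unicode-aware by default for str patterns
        matches(r"\w", char, re.ASCII),  # \w restricted to ASCII
    )

for char, row in results.items():
    print(ascii(char), row)
```

Only "v" matches in all three columns. The other two characters match \w alone, and only while the ASCII flag is absent.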

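To be on the safe side, I also tested command-line Perl, grep and awk. Of those, command-line Perl behaves identically to the above. However, the versions of grep and awk I have seem to behave differently from the others, as for them locale matters also for [a-z]. The behaviour is also version- and implementation-specific.

In this context (grep, awk or similar command-line tools) I would agree that using an a-z range without defining the locale could be considered a bug, as you can't really know what you will end up with.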


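If we go into more detail per language, the status seems to be:

Java

In Java, \p{Alpha} works like [a-z] if the unicode class is not specified, and like a unicode alphabetical character if it is, matching ž. \w will match characters like ž if the unicode flag is present and not if it isn't, and \p{L} will match regardless of the unicode flag. There are no locale-aware regular expressions and no support for [[:alpha:]].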

PHP

In PHP, \w, [[:alpha:]] and \p{L} will match characters like ž if the unicode switch is present, and not if it isn't. \p{Alpha} is not supported. Locale has no impact on regular expressions.

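Python

\w will match the mentioned characters if the unicode flag is present and the locale flag is not present. For unicode strings, the unicode flag is assumed by default if Python 3 is used, but not with Python 2. Unicode \p{Alpha}, \p{L} and POSIX [[:alpha:]] are not supported in Python.

The modifier for using locale-specific regular expressions apparently only works for character sets with 1 byte per character, making it unusable for unicode.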
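
Newer Python versions enforce this restriction when the pattern is compiled. A small sketch, assuming Python 3.6 or later (not the 3.5.3 used for the tests below), where re.LOCALE is rejected for str patterns and accepted only for bytes patterns. The helper name locale_flag_accepted is made up for this example:

```python
import re

def locale_flag_accepted(pattern) -> bool:
    """Report whether re.LOCALE can be combined with this pattern."""
    try:
        re.compile(pattern, re.LOCALE)
        return True
    except ValueError:
        # Raised for str patterns on Python 3.6 and later.
        return False

print(locale_flag_accepted(r"\w"))   # str pattern
print(locale_flag_accepted(rb"\w"))  # bytes pattern
```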

Perl

\w matches the previously mentioned characters in addition to matching [a-z]. Unicode \p{Letter}, \p{Alpha} and POSIX [[:alpha:]] are supported and work as expected. Unicode and locale flags for the regular expression didn't change the results, and neither did a change of locale or use locale;/no locale;.

Behaviour does not change if we run the tests using command-line Perl.

Ruby

[a-z] and \w detect just the characters [a-z], irrespective of options. Unicode \p{Letter}, \p{Alpha} and POSIX [[:alpha:]] are supported and work as expected. Locale does not have an impact.

Javascript

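[a-z] and \w always detect just the characters [a-z]. ECMA2015 has support for the /u unicode switch, which is mostly supported by major browsers, but it doesn't bring support for [[:alpha:]], \p{Alpha} or \p{L}, and it doesn't change the behaviour of \w. The unicode switch does make a unicode character be treated as one character, which used to be a problem.

The situation is the same for client-side javascript and for Node.js.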

AWK

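For AWK, there is a longer description of the status in the article A.8 Regexp Ranges and Locales: A Long Sad Story. It details that in the old world of unix tools, [a-z] was the correct way to detect lowercase letters, and this is how the tools of the time worked. However, POSIX 1992 introduced locales and changed the interpretation of character classes so that the order of characters was defined per collation order, binding it to the locale. This was adopted by the AWK of the time as well (the 3.x series), which resulted in several issues. By the time the 4.x series was developed, POSIX 2008 had defined the order as undefined, and the maintainer reverted to the original behaviour.

Nowadays the 4.x version of AWK is mostly used. With it, [a-z] matches a-z ignoring any locale changes, while \w and [[:alpha:]] will match locale-specific characters. Unicode \p{Alpha} and \p{L} are not supported.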

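grep

Grep (as well as sed and ed) uses GNU Basic Regular Expressions, which is an old flavour. It has no support for unicode character classes.

At least gnu grep 2.16 and 2.25 seem to follow the 1992 posix in that locale matters also for [a-z], as well as for \w and [[:alpha:]]. This means, for example, that [a-z] only matches z in the set xuzvöä if the Estonian locale is used.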
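
To see why a collation-based reading of [a-z] gives such results, here is an illustrative Python sketch. It does not call into the system locale at all: it hard-codes the lowercase Estonian alphabet in its usual alphabetical order (an assumption made for this example, as is the helper name collation_range) and picks everything that sorts between the two endpoints:

```python
# Lowercase Estonian alphabet in alphabetical order (note that z sorts before t).
ESTONIAN_ORDER = "abcdefghijklmnopqrsšzžtuvwõäöüxy"

def collation_range(first: str, last: str, order: str = ESTONIAN_ORDER) -> str:
    """Characters that sort between `first` and `last` (inclusive) in `order`."""
    return order[order.index(first):order.index(last) + 1]

in_range = collation_range("a", "z")
print(ascii(in_range))
print([ch for ch in "xuzvöä" if ch in in_range])
```

Under this ordering the range from a to z stops at z, which leaves out t, u, v, w, x and y as well as ž. That matches the list given in the question.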


The test code used for each language is listed below.

Java (1.8.0_131)

import java.util.regex.*;
import java.util.Locale;

public class RegExpTest {
    public static void main(String args[]) {
        verify("v", 118);
        verify("š", 353);
        verify("ž", 382);

        tryWith("v");
        tryWith("š");
        tryWith("ž");
    }
    static void tryWith(String input) {
        matchWith("[a-z]", input);
        matchWith("\\w", input);
        matchWith("\\p{Alpha}", input);
        matchWith("\\p{L}", input);
        matchWith("[[:alpha:]]", input);
    }

    static void matchWith(String pattern, String input) {
        printResult(Pattern.compile(pattern), input);
        printResult(Pattern.compile(pattern, Pattern.UNICODE_CHARACTER_CLASS), input);
    }
    static void printResult(Pattern pattern, String input) {
        System.out.printf("%s\t%03d\t%5s\t%-10s\t%-10s\t%-5s%n",
          input, input.codePointAt(0), Locale.getDefault(),
          specialFlag(pattern.flags()),
          pattern, pattern.matcher(input).matches());
    }
    static String specialFlag(int flags) {
      if ((flags & Pattern.UNICODE_CHARACTER_CLASS) == Pattern.UNICODE_CHARACTER_CLASS) {
          return "UNICODE_FLAG";
      }
      return "";
    }
    static void verify(String str, int code) {
        if (str.codePointAt(0) != code) {
            throw new RuntimeException("your editor is not properly configured for this character: " + str);
        }
    }
}

PHP (7.1.5)

<?php
/*
PHP, even with 7, only has binary strings that can be operated with unicode-aware
functions, if needed. So functions operating them need to be told which charset to use.

When there is encoding assumed and not specified, PHP defaults to ISO-8859-1.
*/


// PHP7 and extension=php_intl.dll enabled in PHP.ini is needed for IntlChar class
function codepoint($char) {
  return IntlChar::ord($char);
}

function verify($inputp, $code) {
  if (codepoint($inputp) != $code) {
    throw new Exception(sprintf('Your editor is not configured correctly for %s (result %s, should be %s)',
      $inputp, codepoint($inputp), $code));
  }
}

$rowindex = 0;
$origlocale = getlocale();

verify('v', 118);
verify('š', 353); // https://en.wikipedia.org/wiki/%C5%A0#Computing_code
verify('ž', 382); // https://en.wikipedia.org/wiki/%C5%BD#Computing_code

function tryWith($input) {
  matchWith('[a-z]', $input);
  matchWith('\\w', $input);
  matchWith('[[:alpha:]]', $input); // POSIX, http://www.regular-expressions.info/posixbrackets.html
  matchWith('\p{L}', $input);
}
function matchWith($pattern, $input) {
  global $origlocale;
  selectLocale($origlocale);
  printResult("/^$pattern\$/", $input);
  printResult("/^$pattern\$/u", $input);
  selectLocale('C'); # default (root) locale
  printResult("/^$pattern\$/", $input);
  printResult("/^$pattern\$/u", $input);
  selectLocale(['et_EE', 'et_EE.UTF-8', 'Estonian_Estonia.1257']);
  printResult("/^$pattern\$/", $input);
  printResult("/^$pattern\$/u", $input);
  selectLocale($origlocale);
}
function selectLocale($locale) {
  if (!is_array($locale)) {
    $locale = [$locale];
  }
  // On Windows, no UTF-8 locale can be set
  // https://stackoverflow.com/a/16120506/365237
  // https://msdn.microsoft.com/en-us/library/x99tb11d.aspx
  // Available Windows locales
  // https://docs.moodle.org/dev/Table_of_locales
  $retval = setlocale(LC_ALL, $locale);
  //printf("setting locale %s, retval was %s\n", join(',', $locale), $retval);
  if ($retval === false || $retval === null) {
    throw new Exception(sprintf('Setting locale %s failed', join(',', $locale)));
  }
}
function getlocale() {
  return setlocale(LC_ALL, 0);
}
function printResult($pattern, $input) {
  global $rowindex;
  printf("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s\n",
        $rowindex, $input, codepoint($input), getlocale(),
        specialFlag($pattern), 
        $pattern, (preg_match($pattern, $input) === 1)?'true':'false');
  $rowindex = $rowindex + 1;
}
function specialFlag($pattern) {
  $arr = explode('/',$pattern);
  $lastelem = array_pop($arr);
  if (strpos($lastelem, 'u') !== false) {
    return 'UNICODE';
  }
  return '';
}

tryWith('v');
tryWith('š');
tryWith('ž');

Python (3.5.3)

# -*- coding: utf-8 -*-

# with python, there are two strings: unicode strings and regular ones.
# when you use unicode strings, regular expressions also take advantage of it,
# so no need to tell that separately. However, if you want to be using specific
# locale, that you need to tell.

# Note that python3 regexps defaults to unicode mode if unicode regexp string is used,
# python2 does not. Also strings are unicode strings in python3 by default.

# summary: [a-z] is always [a-z], \w will match if unicode flag is present and
# locale flag is not present, no unicode \p{Letter} or POSIX :alpha: exists.
# Letters outside ascii-7 never match \w if locale-specific
# regexp is used, as it only supports charsets with one byte per character
# (https://lists.gt.net/python/python/850772).

# Note that in addition to standard https://docs.python.org/3/library/re.html, more
# complete https://pypi.python.org/pypi/regex/ third-party regexp library exists.

import re, locale

def verify(inputp, code):
  if (ord(inputp[0]) != code):
    raise Exception('Your editor is not configured correctly for %s (result %s)' % (inputp, ord(inputp[0])))
  return

rowindex = 0
origlocale = locale.getlocale(locale.LC_ALL)  

verify(u'v', 118)
verify(u'š', 353)
verify(u'ž', 382)

def tryWith(input):
  matchWith(u'[a-z]', input)
  matchWith(u'\\w', input)

def matchWith(pattern, input):
  global origlocale
  locale.setlocale(locale.LC_ALL, origlocale)
  printResult(re.compile(pattern), input)
  printResult(re.compile(pattern, re.UNICODE), input)
  printResult(re.compile(pattern, re.UNICODE | re.LOCALE), input)

  matchWith2(pattern, input, 'C') # default (root) locale
  matchWith2(pattern, input, 'et_EE')
  matchWith2(pattern, input, 'et_EE.UTF-8')
  matchWith2(pattern, input, 'Estonian_Estonia.1257') # Windows locale
  locale.setlocale(locale.LC_ALL, origlocale)

def matchWith2(pattern, input, localeParam):
  try:
    locale.setlocale(locale.LC_ALL, localeParam) # switch to the locale under test
    printResult(re.compile(pattern), input)
    printResult(re.compile(pattern, re.UNICODE), input)
    printResult(re.compile(pattern, re.UNICODE | re.LOCALE), input)
  except locale.Error:
    print("Locale %s not supported on this platform" % localeParam)

def printResult(pattern, input):
  global rowindex
  try:
    print("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s" % \
          (rowindex, input, ord(input[0]), locale.getlocale(), \
          specialFlag(pattern.flags), \
          pattern.pattern, pattern.match(input) != None))
  except UnicodeEncodeError:
    print("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s" % \
          (rowindex, '?', ord(input[0]), locale.getlocale(), \
          specialFlag(pattern.flags), \
          pattern.pattern, pattern.match(input) != None))
  rowindex = rowindex + 1      

def specialFlag(flags):
  ret = []
  if ((flags & re.UNICODE) == re.UNICODE):
    ret.append("UNICODE_FLAG")
  if ((flags & re.LOCALE) == re.LOCALE):
    ret.append("LOCALE_FLAG")
  return ','.join(ret)

tryWith(u'v')
tryWith(u'š')
tryWith(u'ž')

Perl (v5.22.3)

# Summary: [a-z] is always [a-z], \w always seems to recognize given test chars and
# unicode \p{Letter}, \p{Alpha} and POSIX :alpha: are supported.
# Unicode and locale flags for regular expression didn't matter in this use case.

use warnings;
use strict;
use utf8;
use v5.14;
use POSIX qw(locale_h);
use Encode;
binmode STDOUT, "utf8";

sub codepoint {
  my $inputp = $_[0];
  return unpack('U*', $inputp);
}
sub verify {
  my($inputp, $code) = @_;
  if (codepoint($inputp) != $code) {
    die sprintf('Your editor is not configured correctly for %s (result %s)', $inputp, codepoint($inputp))
  }
}

sub getlocale {
  return setlocale(LC_ALL);
}
my $rowindex = 0;
my $origlocale = getlocale();

verify('v', 118);
verify('š', 353);
verify('ž', 382);

# printf('orig locale is %s', $origlocale);

sub tryWith {
  my ($input) = @_;
  matchWith('[a-z]', $input);
  matchWith('\w', $input);
  matchWith('[[:alpha:]]', $input);
  matchWith('\p{Alpha}', $input);
  matchWith('\p{L}', $input);
}

sub matchWith {
  my ($pattern, $input) = @_;
  my @locales_to_test = ($origlocale, 'C','C.UTF-8', 'et_EE.UTF-8', 'Estonian_Estonia.UTF-8');
  for my $testlocale (@locales_to_test) {
    use locale;
    # printf("Testlocale %s\n", $testlocale);
    setlocale(LC_ALL, $testlocale);
    printResult($pattern, $input, '');
    printResult($pattern, $input, 'u');
    printResult($pattern, $input, 'l');
    printResult($pattern, $input, 'a');
   };
  no locale;
  setlocale(LC_ALL, $origlocale);
  printResult($pattern, $input, '');
  printResult($pattern, $input, 'u');
  printResult($pattern, $input, 'l');
  printResult($pattern, $input, 'a');
}


sub printResult{
  no warnings 'locale';
              # for this test, as we want to be able to test non-unicode-compliant locales as well
              # remove this for real usage

  my ($pattern, $input, $flags) = @_;
  my $regexp = qr/$pattern/;
  $regexp = qr/$pattern/u if ($flags eq 'u');
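  $regexp = qr/$pattern/l if ($flags eq 'l');
  $regexp = qr/$pattern/a if ($flags eq 'a');  # without this, the 'a' rows silently use the default flags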
  printf("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s\n", 
        $rowindex, $input, codepoint($input), getlocale(),
        $flags, $pattern, (($input =~ $regexp) ? 'true':'false'));
  $rowindex = $rowindex + 1;
}

tryWith('v');
tryWith('š');
tryWith('ž');

Ruby (ruby 2.2.6p396 (2016-11-15 revision 56800) [x64-mingw32])

# -*- coding: utf-8 -*-

# Summary: [a-z] and \w are always [a-z], unicode \p{Letter}, \p{Alpha} and POSIX
# :alpha: are supported. Locale does not have impact.

# Ruby doesn't seem to be able to interact very well with locale without 'locale'
# rubygem (https://github.com/mutoh/locale), so that is used.

require 'rubygems'
require 'locale'

def verify(inputp, code)
  if (inputp.unpack('U*')[0] != code)
    raise Exception, sprintf('Your editor is not configured correctly for %s (result %s)', inputp, inputp.unpack('U*')[0])
  end
end

$rowindex = 0
$origlocale = Locale.current
$origcharmap = Encoding.locale_charmap

verify('v', 118)
verify('š', 353)
verify('ž', 382)

# printf('orig locale is %s.%s', $origlocale, $origcharmap)
def tryWith(input)
  matchWith('[a-z]', input)
  matchWith('\w', input)
  matchWith('[[:alpha:]]', input)
  matchWith('\p{Alpha}', input)
  matchWith('\p{L}', input)
end  

def matchWith(pattern, input)
  locales_to_test = [$origlocale, 'C', 'et_EE', 'Estonian_Estonia']
  for testlocale in locales_to_test
    Locale.current = testlocale
    printResult(Regexp.new(pattern), input)
    printResult(Regexp.new(pattern.force_encoding('utf-8'),Regexp::FIXEDENCODING), input)
  end
  Locale.current = $origlocale
end

def printResult(pattern, input)
  printf("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s\n", 
        $rowindex, input, input.unpack('U*')[0], Locale.current,
        specialFlag(pattern),
        pattern, !pattern.match(input).nil?)
  $rowindex = $rowindex + 1
end

def specialFlag(pattern)
  return pattern.encoding
end

tryWith('v')
tryWith('š')
tryWith('ž')

Javascript (node.js) (v6.10.3)

function match(pattern, input) {
    try {
        var re = new RegExp(pattern, "u");
        return input.match(re) !== null;
    } catch(e) {
        return 'unsupported';
    }
}
function regexptest() {
    var chars = [
        String.fromCodePoint(118),
        String.fromCodePoint(353),
        String.fromCodePoint(382)
    ];
    for (var i = 0; i < chars.length; i++) {
        var char = chars[i];
        console.log(
            char
            +'\t'
            + char.codePointAt(0)
            +'\t'
            +(match("[a-z]", char))
            +'\t'
            +(match("\\w", char))
            +'\t'
            +(match("[[:alpha:]]", char))
            +'\t'
            +(match("\\p{Alpha}", char))
            +'\t'
            +(match("\\p{L}", char))
            );
    }
}

regexptest();

Javascript (web browsers)

function match(pattern, input) {
    try {
        var re = new RegExp(pattern, "u");
        return input.match(re) !== null;
    } catch(e) {
        return 'unsupported';
    }
}
window.onload = function() {
    var chars = [
        String.fromCodePoint(118),
        String.fromCodePoint(353),
        String.fromCodePoint(382)
    ];
    for (var i = 0; i < chars.length; i++) {
        var char = chars[i];
        var table = document.getElementById('results');
        table.innerHTML += 
            '<tr><td>' + char
            +'</td><td>'
            + char.codePointAt(0)
            +'</td><td>'
            +(match("[a-z]", char))
            +'</td><td>'
            +(match("\\w", char))
            +'</td><td>'
            +(match("[[:alpha:]]", char))
            +'</td><td>'
            +(match("\\p{Alpha}", char))
            +'</td><td>'
            +(match("\\p{L}", char))
            +'</td></tr>';
    }
}

Javascript (web browsers), CSS

table {
    border-collapse: collapse;
}
table td, table th {
    border: 1px solid black;
}
table tr:first-child th {
    border-top: 0;
}
table tr:last-child td {
    border-bottom: 0;
}
table tr td:first-child,
table tr th:first-child {
    border-left: 0;
}
table tr td:last-child,
table tr th:last-child {
    border-right: 0;
}

Javascript (web browsers), HTML

<!DOCTYPE html> 
<html>
<head>
    <meta charset="utf-8" /> 
</head>
<body>
    <table id="results">
    <tr>
    	<td>char</td>
    	<td>codepoint</td>
    	<td>[a-z]</td>
    	<td>\w</td>
    	<td>[[:alpha:]]</td>
    	<td>\p{Alpha}</td>
    	<td>\p{L}</td>
    </tr>
    </table>
</body>
</html>

AWK (GNU Awk 4.1.3)

$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[a-z]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[a-z]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzvöä
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzvöä

AWK (GNU Awk 3.1.8)

$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[a-z]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[a-z]+",a)}END{print a[0]}'
z
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzvöä
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzvöä

grep (GNU grep 2.25)

$ echo xuzvöä | LC_ALL=C grep [a-z]
xuzvöä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep [a-z]
xuzvöä
$ echo xuzvöä | LC_ALL=C grep [[:alpha:]]
xuzvöä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep [[:alpha:]]
xuzvöä
$ echo xuzvöä | LC_ALL=C grep \\w
xuzvöä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep \\w
xuzvöä


Tod*_*obs 5

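Possible Locale Bug

The problem you're facing is not with POSIX character classes per se, but with the fact that the classes are dependent on locale. For example, regex(7) says:

Within a bracket expression, the name of a character class enclosed in "[:" and ":]" stands for the list of all characters belonging to that class... These stand for the character classes defined in wctype(3). A locale may provide others.

The emphasis is mine, but the manual page is clearly saying that the character classes are dependent on locale. Further, wctype(3) says:

The behavior of wctype() depends on the LC_CTYPE category of the current locale.

In other words, if your locale incorrectly defines a character class, then it's a bug that should be filed against the specific locale. On the other hand, if the character class simply defines the character set in a way that you are not expecting, then it may not be a bug; it may just be an issue that needs to be coded around.

Character Classes as Shortcuts

Character classes are shortcuts for defining sets. You certainly aren't restricted to the pre-defined sets for your locale, and you are free to use the Unicode character sets defined by perlre(1), or simply create the sets explicitly if that provides greater accuracy.

You already know this, so I'm not trying to be pedantic. I'm just pointing out that if you can't or won't fix the locale (which is the source of the problem here) then you should use an explicit set, as you have done.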
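
An explicit set can be spelled out in any regex flavour. A minimal sketch in Python, assuming the goal is to accept lowercase Estonian words. The name ESTONIAN_LOWER and the choice of extra letters are this example's assumptions, not something taken from the question:

```python
import re

# Explicit set: ASCII a-z plus the additional lowercase Estonian letters.
ESTONIAN_LOWER = re.compile(r"[a-zšžõäöü]+")

def is_estonian_lower(word: str) -> bool:
    """True if `word` consists only of characters from the explicit set."""
    return ESTONIAN_LOWER.fullmatch(word) is not None

print(is_estonian_lower("žürii"))  # every character is in the set
print(is_estonian_lower("Žürii"))  # uppercase Ž is not in the set
print(is_estonian_lower("naïve"))  # ï is not in the set
```

The set matches exactly what is written in it, whatever the locale or regex flags.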

A convenience class is only convenient if it works for your use case. If it doesn't, toss it overboard!

  • You should add the fact that `[a-z]` is a character range, not a class. `[:alpha:]` is using a class (2 upvotes)