Why does Java allow control characters in its identifiers?

tch*_*ist 49 java variables unicode

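While exploring exactly which characters are permitted in Java identifiers, I stumbled on something so curious that it seems almost certain to be a bug.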

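I expected to find that Java identifiers start with a character that has the Unicode property ID_Start, followed by characters that have the property ID_Continue, with an exception granted for leading underscores and for dollar signs. That turned out not to be the case, and what I found is at extreme variance with that, or with any other idea of a normal identifier I have ever heard of.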

Short Demo

Consider the following demonstration, which shows that the ASCII ESC character (octal 033) is permitted in Java identifiers:

$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: \033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[

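It is even worse than that, though. Almost infinitely worse, in fact. Even NULL is permitted! So are thousands of other code points that are not identifier characters at all. I have tested this on Solaris, Linux, and a Mac running Darwin, and all give the same results.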
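The same observation can be cross-checked without invoking the compiler, by asking the standard library directly. The sketch below is not part of the original experiment: the class name `ControlCharCheck` and the helper `describe` are made up for illustration, and it assumes that `Character.isJavaIdentifierPart(int)` agrees with what javac accepts.

```java
public class ControlCharCheck {
    /** Formats one code point together with the library's verdict on it. */
    static String describe(int cp) {
        return String.format("U+%04X identifierPart=%b",
                cp, Character.isJavaIdentifierPart(cp));
    }

    public static void main(String[] args) {
        // NUL and ESC are the cases reported in the question;
        // VT (U+000B) and U+001C are included for contrast.
        int[] samples = {0x0000, 0x000B, 0x001B, 0x001C};
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

If the assumption holds, NUL and ESC should be reported as identifier parts while U+000B and U+001C should not.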

Long Demo

Here is a test program that reports all of the unexpected code points Java permits as part of a legal identifier name.

#!/usr/bin/env perl
# 
# test-java-idchars - find which bogus code points Java allows in its identifiers
# 
#   usage: test-java-idchars [ [low] high ]
#   e.g.:  test-java-idchars 0 255
#
# Without arguments, tests Unicode code points
# from 0 .. 0x1000.  You may go further with a
# higher explicit argument.
#
# Produces a report at the end.
#
# You can ^C it prematurely to end the program then
# and get a report of its progress up to that point.
#
# Tom Christiansen
# tchrist@perl.com
# Sat Jan 29 10:41:09 MST 2011

use strict;
use warnings;

# NOTE: "use encoding" is fatal as of Perl 5.26; drop this line on newer perls.
use encoding "Latin1";
use open IO => ":utf8";

use charnames ();

$| = 1;

my @legal;

my ($start, $stop) = (0, 0x1000);

if (@ARGV != 0) {
    if (@ARGV == 1) {
        for (($stop) = @ARGV) { 
            $_ = oct if /^0/;   # support 0OCTAL, 0xHEX, 0bBINARY
        }
    }
    elsif (@ARGV == 2) {
        for (($start, $stop) = @ARGV) { 
            $_ = oct if /^0/;
        }
    } 
    else {
        die "usage: $0 [ [start] stop ]\n";
    } 
} 

for my $cp ( $start .. $stop ) {
    my $char = chr($cp);

    next if $char =~ /[\s\w]/;

    my $type = "?";
    for ($char) {
        $type = "Letter"      if /\pL/;
        $type = "Mark"        if /\pM/;
        $type = "Number"      if /\pN/;
        $type = "Punctuation" if /\pP/;
        $type = "Symbol"      if /\pS/;
        $type = "Separator"   if /\pZ/;
        $type = "Control"     if /\pC/;
    } 
    my $name = $cp ? (charnames::viacode($cp) || "<missing>") : "NULL";
    next if $name eq "<missing>" && $cp > 0xFF;
    my $msg = sprintf("U+%04X %s", $cp, $name);
    print "testing \\p{$type} $msg...";
    open(TESTPROGRAM, ">:utf8", "testchar.java") || die $!;

print TESTPROGRAM <<"End_of_Java_Program";

public class testchar { 
    public static void main(String argv[]) { 
        String var_$char = "variable name ends in $msg";
        System.out.println(var_$char); 
    }
}

End_of_Java_Program

    close(TESTPROGRAM) || die $!;

    system q{
        ( javac -encoding UTF-8 testchar.java \
            && \
          java -Dfile.encoding=UTF-8 testchar | grep variable \
        ) >/dev/null 2>&1
    };

    push @legal, sprintf("U+%04X", $cp) if $? == 0;

    if ($? && $? < 128) {
        print "<interrupted>\n";
        exit;  # from a ^C
    } 

    printf "is %s in Java identifiers.\n",  
        ($? == 0) ? uc "legal" : "forbidden";

} 

END { 
    print "Legal but evil code points: @legal\n";
}

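Here is an example of running that program over code points 0 through 0x20, skipping those that are whitespace or ordinary identifier characters: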

$ perl test-java-idchars 0 0x20
testing \p{Control} U+0000 NULL...is LEGAL in Java identifiers.
testing \p{Control} U+0001 START OF HEADING...is LEGAL in Java identifiers.
testing \p{Control} U+0002 START OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0003 END OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0004 END OF TRANSMISSION...is LEGAL in Java identifiers.
testing \p{Control} U+0005 ENQUIRY...is LEGAL in Java identifiers.
testing \p{Control} U+0006 ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0007 BELL...is LEGAL in Java identifiers.
testing \p{Control} U+0008 BACKSPACE...is LEGAL in Java identifiers.
testing \p{Control} U+000B LINE TABULATION...is forbidden in Java identifiers.
testing \p{Control} U+000E SHIFT OUT...is LEGAL in Java identifiers.
testing \p{Control} U+000F SHIFT IN...is LEGAL in Java identifiers.
testing \p{Control} U+0010 DATA LINK ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+0011 DEVICE CONTROL ONE...is LEGAL in Java identifiers.
testing \p{Control} U+0012 DEVICE CONTROL TWO...is LEGAL in Java identifiers.
testing \p{Control} U+0013 DEVICE CONTROL THREE...is LEGAL in Java identifiers.
testing \p{Control} U+0014 DEVICE CONTROL FOUR...is LEGAL in Java identifiers.
testing \p{Control} U+0015 NEGATIVE ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0016 SYNCHRONOUS IDLE...is LEGAL in Java identifiers.
testing \p{Control} U+0017 END OF TRANSMISSION BLOCK...is LEGAL in Java identifiers.
testing \p{Control} U+0018 CANCEL...is LEGAL in Java identifiers.
testing \p{Control} U+0019 END OF MEDIUM...is LEGAL in Java identifiers.
testing \p{Control} U+001A SUBSTITUTE...is LEGAL in Java identifiers.
testing \p{Control} U+001B ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+001C INFORMATION SEPARATOR FOUR...is forbidden in Java identifiers.
testing \p{Control} U+001D INFORMATION SEPARATOR THREE...is forbidden in Java identifiers.
testing \p{Control} U+001E INFORMATION SEPARATOR TWO...is forbidden in Java identifiers.
testing \p{Control} U+001F INFORMATION SEPARATOR ONE...is forbidden in Java identifiers.
Legal but evil code points: U+0000 U+0001 U+0002 U+0003 U+0004 U+0005 U+0006 U+0007 U+0008 U+000E U+000F U+0010 U+0011 U+0012 U+0013 U+0014 U+0015 U+0016 U+0017 U+0018 U+0019 U+001A U+001B

And here is another demo:

$ perl test-java-idchars 0x600 0x700 | grep -i legal
testing \p{Control} U+0600 ARABIC NUMBER SIGN...is LEGAL in Java identifiers.
testing \p{Control} U+0601 ARABIC SIGN SANAH...is LEGAL in Java identifiers.
testing \p{Control} U+0602 ARABIC FOOTNOTE MARKER...is LEGAL in Java identifiers.
testing \p{Control} U+0603 ARABIC SIGN SAFHA...is LEGAL in Java identifiers.
testing \p{Control} U+06DD ARABIC END OF AYAH...is LEGAL in Java identifiers.
Legal but evil code points: U+0600 U+0601 U+0602 U+0603 U+06DD

The Question

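Can anyone please explain this seemingly insane behavior? There are many, many other inexplicably permitted code points all over the place, starting with U+0000, which is perhaps the strangest of all. If you run the program on the first 0x1000 code points, you do see certain patterns emerge, such as allowing any and all code points with the property Currency_Symbol. But too much else is completely inexplicable, at least to me.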
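One way to look for such patterns is to group the accepted code points by Unicode general category. The following is only a sketch, not the experiment described above: it consults `Character.isJavaIdentifierPart` instead of compiling a test file for each code point, the names `IdentifierPartSurvey`, `categoryName`, and `survey` are invented here, and the counts beyond ASCII depend on the Unicode version shipped with the JDK in use.

```java
import java.util.Map;
import java.util.TreeMap;

public class IdentifierPartSurvey {
    /** Maps a Character.getType() value to a readable category label. */
    static String categoryName(int type) {
        switch (type) {
            case Character.CONTROL:                return "Control (Cc)";
            case Character.FORMAT:                 return "Format (Cf)";
            case Character.CURRENCY_SYMBOL:        return "Currency_Symbol (Sc)";
            case Character.CONNECTOR_PUNCTUATION:  return "Connector_Punctuation (Pc)";
            case Character.NON_SPACING_MARK:       return "Nonspacing_Mark (Mn)";
            case Character.COMBINING_SPACING_MARK: return "Spacing_Mark (Mc)";
            case Character.LETTER_NUMBER:          return "Letter_Number (Nl)";
            default:                               return "other (type " + type + ")";
        }
    }

    /** Counts, per category, the non-letter, non-digit code points accepted as identifier parts. */
    static Map<String, Integer> survey(int low, int high) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int cp = low; cp <= high; cp++) {
            if (Character.isLetterOrDigit(cp)) continue;       // unsurprising cases
            if (!Character.isJavaIdentifierPart(cp)) continue;  // rejected by the library
            counts.merge(categoryName(Character.getType(cp)), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same default range as the Perl script: 0 .. 0x1000.
        survey(0, 0x1000).forEach((category, n) ->
                System.out.println(category + ": " + n));
    }
}
```

Unlike the Perl script, this survey also counts combining marks and connector punctuation, because it skips only letters and digits.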

nin*_*alj 15

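Section 3.8 of the Java Language Specification defers to Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart(). The latter, among other conditions, accepts anything for which Character.isIdentifierIgnorable() is true, and that method permits non-whitespace control characters (including the entire C1 range; see the method's documentation for the list).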
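To see which code points that last condition lets in, the ignorable ranges can be listed directly. This is a minimal sketch using only `java.lang.Character`; the class name `IgnorableRanges` and the helper `ignorableRanges` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class IgnorableRanges {
    /** Collects maximal runs of identifier-ignorable code points in [low, high]. */
    static List<String> ignorableRanges(int low, int high) {
        List<String> ranges = new ArrayList<>();
        int runStart = -1;
        // Iterate one past 'high' so that a run ending at 'high' is still closed.
        for (int cp = low; cp <= high + 1; cp++) {
            boolean ignorable = cp <= high && Character.isIdentifierIgnorable(cp);
            if (ignorable && runStart < 0) {
                runStart = cp;
            } else if (!ignorable && runStart >= 0) {
                ranges.add(String.format("U+%04X..U+%04X", runStart, cp - 1));
                runStart = -1;
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        System.out.println(ignorableRanges(0, 0xFF));
        // An ignorable character may continue an identifier but may not start one.
        System.out.println("ESC start: " + Character.isJavaIdentifierStart(0x1B));
        System.out.println("ESC part:  " + Character.isJavaIdentifierPart(0x1B));
    }
}
```

For the Latin-1 range, the runs U+0000..U+0008 and U+000E..U+001B line up with the code points the test program in the question reported as legal.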

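  • So once again Java has decided to define things in its own way, inconsistent with the Unicode Standard? I would really like to know why, **at the very least**, Unicode's "Default_Ignorable_Code_Point" property proved insufficient for their mysterious purposes, and why they had to invent a definition of their own that contradicts Unicode. It makes about as much sense as Java having its own notion of whitespace that differs from Unicode's, which it does. (9 upvotes)
  • @tchrist: I know I have not answered the "why", but seriously, only Java's creators can give you a definitive answer, so you should try to contact them. It should be easier for you than for the rest of us. (7 upvotes)
  • This does not explain the many baffling permitted code points, such as U+0602 ARABIC FOOTNOTE MARKER, U+070F SYRIAC ABBREVIATION MARK, and U+2062 INVISIBLE TIMES, all of which are Gc=Cf, a.k.a. General_Category=Format. Those are not `\w`-type characters! And what is it with the **invisible** ones, anyway? What are they smoking? You cannot tell which characters are significant and which are ignorable: strings that are not equal should not test equal. You might as well have identifiers that test equal not only case-insensitively but also with all vowels ignored. It would make just as much (non)sense! (2 upvotes)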

Avi*_*Avi 8

Another question might be: why shouldn't Java allow control characters in its identifiers?

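A good principle when designing a language or any other system is not to forbid anything without a good reason, because you never know how it may be used, and the fewer rules implementors and users have to contend with, the better.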

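It is true that you certainly shouldn't take advantage of this by actually embedding escapes in your variable names, and you are unlikely to see any published classes with null characters in their names.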

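Sure, this can be abused, but it is not the language designer's job to protect programmers from themselves in this way, any more than by forcing proper indentation or well-chosen variable names.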

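  • @Avi: There are plenty of reasons why Java should not allow control characters in its identifiers! For one, control characters are *not* ID_Start characters, they are *not* ID_Continue characters, **and** they are *not* Default_Ignorable_Code_Point characters. For another, they are invisible. And for a third, this is just plain sloppy. It is bad design! You end up with two unequal identifiers being treated as though they were the same, even though they are not. What a mess! (9 upvotes)
  • First, ID_Start and ID_Continue are, AFAIK, defined in UAX #31 (an annex to Unicode 5.0), which came after the Java language grammar was defined. Any change to the legal characters in Java at this point would be an unnecessary, non-backwards-compatible change. (4 upvotes)
  • @Avi: The fact that BS (not even going to start on CSI) is legal in Java could make code audits a lot more interesting. (4 upvotes)
  • @Avi: Knee-jerk invocations of "backwards contemptibility" do little to promote the recognition that standards exist. There are ways to conform to new standards while still compiling old code the old way. To ignore the future is to die. (2 upvotes)
  • @tchrist: Agreed, which is why I wrote "*unnecessary*, non-backwards-compatible". If it were a big problem, it would be worth breaking backwards compatibility or introducing stricter rules for new code, as was done for the `assert` and `enum` keywords when they were introduced. But it is not: who would ever think of putting control characters in source identifiers anyway, and if they did, who would care? (2 upvotes)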
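For anyone who does care, for instance during the kind of code audit mentioned in the comments, such characters are easy to flag mechanically. The helper below is hypothetical (`IgnorableAudit` and `findIgnorable` are names invented for this sketch) and only inspects a string; it makes no claim about how javac itself compares identifiers.

```java
import java.util.ArrayList;
import java.util.List;

public class IgnorableAudit {
    /** Hypothetical audit helper: reports each identifier-ignorable code point in a name. */
    static List<String> findIgnorable(String identifier) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < identifier.length(); ) {
            int cp = identifier.codePointAt(i);
            if (Character.isIdentifierIgnorable(cp)) {
                hits.add(String.format("U+%04X at index %d", cp, i));
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return hits;
    }

    public static void main(String[] args) {
        String plain = "var_";
        String sneaky = "var_\u001B"; // same name with a trailing ESC
        // As strings the two names differ, even though they may look alike on screen.
        System.out.println(plain.equals(sneaky));
        System.out.println(findIgnorable(plain));
        System.out.println(findIgnorable(sneaky));
    }
}
```

A check like this could be run over identifiers extracted by whatever tooling an audit already uses; how the names are extracted is outside the scope of the sketch.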