How to get the category name of a character type in Java?

Wic*_*koo 6 java unicode character

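Character.getType(int codePoint) returns an integer, but I cannot find a way to get the Unicode category name (such as "Lu" or "Cn") from it. What I want is a method like Character.getCategoryTypeName(int codePoint) that returns a string representing the type.

The category names appear in the comments on the constants. One approach is to write a switch over the returned type and hard-code each type name by hand. My original plan looked like this: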

for (int i = 0; i <= 0x10FFFF; i++) {
    switch (Character.getType(i)) {

        // General category "Sc" in the Unicode specification.
        // public static final byte CURRENCY_SYMBOL = 26;
        case Character.CURRENCY_SYMBOL:
            map.put(i, "Sc");
            break;

        ....
    }
}

But this would be very tedious. Is there an automatic way, or a library, to do the job?

Bas*_*que 3

As commented, Java does not currently seem to bundle such a feature. And as Marcano1234 commented, a feature request has been logged but not yet implemented.

Note to self: if I ever bump into Brian Goetz or Mark Reinhold at a conference, ask/beg/plead with them for a major revamp of code point and Unicode support in Java: Project Papyrus.

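I came up with a couple of ways to produce the two-letter name you want for each of the 30 "General Category" items defined by Unicode. (The Unicode spec calls this two-letter name an "alias".) One of my implementations simply hard-codes each alias in a switch. The other is more elaborate and defines a pair of enums.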

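I wrote both implementations purely as an exercise. I have not used them in production. I am not claiming they are the best route, but I hope they prove useful, or at least inspire someone to do better.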

Basic

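In Java 14+, use a switch expression over each of the "General category" constants defined on the Character class. If you are unfamiliar with these, see JEP 361: Switch Expressions.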

public String unicodeGeneralCategoryAliasForCodePoint ( int codePoint ) {
    return switch ( Character.getType( codePoint ) ) {

        // L, Letter
        case Character.UPPERCASE_LETTER -> "Lu";
        case Character.LOWERCASE_LETTER -> "Ll";
        case Character.TITLECASE_LETTER -> "Lt";
        case Character.MODIFIER_LETTER -> "Lm";
        case Character.OTHER_LETTER -> "Lo";

        // M, Mark
        case Character.NON_SPACING_MARK -> "Mn";
        case Character.COMBINING_SPACING_MARK -> "Mc";
        case Character.ENCLOSING_MARK -> "Me";

        // N, Number
        case Character.DECIMAL_DIGIT_NUMBER -> "Nd";
        case Character.LETTER_NUMBER -> "Nl";
        case Character.OTHER_NUMBER -> "No";

        // P, Punctuation
        case Character.CONNECTOR_PUNCTUATION -> "Pc";
        case Character.DASH_PUNCTUATION -> "Pd";
        case Character.START_PUNCTUATION -> "Ps";
        case Character.END_PUNCTUATION -> "Pe";
        case Character.INITIAL_QUOTE_PUNCTUATION -> "Pi";
        case Character.FINAL_QUOTE_PUNCTUATION -> "Pf";
        case Character.OTHER_PUNCTUATION -> "Po";

        // S, Symbol
        case Character.MATH_SYMBOL -> "Sm";
        case Character.CURRENCY_SYMBOL -> "Sc";
        case Character.MODIFIER_SYMBOL -> "Sk";
        case Character.OTHER_SYMBOL -> "So";

        // Z, Separator
        case Character.SPACE_SEPARATOR -> "Zs";
        case Character.LINE_SEPARATOR -> "Zl";
        case Character.PARAGRAPH_SEPARATOR -> "Zp";

        // C, Other
        case Character.CONTROL -> "Cc";
        case Character.FORMAT -> "Cf";
        case Character.SURROGATE -> "Cs";
        case Character.PRIVATE_USE -> "Co";
        case Character.UNASSIGNED -> "Cn";

        default -> "ERROR - Unexpected General Category type for code point " + codePoint + ". Message # 5d44e5fd-d60e-4b02-9431-ad57c56657f5.";
    };
}

Usage:

// Code point 65 is LATIN CAPITAL LETTER A, so the alias is "Lu".
String alias = x.unicodeGeneralCategoryAliasForCodePoint( 65 );
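
If switch expressions are not available (before Java 14), roughly the same mapping can be held in an array indexed by the value that Character.getType returns. This is only a sketch: the class name CategoryAliasTable and the method aliasOf are made up for illustration, and it assumes that Character.FINAL_QUOTE_PUNCTUATION is the largest of the general category constants.

```java
final class CategoryAliasTable {

    // Indexed by the int returned from Character.getType.
    // Assumption: FINAL_QUOTE_PUNCTUATION is the largest general category constant.
    private static final String[] ALIASES = new String[ Character.FINAL_QUOTE_PUNCTUATION + 1 ];

    private static void put ( byte type , String alias ) {
        ALIASES[ type ] = alias;
    }

    static {
        put( Character.UPPERCASE_LETTER , "Lu" );
        put( Character.LOWERCASE_LETTER , "Ll" );
        put( Character.TITLECASE_LETTER , "Lt" );
        put( Character.MODIFIER_LETTER , "Lm" );
        put( Character.OTHER_LETTER , "Lo" );
        put( Character.NON_SPACING_MARK , "Mn" );
        put( Character.COMBINING_SPACING_MARK , "Mc" );
        put( Character.ENCLOSING_MARK , "Me" );
        put( Character.DECIMAL_DIGIT_NUMBER , "Nd" );
        put( Character.LETTER_NUMBER , "Nl" );
        put( Character.OTHER_NUMBER , "No" );
        put( Character.CONNECTOR_PUNCTUATION , "Pc" );
        put( Character.DASH_PUNCTUATION , "Pd" );
        put( Character.START_PUNCTUATION , "Ps" );
        put( Character.END_PUNCTUATION , "Pe" );
        put( Character.INITIAL_QUOTE_PUNCTUATION , "Pi" );
        put( Character.FINAL_QUOTE_PUNCTUATION , "Pf" );
        put( Character.OTHER_PUNCTUATION , "Po" );
        put( Character.MATH_SYMBOL , "Sm" );
        put( Character.CURRENCY_SYMBOL , "Sc" );
        put( Character.MODIFIER_SYMBOL , "Sk" );
        put( Character.OTHER_SYMBOL , "So" );
        put( Character.SPACE_SEPARATOR , "Zs" );
        put( Character.LINE_SEPARATOR , "Zl" );
        put( Character.PARAGRAPH_SEPARATOR , "Zp" );
        put( Character.CONTROL , "Cc" );
        put( Character.FORMAT , "Cf" );
        put( Character.SURROGATE , "Cs" );
        put( Character.PRIVATE_USE , "Co" );
        put( Character.UNASSIGNED , "Cn" );
    }

    // Returns the two-letter alias, or throws if the type is not in the table.
    static String aliasOf ( int codePoint ) {
        if ( ! Character.isValidCodePoint( codePoint ) ) {
            throw new IllegalArgumentException( "Invalid code point: " + codePoint );
        }
        int type = Character.getType( codePoint );
        if ( type >= ALIASES.length || ALIASES[ type ] == null ) {
            throw new IllegalStateException( "No alias known for Character.getType value " + type );
        }
        return ALIASES[ type ];
    }

    public static void main ( String[] args ) {
        System.out.println( aliasOf( 'A' ) );
        System.out.println( aliasOf( '$' ) );
    }
}
```

Unlike the switch above, this version throws rather than returning an error string when it meets an unknown type, which may or may not suit your needs.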

Deluxe

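In this alternative approach, I define a pair of enums:

  • UnicodeGeneralCategory
    30 objects, one for each of the General Category items defined in the Unicode 13 spec, section 4.5, pages 170-172. These items are also listed on Wikipedia.
  • UnicodeMajorClass
    Groups the UnicodeGeneralCategory objects into the groups defined by the Unicode spec: Letter, Mark, Number, Punctuation, Symbol, Separator, and Other. That last one, "Other", is notable because it covers the non-printable "control" characters as well as the vast majority of code points, those not assigned to any character. When looping over all possible code points, we want to skip these.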
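
The relationship between the two levels can be sketched without the enums: the first letter of each two-letter alias is the major class. The class and method names below (MajorClassSketch, majorClassNameForAlias) are hypothetical and exist only for this illustration; Map.of requires Java 9 or later.

```java
import java.util.Map;

final class MajorClassSketch {

    // First letter of a General Category alias mapped to its major class name.
    private static final Map < Character, String > MAJOR_NAMES = Map.of(
            'L' , "Letter" ,
            'M' , "Mark" ,
            'N' , "Number" ,
            'P' , "Punctuation" ,
            'S' , "Symbol" ,
            'Z' , "Separator" ,
            'C' , "Other"
    );

    // Example: "Lu" starts with 'L', so the major class is "Letter".
    static String majorClassNameForAlias ( String alias ) {
        if ( alias == null || alias.isEmpty() ) {
            throw new IllegalArgumentException( "Alias must not be null or empty." );
        }
        String name = MAJOR_NAMES.get( alias.charAt( 0 ) );
        if ( name == null ) {
            throw new IllegalArgumentException( "Unknown major class for alias: " + alias );
        }
        return name;
    }

    public static void main ( String[] args ) {
        System.out.println( majorClassNameForAlias( "Lu" ) );
        System.out.println( majorClassNameForAlias( "Cn" ) );
    }
}
```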

The names of my UnicodeGeneralCategory enum objects are copied from the subset of constants declared on the Character class whose descriptions begin with the phrase "General category". These constant names differ somewhat from the official Unicode names, but are close enough. I defined them in the order in which they are listed in the Unicode spec.

package work.basil.unicode.category;

import java.util.Arrays;
import java.util.Optional;

// For more info about Unicode General Category, see section 4.5 of the Unicode 13.0 spec, pages 170-172.
// https://www.unicode.org/versions/Unicode13.0.0/ch04.pdf
public enum UnicodeGeneralCategory {

    // Wikipedia lists the General Category values defined in Unicode 13.
    // L, Letter
    UPPERCASE_LETTER( Character.UPPERCASE_LETTER , "Lu" , "Letter" , "uppercase" ),
    LOWERCASE_LETTER( Character.LOWERCASE_LETTER , "Ll" , "Letter" , "lowercase" ),
    TITLECASE_LETTER( Character.TITLECASE_LETTER , "Lt" , "Letter" , "titlecase" ),
    MODIFIER_LETTER( Character.MODIFIER_LETTER , "Lm" , "Letter" , "modifier" ),
    OTHER_LETTER( Character.OTHER_LETTER , "Lo" , "Letter" , "other" ),

    // M, Mark
    NON_SPACING_MARK( Character.NON_SPACING_MARK , "Mn" , "Mark" , "nonspacing" ),
    COMBINING_SPACING_MARK( Character.COMBINING_SPACING_MARK , "Mc" , "Mark" , "spacing combining" ),
    ENCLOSING_MARK( Character.ENCLOSING_MARK , "Me" , "Mark" , "enclosing" ),

    // N, Number
    DECIMAL_DIGIT_NUMBER( Character.DECIMAL_DIGIT_NUMBER , "Nd" , "Number" , "decimal digit" ),
    LETTER_NUMBER( Character.LETTER_NUMBER , "Nl" , "Number" , "letter" ),
    OTHER_NUMBER( Character.OTHER_NUMBER , "No" , "Number" , "other" ),

    // P, Punctuation
    CONNECTOR_PUNCTUATION( Character.CONNECTOR_PUNCTUATION , "Pc" , "Punctuation" , "connector" ),
    DASH_PUNCTUATION( Character.DASH_PUNCTUATION , "Pd" , "Punctuation" , "dash" ),
    START_PUNCTUATION( Character.START_PUNCTUATION , "Ps" , "Punctuation" , "open" ),
    END_PUNCTUATION( Character.END_PUNCTUATION , "Pe" , "Punctuation" , "close" ),
    INITIAL_QUOTE_PUNCTUATION( Character.INITIAL_QUOTE_PUNCTUATION , "Pi" , "Punctuation" , "initial quote" ),
    FINAL_QUOTE_PUNCTUATION( Character.FINAL_QUOTE_PUNCTUATION , "Pf" , "Punctuation" , "final quote" ),
    OTHER_PUNCTUATION( Character.OTHER_PUNCTUATION , "Po" , "Punctuation" , "other" ),

    // S, Symbol
    MATH_SYMBOL( Character.MATH_SYMBOL , "Sm" , "Symbol" , "math" ),
    CURRENCY_SYMBOL( Character.CURRENCY_SYMBOL , "Sc" , "Symbol" , "currency" ),
    MODIFIER_SYMBOL( Character.MODIFIER_SYMBOL , "Sk" , "Symbol" , "modifier" ),
    OTHER_SYMBOL( Character.OTHER_SYMBOL , "So" , "Symbol" , "other" ),

    // Z, Separator
    SPACE_SEPARATOR( Character.SPACE_SEPARATOR , "Zs" , "Separator" , "space" ),
    LINE_SEPARATOR( Character.LINE_SEPARATOR , "Zl" , "Separator" , "line" ),
    PARAGRAPH_SEPARATOR( Character.PARAGRAPH_SEPARATOR , "Zp" , "Separator" , "paragraph" ),

    // C, Other
    CONTROL( Character.CONTROL , "Cc" , "Other" , "control" ),
    FORMAT( Character.FORMAT , "Cf" , "Other" , "format" ),
    SURROGATE( Character.SURROGATE , "Cs" , "Other" , "surrogate" ),
    PRIVATE_USE( Character.PRIVATE_USE , "Co" , "Other" , "private use" ),
    UNASSIGNED( Character.UNASSIGNED , "Cn" , "Other" , "not assigned" );


    // Fields.
    private final byte characterClassConstantForGeneralCategory;
    private final String alias, major, minor;

    // Constructor.
    UnicodeGeneralCategory ( byte characterClassConstantForGeneralCategory , String alias , String major , String minor ) {
        this.characterClassConstantForGeneralCategory = characterClassConstantForGeneralCategory;
        this.alias = alias;
        this.major = major;
        this.minor = minor;
    }

    // Finds the enum object whose Character constant matches the type of the given code point.
    public static UnicodeGeneralCategory forCodePoint ( int codePoint ) {
        if ( ! Character.isValidCodePoint( codePoint ) ) {
            throw new IllegalArgumentException( "Code point " + codePoint + " is invalid. Must be within 0 to U+10FFFF ( 1,114,111 ) inclusive." );
        }
        final int type = Character.getType( codePoint );  // Look up once rather than once per enum object.
        Optional < UnicodeGeneralCategory > optionalUnicodeGeneralCategory = Arrays.stream( UnicodeGeneralCategory.values() ).filter( category -> category.characterClassConstantForGeneralCategory == type ).findAny();
        if ( optionalUnicodeGeneralCategory.isEmpty() ) {
            throw new IllegalStateException( "No general category defined in this enum matching `Character.getType( codePoint )`: " + type );
        } else {
            return optionalUnicodeGeneralCategory.get();
        }
    }

    // Finds the enum object for a two-letter alias such as "Lu".
    public static UnicodeGeneralCategory forAlias ( String abbrev ) {
        // Compare strings with equals, not ==, which would test object identity.
        Optional < UnicodeGeneralCategory > optionalUnicodeGeneralCategory = Arrays.stream( UnicodeGeneralCategory.values() ).filter( category -> category.alias.equals( abbrev ) ).findAny();
        if ( optionalUnicodeGeneralCategory.isEmpty() ) {
            throw new IllegalArgumentException( "No general category defined in this enum for abbreviation " + abbrev );
        } else {
            return optionalUnicodeGeneralCategory.get();
        }
    }

    // Getters
    public String getAlias () {
        return this.alias;
    }

    public String getMajor () {
        return this.major;
    }

    public String getMinor () {
        return this.minor;
    }

    public byte getCharacterClassConstant () {
        return this.characterClassConstantForGeneralCategory;
    }

    public String getDisplayName () {
        return this.alias + " – " + this.major + ", " + this.minor;
    }

}

Usage:

  • UnicodeGeneralCategory.forCodePoint( yourCodePointGoesHere ).getAlias() is what you asked for at the top of your Question.
  • UnicodeMajorClass.C_OTHER.coversCodePoint( codePoint ) lets you skip over those pesky non-printable/unassigned code points.
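
One way to sanity-check an alias produced by either approach is to lean on java.util.regex, whose Pattern class accepts the same two-letter aliases in \p{...} character classes. The sketch below is an assumption-laden cross-check rather than part of the enums above; the class name CategoryRegexCheck and the method inCategory are invented for the example.

```java
import java.util.regex.Pattern;

final class CategoryRegexCheck {

    // True if the regex engine agrees that the code point belongs to the
    // general category named by the alias, for example "Lu" or "Sc".
    static boolean inCategory ( int codePoint , String alias ) {
        String text = new String( Character.toChars( codePoint ) );
        return Pattern.matches( "\\p{" + alias + "}" , text );
    }

    public static void main ( String[] args ) {
        System.out.println( inCategory( 'A' , "Lu" ) );
        System.out.println( inCategory( 'A' , "Ll" ) );
    }
}
```

Compiling a pattern on every call is slow, so this is better suited to a unit test than to a loop over every code point.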

Also, to get a String containing the single character represented by a code point number, call Character.toString( codePoint ) (available in Java 11+).
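
A small sketch of that call, with a fallback for Java versions before 11. The class name CodePointText and both method names are hypothetical. Note that a code point above U+FFFF yields a String of two char values (a surrogate pair) that still counts as a single code point.

```java
final class CodePointText {

    // Java 11+: the int overload of Character.toString accepts any valid code point.
    static String textOf ( int codePoint ) {
        return Character.toString( codePoint );
    }

    // Fallback for older Java: build the String from the UTF-16 chars of the code point.
    static String textOfLegacy ( int codePoint ) {
        return new String( Character.toChars( codePoint ) );
    }

    public static void main ( String[] args ) {
        String letter = textOf( 65 );        // LATIN CAPITAL LETTER A
        String emoji = textOf( 0x1F600 );    // A code point outside the Basic Multilingual Plane
        System.out.println( letter + " has length " + letter.length() );
        System.out.println( emoji + " has length " + emoji.length() + " but code point count " + emoji.codePointCount( 0 , emoji.length() ) );
    }
}
```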

We can use this pair of enums to report on all the code points. The second enum, UnicodeMajorClass:

package work.basil.unicode.category;

import java.util.EnumSet;
import java.util.Set;

// Groups the general categories into the seven major classes defined by Unicode.
public enum UnicodeMajorClass {
    L_LETTER( "L" , "Letter" , EnumSet.of( UnicodeGeneralCategory.UPPERCASE_LETTER , UnicodeGeneralCategory.LOWERCASE_LETTER , UnicodeGeneralCategory.TITLECASE_LETTER , UnicodeGeneralCategory.MODIFIER_LETTER , UnicodeGeneralCategory.OTHER_LETTER ) ),
    M_MARK( "M" , "Mark" , EnumSet.of( UnicodeGeneralCategory.NON_SPACING_MARK , UnicodeGeneralCategory.COMBINING_SPACING_MARK , UnicodeGeneralCategory.ENCLOSING_MARK ) ),
    // The third member must be OTHER_NUMBER ("No"), not OTHER_LETTER.
    N_NUMBER( "N" , "Number" , EnumSet.of( UnicodeGeneralCategory.DECIMAL_DIGIT_NUMBER , UnicodeGeneralCategory.LETTER_NUMBER , UnicodeGeneralCategory.OTHER_NUMBER ) ),
    P_PUNCTUATION( "P" , "Punctuation" , EnumSet.of( UnicodeGeneralCategory.CONNECTOR_PUNCTUATION , UnicodeGeneralCategory.DASH_PUNCTUATION , UnicodeGeneralCategory.START_PUNCTUATION , UnicodeGeneralCategory.END_PUNCTUATION , UnicodeGeneralCategory.INITIAL_QUOTE_PUNCTUATION , UnicodeGeneralCategory.FINAL_QUOTE_PUNCTUATION , UnicodeGeneralCategory.OTHER_PUNCTUATION ) ),
    S_SYMBOL( "S" , "Symbol" , EnumSet.of( UnicodeGeneralCategory.MATH_SYMBOL , UnicodeGeneralCategory.CURRENCY_SYMBOL , UnicodeGeneralCategory.MODIFIER_SYMBOL , UnicodeGeneralCategory.OTHER_SYMBOL ) ),
    Z_SEPARATOR( "Z" , "Separator" , EnumSet.of( UnicodeGeneralCategory.SPACE_SEPARATOR , UnicodeGeneralCategory.LINE_SEPARATOR , UnicodeGeneralCategory.PARAGRAPH_SEPARATOR ) ),
    C_OTHER( "C" , "Other" , EnumSet.of( UnicodeGeneralCategory.CONTROL , UnicodeGeneralCategory.FORMAT , UnicodeGeneralCategory.SURROGATE , UnicodeGeneralCategory.PRIVATE_USE , UnicodeGeneralCategory.UNASSIGNED ) );

    private final String alias;
    private final String name;
    private final Set < UnicodeGeneralCategory > categories;

    UnicodeMajorClass ( String alias , String name , Set < UnicodeGeneralCategory > categories ) {
        this.alias = alias;
        this.name = name;
        this.categories = categories;
    }

    public String getAlias () {
        return alias;
    }

    public String getName () {
        return name;
    }

    public Set < UnicodeGeneralCategory > getCategories () {
        return categories;
    }

    public String getDisplayName () {
        return this.alias + " – " + this.name;
    }

    // True if the general category of the code point belongs to this major class.
    public boolean coversCodePoint ( int codePoint ) {
        return this.getCategories().contains( UnicodeGeneralCategory.forCodePoint( codePoint ) );
    }
}

To run the report:

package work.basil.text;

import work.basil.unicode.category.UnicodeMajorClass;

public class DumpCharacters {
    public static void main ( String[] args ) {
        System.out.println( "INFO - Demo starting. " );

        for ( int codePoint = 0 ; codePoint <= Character.MAX_CODE_POINT ; codePoint++ ) {
            if ( Character.isValidCodePoint( codePoint ) )    // If code point is valid.
            {
                if ( UnicodeMajorClass.C_OTHER.coversCodePoint( codePoint ) ) // If control, format, surrogate, private use, or unassigned.
                {
                    // No code needed. Skip over this code point as it is not a printable character.
                } else {
                    System.out.println( codePoint + " code point is named: " + Character.getName( codePoint ) + " = " + Character.toString( codePoint ) );
                }
            } else {
                System.out.println( "ERROR - Invalid code point number: " + codePoint );
            }
        }

        System.out.println( "INFO - Demo ending. " );
    }
}
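
To get a feel for how much of the code space that loop skips, one could tally the code points per category directly from Character.getType. This is a rough sketch independent of the enums above; the class name CategoryHistogram and the method countByType are invented here, and the array size assumes Character.FINAL_QUOTE_PUNCTUATION is the largest general category constant. The exact counts for most categories depend on the Unicode version your Java runtime supports.

```java
final class CategoryHistogram {

    // Returns an array where index = Character.getType value, element = number of code points.
    static int[] countByType () {
        // Assumption: FINAL_QUOTE_PUNCTUATION is the largest general category constant.
        int[] counts = new int[ Character.FINAL_QUOTE_PUNCTUATION + 1 ];
        for ( int codePoint = Character.MIN_CODE_POINT ; codePoint <= Character.MAX_CODE_POINT ; codePoint++ ) {
            counts[ Character.getType( codePoint ) ]++;
        }
        return counts;
    }

    public static void main ( String[] args ) {
        int[] counts = countByType();
        // The five categories that make up the major class "C, Other".
        int other = counts[ Character.CONTROL ] + counts[ Character.FORMAT ] + counts[ Character.SURROGATE ]
                + counts[ Character.PRIVATE_USE ] + counts[ Character.UNASSIGNED ];
        System.out.println( "Code points in C, Other: " + other );
        System.out.println( "Code points in all other classes: " + ( Character.MAX_CODE_POINT + 1 - other ) );
    }
}
```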