Reserved character codes in Unicode

Hab*_*wad 1 unicode character reserved

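Why does Unicode have several reserved character codes?
Look at the Unicode charts for two languages, Kannada and Tamil. Both languages are ancient, and I don't think there is any chance that new characters will be added to them.
Edit: So why do they waste code points by leaving some of them reserved?
Why don't they put the reserved code points at the end of each language's character set?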

tch*_*ist 5

This has to do with how the Unicode Consortium allocates its blocks, scripts, and code points. For example, in Block=Tamil, the beginning runs like this:

    $ unichars '\p{Block=Tamil}' | head -20
    U+00B82 ◌ஂ  GC=Mn SC=Tamil        TAMIL SIGN ANUSVARA
    U+00B83 ஃ  GC=Lo SC=Tamil        TAMIL SIGN VISARGA
    U+00B85 அ  GC=Lo SC=Tamil        TAMIL LETTER A
    U+00B86 ஆ  GC=Lo SC=Tamil        TAMIL LETTER AA
    U+00B87 இ  GC=Lo SC=Tamil        TAMIL LETTER I
    U+00B88 ஈ  GC=Lo SC=Tamil        TAMIL LETTER II
    U+00B89 உ  GC=Lo SC=Tamil        TAMIL LETTER U
    U+00B8A ஊ  GC=Lo SC=Tamil        TAMIL LETTER UU
    U+00B8E எ  GC=Lo SC=Tamil        TAMIL LETTER E
    U+00B8F ஏ  GC=Lo SC=Tamil        TAMIL LETTER EE
    U+00B90 ஐ  GC=Lo SC=Tamil        TAMIL LETTER AI
    U+00B92 ஒ  GC=Lo SC=Tamil        TAMIL LETTER O
    U+00B93 ஓ  GC=Lo SC=Tamil        TAMIL LETTER OO
    U+00B94 ஔ  GC=Lo SC=Tamil        TAMIL LETTER AU
    U+00B95 க  GC=Lo SC=Tamil        TAMIL LETTER KA
    U+00B99 ங  GC=Lo SC=Tamil        TAMIL LETTER NGA
    U+00B9A ச  GC=Lo SC=Tamil        TAMIL LETTER CA
    U+00B9C ஜ  GC=Lo SC=Tamil        TAMIL LETTER JA
    U+00B9E ஞ  GC=Lo SC=Tamil        TAMIL LETTER NYA
    U+00B9F ட  GC=Lo SC=Tamil        TAMIL LETTER TTA
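
For readers without the `unichars` tool, a rough equivalent of that listing can be sketched with Python's standard `unicodedata` module. This is only an approximation, not how `unichars` works: the block range U+0B80..U+0BFF is hard-coded as an assumption because `unicodedata` does not expose the Block property, and the helper name `assigned_in_block` is made up for this example.

```python
import unicodedata

# Assumption: the Tamil block spans U+0B80..U+0BFF. It is hard-coded here
# because the standard-library unicodedata module has no Block lookup.
TAMIL_BLOCK = range(0x0B80, 0x0C00)


def assigned_in_block(block):
    """Yield (code point, general category, name) for each named character."""
    for cp in block:
        ch = chr(cp)
        # name() returns the supplied default (None) for unassigned code points.
        name = unicodedata.name(ch, None)
        if name is not None:
            yield cp, unicodedata.category(ch), name


if __name__ == "__main__":
    # Print the first 20 assigned characters, like `| head -20` above.
    for cp, gc, name in list(assigned_in_block(TAMIL_BLOCK))[:20]:
        print(f"U+{cp:05X}  GC={gc}  {name}")
```

The exact set of characters depends on the Unicode version bundled with the Python interpreter, so later parts of the block may differ from older listings.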

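They tend to keep consecutive runs of 4, 8, or 16 code points reserved for characters that are all of the same "kind". Yes, there are gaps, but it's just like in a filesystem: once you allocate a sector (or a block, if you don't have individual sectors within blocks) to a file, you don't hand the unused bytes to some other process, even if the file doesn't use everything in its final sector. Things tend to get padded out to block boundaries anyway.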
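
To see where the gaps actually fall, here is a minimal Python sketch that collects the runs of unassigned code points (general category `Cn`) inside the Tamil block. The block range is again a hard-coded assumption, and `unassigned_runs` is a hypothetical helper name used only for illustration.

```python
import unicodedata

# Assumption: Tamil block range, hard-coded because unicodedata has no Block lookup.
TAMIL_BLOCK = range(0x0B80, 0x0C00)


def unassigned_runs(block):
    """Return (first, last) pairs for each maximal run of unassigned code points."""
    runs = []
    start = None  # first code point of the run currently being scanned
    prev = None   # most recent unassigned code point seen
    for cp in block:
        if unicodedata.category(chr(cp)) == "Cn":
            if start is None:
                start = cp
            prev = cp
        elif start is not None:
            # An assigned code point closes the current run.
            runs.append((start, prev))
            start = None
    if start is not None:
        # The block ended while still inside a run.
        runs.append((start, prev))
    return runs


if __name__ == "__main__":
    for first, last in unassigned_runs(TAMIL_BLOCK):
        print(f"U+{first:04X}..U+{last:04X}  ({last - first + 1} unassigned)")
```

The first few runs should line up with the holes visible in the listing above: U+0B80..U+0B81, U+0B84, and U+0B8B..U+0B8D.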


It's not as though we're at risk of running out of codes.


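Here is the start of the allocated region, which begins with the "Signs", as the first assigned code points in the block show. The gap probably marks the shift from one kind of character to another. If you examine the properties of the first six code points in the block (U+0B80 through U+0B85), you'll see that the unassigned ones still have the right Block property: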

    $ uniprops -a U+00B80 U+00B81 U+00B82 U+00B83 U+00B84 U+00B85
    U+0B80 ‹U+0B80› \N{U+0B80}
        \pC \p{Cn}
        All Any InTamil C Other Cn Unassigned Zzzz Unknown
        Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
           CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
           Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
           JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None
           Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX
           Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX
    U+0B81 ‹U+0B81› \N{U+0B81}
        \pC \p{Cn}
        All Any InTamil C Other Cn Unassigned Zzzz Unknown
        Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
           CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
           Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
           JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None
           Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX
           Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX
    U+0B82 ‹◌ஂ› \N{TAMIL SIGN ANUSVARA}
        \w \pM \p{Mn}
        All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil Case_Ignorable CI M Mn Gr_Ext Grapheme_Extend Graph GrExt ID_Continue IDC
           Mark Nonspacing_Mark Print Taml Word XID_Continue XIDC X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
        Age=1.1 Bidi_Class=Nonspacing_Mark BC=NSM Bidi_Class=NSM Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
           CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=EX
           Grapheme_Cluster_Break=Extend GCB=EX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
           JG=NoJoiningGroup Joining_Type=T Joining_Type=Transparent JT=T Line_Break=CM Line_Break=Combining_Mark LB=CM Numeric_Type=None NT=None
           Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
           Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
           Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=EX Sentence_Break=Extend SB=EX Word_Break=Extend WB=Extend
    U+0B83 ‹ஃ› \N{TAMIL SIGN VISARGA}
        \w \pL \p{L_} \p{Lo}
        All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil L Lo Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter
           L_ Other_Letter Print Taml Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
        Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
           Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
           Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
           JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
           Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
           Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
           Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE
           Word_Break=LE
    U+0B84 ‹U+0B84› \N{U+0B84}
        \pC \p{Cn}
        All Any InTamil C Other Cn Unassigned Zzzz Unknown
        Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
           CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
           Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
           JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None
           Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX
           Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX
    U+0B85 ‹அ› \N{TAMIL LETTER A}
        \w \pL \p{L_} \p{Lo}
        All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil L Lo Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter
           L_ Other_Letter Print Taml Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
        Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
           Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
           Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
           JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
           Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
           Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
           Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE
           Word_Break=LE
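
A much smaller version of the same check can be sketched in Python. It reports only the general category and name, since the standard `unicodedata` module exposes far fewer properties than `uniprops`. Block membership is decided here by a hard-coded range test, which is an assumption standing in for the real Block property, and `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(cp):
    """Summarise a code point: category, name, and (assumed) Tamil block membership."""
    ch = chr(cp)
    return {
        "code_point": f"U+{cp:04X}",
        # "Cn" means unassigned; assigned characters get Mn, Lo, and so on.
        "category": unicodedata.category(ch),
        # None for code points that have no character name.
        "name": unicodedata.name(ch, None),
        # Assumption: Tamil block is U+0B80..U+0BFF, tested by range only.
        "in_tamil_block": 0x0B80 <= cp <= 0x0BFF,
    }


if __name__ == "__main__":
    for cp in range(0x0B80, 0x0B86):
        print(describe(cp))
```

The point it illustrates is the one made above: a code point can be unassigned (`Cn`, no name) and still sit inside the block's range.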

If you look at other allocated blocks, you'll see the same thing. It wouldn't make sense to chop blocks up into unrelated pieces.


As I said, it's not as though they're going to run out of room, so I don't know what the problem is here.


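By the way, you can get Unicode exploration and processing tools such as unichars, uniprops, and uninames individually from my Unicode command-line toolbox, or get the whole suite from CPAN as Unicode::Tussle.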
