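This has to do with how the Unicode Consortium allocates its blocks, scripts, and code points. For example, in `Block=Tamil`, the start of the block runs like this: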

    $ unichars '\p{Block=Tamil}' | head -20
    U+00B82 ◌ஂ GC=Mn SC=Tamil TAMIL SIGN ANUSVARA
    U+00B83 ஃ GC=Lo SC=Tamil TAMIL SIGN VISARGA
    U+00B85 அ GC=Lo SC=Tamil TAMIL LETTER A
    U+00B86 ஆ GC=Lo SC=Tamil TAMIL LETTER AA
    U+00B87 இ GC=Lo SC=Tamil TAMIL LETTER I
    U+00B88 ஈ GC=Lo SC=Tamil TAMIL LETTER II
    U+00B89 உ GC=Lo SC=Tamil TAMIL LETTER U
    U+00B8A ஊ GC=Lo SC=Tamil TAMIL LETTER UU
    U+00B8E எ GC=Lo SC=Tamil TAMIL LETTER E
    U+00B8F ஏ GC=Lo SC=Tamil TAMIL LETTER EE
    U+00B90 ஐ GC=Lo SC=Tamil TAMIL LETTER AI
    U+00B92 ஒ GC=Lo SC=Tamil TAMIL LETTER O
    U+00B93 ஓ GC=Lo SC=Tamil TAMIL LETTER OO
    U+00B94 ஔ GC=Lo SC=Tamil TAMIL LETTER AU
    U+00B95 க GC=Lo SC=Tamil TAMIL LETTER KA
    U+00B99 ங GC=Lo SC=Tamil TAMIL LETTER NGA
    U+00B9A ச GC=Lo SC=Tamil TAMIL LETTER CA
    U+00B9C ஜ GC=Lo SC=Tamil TAMIL LETTER JA
    U+00B9E ஞ GC=Lo SC=Tamil TAMIL LETTER NYA
    U+00B9F ட GC=Lo SC=Tamil TAMIL LETTER TTA

They tend to keep contiguous runs of 4, 8, or 16 code points reserved for characters that are all of the same "kind". Yes, there are gaps, but it is just like a filesystem: once you have allocated a sector (or a block, if you don't have separate sectors within blocks) to a file, you don't hand the unused bytes over to some other process, even if the file doesn't use everything in its final sector. Things tend to be padded out to block boundaries anyway.
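
To make the gaps easier to see, here is a minimal Python sketch that groups the Tamil block into runs of assigned and unassigned code points. It is not part of the tools shown above: it assumes the block range U+0B80..U+0BFF from the Unicode `Blocks.txt` data file and uses the standard `unicodedata` module, treating general category `Cn` as "unassigned". The helper name `runs` is made up for this illustration, and the later runs depend on the Unicode version bundled with your Python.

```python
import unicodedata

# Range of the Tamil block (U+0B80..U+0BFF), as listed in Unicode's Blocks.txt.
TAMIL_BLOCK = range(0x0B80, 0x0C00)

def runs(code_points):
    """Group consecutive code points into (start, end, assigned) runs.

    A code point counts as unassigned when its general category is "Cn".
    """
    result = []
    for cp in code_points:
        assigned = unicodedata.category(chr(cp)) != "Cn"
        if result and result[-1][2] == assigned and result[-1][1] == cp - 1:
            # Same status as the previous code point: extend the current run.
            result[-1] = (result[-1][0], cp, assigned)
        else:
            # Status changed (or first code point): start a new run.
            result.append((cp, cp, assigned))
    return result

for start, end, assigned in runs(TAMIL_BLOCK):
    label = "assigned" if assigned else "gap"
    print(f"U+{start:04X}..U+{end:04X}  {end - start + 1:3d}  {label}")
```

The first few runs line up with the listing above: a two-code-point gap, the two signs, a one-code-point gap, and then the vowels starting at U+0B85.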

It's not as if we're at risk of running out of code points.
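
Here is the start of the allocated area, which begins with the "Signs", as the first assigned code points in the block show. The gaps probably mark a transition from one kind of character to another. If you check the properties of the first six code points in the block, you will see that the unassigned ones still have the correct Block property: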

    $ uniprops -a U+00B80 U+00B81 U+00B82 U+00B83 U+00B84 U+00B85
    U+0B80 ‹U+0B80› \N{U+0B80}
        \pC \p{Cn}
        All Any InTamil C Other Cn Unassigned Zzzz Unknown
        Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
        CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
        Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
        JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None
        Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX
        Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX
    U+0B81 ‹U+0B81› \N{U+0B81}
        \pC \p{Cn}
        All Any InTamil C Other Cn Unassigned Zzzz Unknown
        Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
        CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
        Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
        JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None
        Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX
        Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX
    U+0B82 ‹◌ஂ› \N{TAMIL SIGN ANUSVARA}
        \w \pM \p{Mn}
        All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil Case_Ignorable CI M Mn Gr_Ext Grapheme_Extend Graph GrExt ID_Continue IDC
        Mark Nonspacing_Mark Print Taml Word XID_Continue XIDC X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
        Age=1.1 Bidi_Class=Nonspacing_Mark BC=NSM Bidi_Class=NSM Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
        CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=EX
        Grapheme_Cluster_Break=Extend GCB=EX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
        JG=NoJoiningGroup Joining_Type=T Joining_Type=Transparent JT=T Line_Break=CM Line_Break=Combining_Mark LB=CM Numeric_Type=None NT=None
        Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
        Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
        Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=EX Sentence_Break=Extend SB=EX Word_Break=Extend WB=Extend
    U+0B83 ‹ஃ› \N{TAMIL SIGN VISARGA}
        \w \pL \p{L_} \p{Lo}
        All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil L Lo Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter
        L_ Other_Letter Print Taml Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
        Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
        Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
        Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
        JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
        Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
        Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
        Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE
        Word_Break=LE
    U+0B84 ‹U+0B84› \N{U+0B84}
        \pC \p{Cn}
        All Any InTamil C Other Cn Unassigned Zzzz Unknown
        Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
        CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
        Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
        JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None
        Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX
        Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX
    U+0B85 ‹அ› \N{TAMIL LETTER A}
        \w \pL \p{L_} \p{Lo}
        All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil L Lo Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter
        L_ Other_Letter Print Taml Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
        Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
        Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
        Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
        JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
        Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
        Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
        Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE
        Word_Break=LE

If you look at other allocated blocks, you will see the same thing. It makes no sense to split a block up among unrelated things.
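
The reason an unassigned code point can still report `Block=Tamil` is that block membership is purely a matter of range, independent of whether anything has been assigned there. A rough Python sketch of that idea follows. Python's standard library does not expose the Block property, so the `BLOCKS` table is a small hand-copied excerpt of Unicode's `Blocks.txt`, and `block_of` is a hypothetical helper written for this illustration; neither is how `uniprops` itself works.

```python
import bisect
import unicodedata

# Hand-copied excerpt of Unicode's Blocks.txt (an assumption for this sketch;
# a real lookup would load the complete file).
BLOCKS = [
    (0x0B00, 0x0B7F, "Oriya"),
    (0x0B80, 0x0BFF, "Tamil"),
    (0x0C00, 0x0C7F, "Telugu"),
]
STARTS = [start for start, _, _ in BLOCKS]

def block_of(cp):
    """Return the block name for a code point using only range lookup."""
    i = bisect.bisect_right(STARTS, cp) - 1
    if i >= 0 and BLOCKS[i][0] <= cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return "No_Block"  # outside the excerpt above

for cp in range(0x0B80, 0x0B86):
    ch = chr(cp)
    # unicodedata.name() takes a default that is returned for unnamed code points.
    name = unicodedata.name(ch, "<unassigned>")
    print(f"U+{cp:04X}  block={block_of(cp)}  gc={unicodedata.category(ch)}  {name}")
```

Every code point in the loop reports the Tamil block, while the general category switches between `Cn` for the holes and `Mn` or `Lo` for the assigned characters, mirroring the `uniprops` output.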

As I said, it's not as though they are going to run out of space, so I don't see what the problem is here.
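
By the way, you can get the Unicode exploration and processing tools such as unichars, uniprops, and uninames individually from my Unicode command-line toolbox, or get the whole suite via the CPAN distribution Unicode::Tussle.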