w.k 555 · unicode perl utf-8
I wonder why most modern solutions built with Perl don't enable UTF-8 by default.
I understand there are many legacy problems for core Perl scripts, where enabling it may break things. But, from my point of view, in the 21st century, big new projects (or projects with big ambitions) should make their software UTF-8 proof from scratch. Still, I don't see it happening. For example, Moose enables strict and warnings, but not Unicode. Modern::Perl reduces boilerplate too, but with no UTF-8 handling.
Why? Are there some reasons to avoid UTF-8 in modern Perl projects in the year 2011?
My comment to @tchrist got too long, so I'm adding it here.
It seems that I did not make myself clear. Let me try to add some things.
tchrist and I see the situation pretty similarly, but our conclusions are completely opposite. I agree, the situation with Unicode is complicated, but that is exactly why we (Perl users and coders) need some layer (or pragma) which makes UTF-8 handling as easy as it should be nowadays.
tchrist points to many aspects to cover; I will read and think about them for days or even weeks. Still, this is not my point. tchrist tries to prove that there is not one single way "to enable UTF-8". I do not have enough knowledge to argue with that. So, I stick to live examples.
I played around with Rakudo, and UTF-8 was just there as I needed it. I did not have any problems; it just worked. Maybe there are some limitations somewhere deeper, but at the start, everything I tested worked as I expected.
Shouldn't that be a goal for modern Perl 5 too? I stress it more: I am not suggesting UTF-8 as the default character set for core Perl; I am suggesting the possibility to trigger it with a snap for those who develop new projects.
Another example, but with a more negative tone. Frameworks should make development easier. Some years ago, I tried web frameworks, but just threw them away because "enabling UTF-8" was so obscure. I did not find how and where to hook in Unicode support. It was so time-consuming that I found it easier to go the old way. Now I see there is a bounty here to deal with the same problem with Mason 2: How to make Mason2 UTF-8 clean?. So, it is a pretty new framework, but using it with UTF-8 requires deep knowledge of its internals. It is like a big red sign: STOP, do not use me!
I really like Perl. But dealing with Unicode is painful. I still find myself running into walls. In some ways tchrist is right, and that answers my question: new projects do not embrace UTF-8 because it is too complicated in Perl 5.
tchrist 1139
Set your PERL_UNICODE environment variable to AS. This makes all Perl scripts decode @ARGV as UTF-8 strings, and sets the encoding of all three of stdin, stdout, and stderr to UTF-8. Both of these are global effects, not lexical ones.
At the top of your source file (program, module, library, do-hickey), prominently assert that you are running perl version 5.12 or better via:

    use v5.12;  # minimal for unicode string feature
    use v5.14;  # optimal for unicode string feature

Enable warnings, since the previous declaration only enables strictures and features, not warnings. I also suggest promoting Unicode warnings into exceptions, so use both of these lines, not just one of them. Note however that under v5.14, the utf8 warning class comprises three other subwarnings which can all be enabled separately: nonchar, surrogate, and non_unicode. These you may wish to exert greater control over.

    use warnings;
    use warnings qw( FATAL utf8 );

Declare that this source unit is encoded as UTF-8. Although once upon a time this pragma did other things, it now serves this one singular purpose alone and no other:

    use utf8;

Declare that anything that opens a filehandle within this lexical scope but not elsewhere is to assume that the stream is encoded in UTF-8 unless you say otherwise. That way you do not affect other modules' or other programs' code:

    use open qw( :encoding(UTF-8) :std );

Enable named characters via \N{CHARNAME}:

    use charnames qw( :full :short );

If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF-8, then say:

    binmode(DATA, ":encoding(UTF-8)");

There is of course no end of other matters with which you may eventually find yourself concerned, but these will suffice to approximate the stated goal of "make everything just work with UTF-8", albeit in a somewhat weakened sense of those terms.
One other pragma, although it is not Unicode related, is:

    use autodie;

It is strongly recommended.
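Taken together, those recommendations amount to a file header along these lines (a minimal sketch, not a canonical incantation; pick the version line and warning classes that fit your project):

    #!/usr/bin/env perl
    use v5.14;                        # implies strict; enables unicode_strings
    use utf8;                         # this source file itself is UTF-8
    use warnings;
    use warnings  qw( FATAL utf8 );   # encoding glitches become fatal errors
    use open      qw( :encoding(UTF-8) :std );  # lexically opened handles and std streams
    use charnames qw( :full :short ); # \N{CHARNAME} (built in since v5.16)
    use autodie;                      # unrelated to Unicode, but recommended above

    say "\N{GREEK SMALL LETTER ALPHA}";  # prints α, encoded as UTF-8 on STDOUT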
说"Perl应该[ 不知何故!]默认情况下启用Unicode"甚至没有开始考虑到在某种罕见和孤立的情况下说得足够甚至有用.Unicode不仅仅是一个更大的角色曲目; 它也是这些角色以多种方式进行互动的方式.
即使是那些(某些)人似乎认为他们想要的简单的最小措施,也可以保证惨败数百万行代码,这些代码没有机会"升级"到你漂亮的新勇敢的新世界现代性.
这是比人们假装更复杂的方式.在过去的几年里,我一直认为这是一个巨大的问题.我希望被证明我错了.但我不认为我.从根本上说,你想要对它施加的模型更加复杂,而且这里有一种复杂性,你永远无法扫清地毯.如果你尝试,你将破坏自己的代码或其他人的代码.在某些时候,您只需要分解并了解Unicode的含义.你不能假装它不是它.
尽力使Unicode变得简单,远远超过我用过的任何东西.如果您认为这很糟糕,请尝试其他一段时间.然后回过头来说:要么你会回到一个更美好的世界,否则你将带来与你相同的知识,这样我们就可以利用你的新知识在这些方面做得更好.
至少,以下是"默认情况下启用Unicode"似乎需要的一些内容,如下所示:
All source code should be in UTF-8 by default. You can get that with use utf8 or export PERL5OPTS=-Mutf8.
The Perl DATA handle should be UTF-8. You will have to do this on a per-package basis, as in binmode(DATA, ":encoding(UTF-8)").
Program arguments to Perl scripts should be understood to be UTF-8 by default: export PERL_UNICODE=A, or perl -CA, or export PERL5OPTS=-CA.
The standard input, output, and error streams should default to UTF-8: export PERL_UNICODE=S for all of them, or i, o, or both for just some of them. These are like perl -CS.
Any other handles opened by Perl should be considered UTF-8 unless declared otherwise: export PERL_UNICODE=D, or with i and o for particular ones of these; export PERL5OPTS=-CD would work. That makes -CSAD for all of them.
Cover both bases plus all the streams you open with export PERL5OPTS=-Mopen=:utf8,:std. See uniquote.
You don't want to miss UTF-8 encoding errors. Try export PERL5OPTS=-Mwarnings=FATAL,utf8. And make sure your input streams are always binmoded to :encoding(UTF-8), not just to :utf8.
Code points between 128 and 255 should be understood by Perl to be the corresponding Unicode code points, not just unpropertied binary values. use feature "unicode_strings" or export PERL5OPTS=-Mfeature=unicode_strings. That will make uc("\xDF") eq "SS" and "\xE9" =~ /\w/. A simple export PERL5OPTS=-Mv5.12 or better will also get that.
Named Unicode characters are not enabled by default, so add export PERL5OPTS=-Mcharnames=:full,:short,latin,greek or some such. See uninames and tcgrep.
You almost always need access to the functions from the standard Unicode::Normalize module for the various types of decompositions: export PERL5OPTS=-MUnicode::Normalize=NFD,NFKD,NFC,NFKC, and then always run incoming stuff through NFD and outbound stuff through NFC. There's no I/O layer for these yet that I'm aware of, but see nfc, nfd, nfkd, and nfkc.
String comparisons in Perl using eq, ne, lc, cmp, sort, &c&cc are always wrong. So instead of @a = sort @b, you need @a = Unicode::Collate->new->sort(@b). Might as well add that to your export PERL5OPTS=-MUnicode::Collate. You can cache the key for binary comparisons.
Perl built-ins like printf and write do the wrong thing with Unicode data. You need to use the Unicode::GCString module for the former, and both that and the Unicode::LineBreak module as well for the latter. See uwc and unifmt.
If you want your \d+ captures to count as integers, then you are going to have to run them through the Unicode::UCD::num function, because Perl's built-in atoi(3) isn't currently clever enough.
You are going to have filesystem issues. Some filesystems silently enforce a conversion to NFC; others silently enforce a conversion to NFD. And others do something else still. Some even ignore the matter altogether, which leads to even greater problems. So you have to do your own NFC/NFD handling to keep sane.
All your code involving a-z or A-Z and such MUST BE CHANGED, including m//, s///, and tr///. It should stand out as a screaming red flag that your code is broken. Though it is not clear how it must change. Getting the right properties, and understanding their casefolds, is harder than you might think. I use unichars and uniprops every single day.
Code that uses \p{Lu} is almost as wrong as code that uses [A-Za-z]. You need to use \p{Upper} instead, and know the reason why. Yes, \p{Lowercase} and \p{Lower} are different from \p{Ll} and \p{Lowercase_Letter}.
Code that uses [a-zA-Z] is even worse. And it can't use \pL or \p{Letter}; it needs to use \p{Alphabetic}. Not all alphabetics are letters, you know!
If you are looking for Perl variables with /[\$\@\%]\w+/, then you have a problem. You need to look for /[\$\@\%]\p{IDS}\p{IDC}*/, and even that isn't thinking about the punctuation variables or package variables.
If you are checking for whitespace, then you should choose between \h and \v, depending. And you should never use \s, since it DOES NOT MEAN [\h\v], contrary to popular belief.
If you are using \n for a line boundary, or even \r\n, then you are doing it wrong. You have to use \R, which is not the same!
If you don't know when and whether to call Unicode::Stringprep, then you had better learn.
Case-insensitive comparisons need to check whether two things are the same letters no matter their diacritics and such. The easiest way to do that is with the standard Unicode::Collate module: Unicode::Collate->new(level => 1)->cmp($a, $b). There are also eq methods and such, and you should probably learn about the match and substr methods, too. These have distinct advantages over the Perl built-ins (see the sketch after this list).
Sometimes that's still not enough, and you need the Unicode::Collate::Locale module instead, as in Unicode::Collate::Locale->new(locale => "de__phonebook", level => 1)->cmp($a, $b). Consider that Unicode::Collate->new(level => 1)->eq("d", "ð") is true, but Unicode::Collate::Locale->new(locale => "is", level => 1)->eq("d", "ð") is false. Similarly, "ae" and "æ" are eq if you don't use locales, or if you use the English one, but they are different in the Icelandic locale. Now what? It's tough, I tell you. You can play with ucsort to test some of these things out.
Consider how to match the pattern CVCV (consonant, vowel, consonant, vowel) in the string "niño". Its NFD form (which you had darned well better have remembered to put it in) becomes "nin\x{303}o". Now what are you going to do? Even pretending that a vowel is [aeiou] (which is wrong, by the way), you won't be able to do something like (?=[aeiou])\X) either, because even in NFD a code point like 'ø' does not decompose! However, it will test equal to an 'o' using the UCA comparison I just showed you. You can't rely on NFD; you have to rely on UCA.
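To ground a few of the collation and normalization points above, here is a minimal sketch using the core Unicode::Collate and Unicode::Normalize modules (the strings are illustrative; the 'ø'/'o' equality at level 1 is the UCA behavior described above):

    use v5.14;
    use utf8;
    use open qw( :std :encoding(UTF-8) );
    use Unicode::Normalize qw( NFD );
    use Unicode::Collate;

    my $coll = Unicode::Collate->new(level => 1);   # primary strength only

    # Level 1 ignores case and diacritics entirely:
    say $coll->eq("résumé", "RESUME") ? "equal at level 1" : "different";

    # NFD decomposes "é" into e + COMBINING ACUTE ACCENT ...
    say length NFD("é");    # 2 code points
    # ... but "ø" has no canonical decomposition ...
    say length NFD("ø");    # still 1 code point
    # ... yet the UCA comparison still matches it against a plain "o":
    say $coll->eq("ø", "o") ? "UCA equates them" : "no match";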
And that's not all. There are a million broken assumptions that people make about Unicode. Until they understand these things, their Perl code will be broken.
Code that assumes it can open a text file without specifying the encoding is broken.
Code that assumes the default encoding is some sort of native platform encoding is broken.
Code that assumes that web pages in Japanese or Chinese take up less space in UTF-16 than in UTF-8 is wrong.
Code that assumes Perl uses UTF-8 internally is wrong.
Code that assumes that encoding errors will always raise an exception is wrong.
Code that assumes Perl code points are limited to 0x10_FFFF is wrong.
Code that assumes you can set $/ to something that will work with any valid line separator is wrong.
Code that assumes roundtrip equality on casefolding, like lc(uc($s)) eq $s or uc(lc($s)) eq $s, is completely broken and wrong. Consider that uc("σ") and uc("ς") are both "Σ", but lc("Σ") cannot possibly return both of those. (A sketch after this list checks a few of these casing claims.)
Code that assumes every lowercase code point has a distinct uppercase one, or vice versa, is broken. For example, "ª" is a lowercase letter with no uppercase; whereas both "ᵃ" and "ᴬ" are letters, but they are not lowercase letters; however, they are both lowercase code points without corresponding uppercase versions. Got that? They are not \p{Lowercase_Letter}, despite being both \p{Letter} and \p{Lowercase}.
Code that assumes changing the case doesn’t change the length of the string is broken.
Code that assumes there are only two cases is broken. There’s also titlecase.
Code that assumes only letters have case is broken. Beyond just letters, it turns out that numbers, symbols, and even marks have case. In fact, changing the case can even make something change its main general category, like a \p{Mark} turning into a \p{Letter}. It can also make it switch from one script to another.
Code that assumes that case is never locale-dependent is broken.
Code that assumes Unicode gives a fig about POSIX locales is broken.
Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment.
Code that assumes that diacritics \p{Diacritic} and marks \p{Mark} are the same thing is broken.
Code that assumes \p{GC=Dash_Punctuation} covers as much as \p{Dash} is broken.
Code that assumes dash, hyphens, and minuses are the same thing as each other, or that there is only one of each, is broken and wrong.
Code that assumes every code point takes up no more than one print column is broken.
Code that assumes that all \p{Mark} characters take up zero print columns is broken.
Code that assumes that characters which look alike are alike is broken.
Code that assumes that characters which do not look alike are not alike is broken.
Code that assumes there is a limit to the number of code points in a row that just one \X can match is wrong.
Code that assumes \X can never start with a \p{Mark} character is wrong.
Code that assumes that \X can never hold two non-\p{Mark} characters is wrong.
Code that assumes that it cannot use "\x{FFFF}" is wrong.
Code that assumes a non-BMP code point that requires two UTF-16 (surrogate) code units will encode to two separate UTF-8 characters, one per code unit, is wrong. It doesn't: it encodes to a single code point.
Code that transcodes from UTF-16 or UTF-32 with leading BOMs into UTF-8 is broken if it puts a BOM at the start of the resulting UTF-8. This is so stupid the engineer should have their eyelids removed.
Code that assumes CESU-8 is a valid UTF encoding is wrong. Likewise, code that thinks encoding U+0000 as "\xC0\x80" is UTF-8 is broken and wrong. These guys also deserve the eyelid treatment.
Code that assumes characters like > always point to the right and < always point to the left is wrong, because they in fact do not.
Code that assumes if you first output character X and then character Y, that those will show up as XY is wrong. Sometimes they don't.
Code that assumes that ASCII is good enough for writing English properly is stupid, shortsighted, illiterate, broken, evil, and wrong. Off with their heads! If that seems too extreme, we can compromise: henceforth they may type only with their big toe from one foot. (The rest will be duct-taped.)
Code that assumes that all \p{Math} code points are visible characters is wrong.
Code that assumes \w contains only letters, digits, and underscores is wrong.
Code that assumes that ^ and ~ are punctuation marks is wrong.
Code that assumes that ü has an umlaut is wrong.
Code that believes things like ₨ contain any letters in them is wrong.
Code that believes \p{InLatin} is the same as \p{Latin} is heinously broken.
Code that believes that \p{InLatin} is almost ever useful is almost certainly wrong.
Code that believes that given $FIRST_LETTER as the first letter in some alphabet and $LAST_LETTER as the last letter in that same alphabet, that [${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is almost always completely broken and wrong and meaningless.
Code that believes someone’s name can only contain certain characters is stupid, offensive, and wrong.
Code that tries to reduce Unicode to ASCII is not merely wrong, its perpetrator should never be allowed to work in programming again. Period. I’m not even positive they should even be allowed to see again, since it obviously hasn’t done them much good so far.
Code that believes there’s some way to pretend textfile encodings don’t exist is broken and dangerous. Might as well poke the other eye out, too.
Code that converts unknown characters to ? is broken, stupid, braindead, and runs contrary to the standard recommendation, which says NOT TO DO THAT! RTFM for why not.
Code that believes it can reliably guess the encoding of an unmarked textfile is guilty of a fatal mélange of hubris and naïveté that only a lightning bolt from Zeus will fix.
Code that believes you can use printf widths to pad and justify Unicode data is broken and wrong.
Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory, you'll actually find that file with the name you created it under is buggy, broken, and wrong. Stop being surprised by this!
Code that believes UTF-16 is a fixed-width encoding is stupid, broken, and wrong. Revoke their programming licence.
Code that treats code points from one plane one whit differently than those from any other plane is ipso facto broken and wrong. Go back to school.
Code that believes that stuff like /s/i can only match "S" or "s" is broken and wrong. You'd be surprised.
Code that uses \PM\pM* to find grapheme clusters instead of using \X is broken and wrong.
People who want to go back to the ASCII world should be wholeheartedly encouraged to do so, and in honor of their glorious upgrade they should be provided gratis with a pre-electric manual typewriter for all their data-entry needs. Messages sent to them should be sent via an ALLCAPS telegraph at 40 characters per line and hand-delivered by a courier. STOP.
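Several of the casing claims in that list are easy to check from a short script; a sketch:

    use v5.14;
    use utf8;
    use open qw( :std :encoding(UTF-8) );

    # Roundtrip casefolding fails: both sigmas uppercase to the same letter,
    # so lc() cannot possibly give both of them back.
    say uc("σ") eq uc("ς") ? "both uppercase to Σ" : "distinct";

    # Changing case can change the length of the string:
    say length "ß";        # 1
    say length uc "ß";     # 2, because uc("ß") is "SS"

    # There are three cases, not two: titlecase differs from uppercase.
    say ucfirst "ǆ";       # ǅ (titlecase), not Ǆ (uppercase)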
My own boilerplate these days tends to look like this:
use v5.14;

use utf8;
use strict;
use autodie;
use warnings;
use warnings   qw( FATAL utf8 );
use open       qw( :std :utf8 );
use charnames  qw( :full );
use feature    qw( unicode_strings );
I don't know how much more "default Unicode in Perl" you can get than what I've written. Well, yes I do: you should be using Unicode::Collate and Unicode::LineBreak, too. Among other things.
jrockway 95
There are two stages to processing Unicode text. The first is "how do I input it and output it without losing information". The second is "how do I treat text according to local language conventions".
tchrist's post covers both, but the second part is where 99% of the text in his post comes from. Most programs don't even handle I/O correctly, so it's important to understand that before you even begin to worry about normalization and collation.
This post aims to solve the first problem.
When you read data into Perl, it doesn't care what encoding it is. It allocates some memory and stashes the bytes there. If you say print $str, it just blits those bytes out to your terminal, which is probably set to assume that everything written to it is UTF-8, and your text shows up.
Marvelous.
Except, it's not. If you try to treat the data as text, you'll see that something bad is happening. You need go no further than length to see the disagreement between what Perl thinks about your string and what you think about your string. Write a one-liner like perl -E 'while(<>){ chomp; say length }' and type in 文字化け, and you get 12... not the correct answer, 4.
That's because Perl thinks your string is not text. You have to tell it that it's text before it will give you the right answer.
That's easy enough; the Encode module has the functions to do that. The generic entry point is Encode::decode (or use Encode qw(decode), of course). That function takes some string from the outside world (what we'll call "octets", a way of saying "8-bit bytes") and turns it into some text that Perl will understand. The first argument is a character encoding name, like "UTF-8" or "ASCII" or "EUC-JP". The second argument is the string. The return value is a Perl scalar containing the text.
(There is also Encode::decode_utf8, which assumes UTF-8 for the encoding.)
If we rewrite our one-liner:
perl -MEncode=decode -E 'while(<>){ chomp; say length decode("UTF-8", $_) }'
We type in 文字化け and get "4" as the result. Success.
That, right there, is the solution to 99% of Unicode problems in Perl.
The key is, whenever any text comes into your program, you must decode it. The Internet cannot transmit characters. Files cannot store characters. There are no characters in your database. There are only octets, and you can't treat octets as characters in Perl. You must decode the encoded octets into Perl characters with the Encode module.
The other half of the problem is getting data out of your program. That's easy; you just say use Encode qw(encode), decide what encoding your data should be in (UTF-8 for terminals that understand UTF-8, UTF-16 for files on Windows, etc.), and then output the result of encode($encoding, $data) instead of outputting $data.
This operation converts Perl's characters, which is what your program operates on, into octets that the outside world can use. It would be a lot easier if we could just send characters over the Internet or to our terminals, but we can't: octets only. So we have to convert characters to octets, since otherwise the results are undefined.
To summarize: encode all output and decode all input.
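Here is a minimal sketch of that rule as a complete filter, assuming UTF-8 on both sides (swap in whatever encodings your inputs and outputs actually use):

    use strict;
    use warnings;
    use Encode qw( decode encode );

    while (my $octets = <STDIN>) {
        my $text = decode("UTF-8", $octets);    # octets in -> characters
        chomp $text;
        my $out = scalar reverse $text;         # character operations are now safe
        print encode("UTF-8", $out . "\n");     # characters -> octets out
    }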
Now we'll talk about three issues that make this a little challenging. The first is libraries. Do they handle text correctly? The answer is... they try. If you download a web page, LWP will give your result back as text. If you call the right method on the result, that is (and that happens to be decoded_content, not content, which is just the octet stream it got from the server). Database drivers can be flaky; if you use DBD::SQLite with just Perl, it will work out, but if some other tool has stored text in the database in some encoding other than UTF-8... well... it's not going to be handled correctly until you write code to handle it correctly.
Outputting data is usually easier, but if you see "wide character in print", then you know you're messing up the encoding somewhere. That warning means "hey, you're trying to leak Perl characters to the outside world and that doesn't make any sense". Your program may appear to work (because the other end usually handles the raw Perl characters correctly), but it is very broken and could stop working at any moment. Fix it with an explicit Encode::encode!
The second problem is UTF-8-encoded source code. Unless you say use utf8 at the top of each file, Perl will not assume that your source code is UTF-8. That means each time you write a literal like my $var = '文字化け', you're injecting garbage into your program that will break everything horribly. You don't have to use utf8, but if you don't, you must not use any non-ASCII characters in your program.
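A quick sketch of the difference this makes, reusing the string from earlier:

    use utf8;                    # tell perl this source file is UTF-8
    my $var = '文字化け';
    print length($var), "\n";    # 4: the literal became a character string

    # Without "use utf8", the same line would have produced a 12-"character"
    # Latin-1 string built from the literal's raw UTF-8 bytes.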
The third problem is how Perl handles The Past. A long time ago, there was no such thing as Unicode, and Perl assumed that everything was Latin-1 text or binary. So when data comes into your program and you start treating it as text, Perl treats each octet as a Latin-1 character. That's why, when we asked for the length of "文字化け", we got 12: Perl assumed we were operating on the Latin-1 string "æååã" (which is 12 characters, some of them non-printing).
This is called an "implicit upgrade", and it's a perfectly reasonable thing to do, but it's not what you want if your text is not Latin-1. That's why it's critical to explicitly decode your input: if you don't do it, Perl will, and it might do it wrong.
People run into trouble where half their data is a proper character string and some is still binary. Perl will interpret the part that's still binary as though it were Latin-1 text and then combine it with the correct character data. This makes it look like handling your characters correctly is what broke your program, when in reality you just haven't fixed it enough.
Here's an example: you have a program that reads a UTF-8-encoded text file, you tack a Unicode PILE OF POO onto each line, and you print it out. You write it like:
while (<>) {
    chomp;
    say "$_ 💩";
}
And then run it on some UTF-8-encoded data, like:
perl poo.pl input-data.txt
It prints the UTF-8 data with a poo at the end of each line. Perfect, my program works!
But no, you're just doing binary concatenation. You're reading octets from the file, removing a \n with chomp, and then tacking on the bytes of the UTF-8 representation of the PILE OF POO character. When you revise your program to decode the data from the file and encode the output, you'll notice that you get garbage ("ð©") instead of the poo. This will lead you to believe that decoding the input file was the wrong thing to do. It's not.
The problem is that the poo is being implicitly upgraded as Latin-1. If you use utf8 to make it literal text instead of binary, it will work again!
(That is the number one problem I run into when helping people with Unicode. They did part of it right, and that broke their program. That's what's sad about undefined results: you can have a working program for a long time, but when you start to fix it, it breaks. Don't worry; if you add encode/decode statements to your program and it breaks, it just means you have more work to do. Next time, when you design with Unicode in mind from the beginning, it will be much easier!)
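For completeness, a fixed version of the example might look like this (a sketch: use utf8 makes the 💩 literal real character data, and the open pragma decodes input and encodes output):

    use strict;
    use warnings;
    use utf8;                                # source (and the 💩 literal) is UTF-8
    use open qw( :std :encoding(UTF-8) );    # applies to <> and STDOUT
    use feature qw( say );

    while (<>) {
        chomp;
        say "$_ 💩";    # character concatenation, encoded once on output
    }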
That's really all you need to know about Perl and Unicode. If you tell Perl what your data is, it has the best Unicode support of any popular programming language. If you assume it will magically know what sort of text you are feeding it, though, then you're going to trash your data irrevocably. Just because your program works today on your UTF-8 terminal doesn't mean it will work tomorrow on a UTF-16-encoded file. So make it safe now, and save yourself the headache of trashing your users' data!
The easy part of handling Unicode is encoding output and decoding input. The hard part is finding all your input and output, and determining which encoding each one is. But that's why you get the big bucks. :)
小智 48
We're all in agreement that it's a difficult problem for many reasons, but that's precisely the reason to try to make it easier on everybody.
There is a recent module on CPAN, utf8::all, that attempts to "turn on Unicode. All of it".
As has been pointed out, you can't magically make the whole system (outside programs, external web requests, etc.) use Unicode as well, but we can cooperate on sensible tools that make common problems easier. That's the reason we're programmers.
If utf8::all doesn't do something you think it should, let's improve it and make it better. Or let's make additional tools that together can suit people's varying needs as well as possible.
brian d foy 34
I think you misunderstand Unicode and its relationship to Perl. No matter which way you store data (Unicode, ISO-8859-1, or many other things), your program has to know how to interpret the bytes it gets as input (decoding) and how to represent the information it wants to output (encoding). Get that interpretation wrong and you garble the data. There isn't some magic default inside your program that can tell the stuff outside your program how to act.
You think it's hard, most likely, because you are used to everything being ASCII. Everything you should have been thinking about was simply ignored by the programming language and all of the things it had to interact with. If everything used nothing but UTF-8 and you had no choice, then UTF-8 would be just as easy. But not everything does use UTF-8. For instance, you don't want your input handle to think that it's getting UTF-8 octets unless it actually is, and you don't want your output handles to be UTF-8 unless the thing reading from them can handle UTF-8. Perl has no way to know those things. That's why you are the programmer.
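For instance, here is a sketch of declaring each handle's encoding explicitly, since Perl cannot know what is on the other side (the filenames and encodings are made up for illustration):

    use strict;
    use warnings;

    # The legacy input really is Latin-1; the consumer of the output wants UTF-8.
    open my $in,  '<:encoding(ISO-8859-1)', 'legacy.txt' or die "open: $!";
    open my $out, '>:encoding(UTF-8)',      'modern.txt' or die "open: $!";

    while (my $line = <$in>) {
        print {$out} $line;    # decoded on read, re-encoded on write
    }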
I don't think Unicode in Perl 5 is too complicated. I think it's scary and people avoid it. There's a difference. To that end, I've put Unicode in Learning Perl, 6th Edition, and there's a lot of Unicode material in Effective Perl Programming. You have to spend the time to learn and understand Unicode and how it works. You can't use it effectively otherwise.
小智 28
While reading this thread, I often get the impression that people are using "UTF-8" as a synonym for "Unicode". Please make a distinction between Unicode's "code points", which are an enlarged relative of the ASCII codes, and Unicode's various "encodings". There are a few of the latter, of which UTF-8, UTF-16, and UTF-32 are the current ones and a few more are obsolete.
Please: UTF-8 (as well as all the other encodings) exists and has meaning in input and output only. Internally, since Perl 5.8.1, all strings are kept as Unicode "code points". True, you have to enable some features, as admirably covered previously.
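A short sketch of that distinction: one string of code points inside Perl, several different octet sequences outside it:

    use v5.14;
    use utf8;
    use Encode qw( encode );

    my $str = "日本";                      # two code points, however stored
    say length $str;                       # 2
    say length encode("UTF-8",    $str);   # 6 octets
    say length encode("UTF-16BE", $str);   # 4 octets
    say length encode("UTF-32BE", $str);   # 8 octets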
geekosaur 10
There's a truly horrifying amount of ancient code out there in the wild, much of it in the form of common CPAN modules. I've found I have to be fairly careful enabling Unicode if I use external modules that might be affected by it, and I am still trying to identify and fix some Unicode-related failures in several Perl scripts I use regularly (in particular, iTiVo fails badly on anything that's not 7-bit ASCII due to transcoding issues).