In Perl, how can I replace UTF-8 characters such as \x91, \x{2018}, \x{2013}, \x{2014} with simple ASCII characters?

bod*_*ydo 1 perl encoding ascii character-encoding non-ascii-characters

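I am processing a variety of articles, and the problem I keep running into is that different authors use different characters for punctuation.

For example, several of the documents I am currently working with contain the following characters: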

\x91
\x92
\x{2018}
\x{2019}

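All of these characters represent a simple single quote, '.

What I want to do is normalize the articles so that they all use the same punctuation style.

Does anyone know of a module or method for converting these and similar characters (double quotes, dashes, and so on) to plain ASCII characters?

What I am currently doing is the following: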

sub fix_chars_in_document {
    my $document = shift;
    $document =~ s/\xa0/ /g;
    $document =~ s/\x91/'/g;
    $document =~ s/\x92/'/g;
    $document =~ s/\x93/"/g;
    $document =~ s/\x94/"/g;
    $document =~ s/\x97/-/g;
    $document =~ s/\xab/"/g;
    $document =~ s/\xa9//g;
    $document =~ s/\xae//g;
    $document =~ s/\x{2018}/'/g;
    $document =~ s/\x{2019}/'/g;
    $document =~ s/\x{201C}/"/g;
    $document =~ s/\x{201D}/"/g;
    $document =~ s/\x{2022}//g;
    $document =~ s/\x{2013}/-/g;
    $document =~ s/\x{2014}/-/g;
    $document =~ s/\x{2122}//g; 
    return $document ;
}

But this is tedious, because I have to find each character by hand and add a replacement for it.

ike*_*ami 7

First of all, your solution would benefit from a hash.

# Map each unwanted character to its ASCII replacement.
# An empty string deletes the character, as the question's s/...//g lines do.
my %asciify = (
   chr(0x00A0) => ' ',
   chr(0x0091) => "'",
   chr(0x0092) => "'",
   chr(0x0093) => '"',
   chr(0x0094) => '"',
   chr(0x0097) => '-',
   chr(0x00AB) => '"',
   chr(0x00A9) => '',
   chr(0x00AE) => '',
   chr(0x2018) => "'",
   chr(0x2019) => "'",
   chr(0x201C) => '"',
   chr(0x201D) => '"',
   chr(0x2022) => '',
   chr(0x2013) => '-',
   chr(0x2014) => '-',
   chr(0x2122) => '',
);

my $pat = join '', map quotemeta, keys %asciify;
my $re = qr/[$pat]/;

sub fix_chars {
    my ($s) = @_;
    $s =~ s/($re)/$asciify{$1}/g;
    return $s;
}
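To see the hash approach end to end, here is a minimal, self-contained sketch. The mapping is a trimmed-down subset chosen for illustration, and the sample string is made up; it assumes the input has already been decoded into Perl characters rather than raw bytes.

```perl
use strict;
use warnings;

# Illustrative subset of the mapping; an empty string deletes the character.
my %asciify = (
    "\x{00A0}" => ' ',   # no-break space
    "\x{2018}" => "'",   # left single quotation mark
    "\x{2019}" => "'",   # right single quotation mark
    "\x{201C}" => '"',   # left double quotation mark
    "\x{201D}" => '"',   # right double quotation mark
    "\x{2013}" => '-',   # en dash
    "\x{2014}" => '-',   # em dash
    "\x{2122}" => '',    # trade mark sign
);

# Build one character class from the keys so the text is scanned only once.
my $class = join '', map { quotemeta } keys %asciify;
my $re    = qr/[$class]/;

sub fix_chars {
    my ($s) = @_;
    $s =~ s/($re)/$asciify{$1}/g;
    return $s;
}

# Hypothetical sample input.
my $in  = "\x{201C}It\x{2019}s here\x{201D}\x{00A0}\x{2013} Acme\x{2122}";
my $out = fix_chars($in);
print "$out\n";   # prints: "It's here" - Acme
```

Adding support for another character is then a one-line change to the hash, and the regex is rebuilt from the keys automatically.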

That said, what you want is Text::Unidecode.

To convert only the punctuation characters:

use Text::Unidecode qw( unidecode );
s/(\p{Punct}+)/ unidecode($1) /eg;
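The one-liner above operates on `$_`. As a rough sketch of how it might look on a named variable, the following compares punctuation-only conversion with converting the whole string. It assumes Text::Unidecode is installed from CPAN; the sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Text::Unidecode qw( unidecode );

# Hypothetical sample text: curly quotes, an en dash, and accented letters.
my $text = "\x{201C}caf\x{E9}\x{201D} \x{2013} d\x{E9}j\x{E0} vu";

# Transliterate only runs of punctuation; accented letters are left alone.
(my $punct_only = $text) =~ s/(\p{Punct}+)/ unidecode($1) /eg;

# Transliterate everything, accented letters included.
my $all_ascii = unidecode($text);

# $punct_only may still contain non-ASCII letters, so encode on output.
binmode STDOUT, ':encoding(UTF-8)';
print "$punct_only\n$all_ascii\n";
```

Note that `\p{Punct}` matches only characters in the Unicode punctuation categories. The C1 control characters \x91 and \x92 from the question are not among them, so text containing those would still need the hash approach or a correct decoding step first.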