Testing query string Unicode handling in Perl

Ovi*_*vid 3 testing unicode perl query-string

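I was writing an example of testing query string parsing when I ran into a Unicode problem. In short, the letter "Omega" (Ω) does not seem to be decoded correctly.

  • Unicode: U+2126
  • 3-byte sequence: \xe2\x84\xa6
  • URI-encoded: %E2%84%A6

So I wrote this test program to verify that I can "decode" Unicode query strings with URI::Encode.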

use strict;
use warnings;
use utf8::all;    # use before Test::Builder clones STDOUT, etc.
use URI::Encode 'uri_decode';
use Test::More;

sub parse_query_string {
    my $query_string = shift;
    my @pairs = split /[&;]/ => $query_string;

    my %values_for;
    foreach my $pair (@pairs) {
        my ( $key, $value ) = split( /=/, $pair );
        $_ = uri_decode($_) for $key, $value;
        $values_for{$key} ||= [];
        push @{ $values_for{$key} } => $value;
    }
    return \%values_for;
}

my $omega = "\N{U+2126}";
my $query = parse_query_string('alpha=%E2%84%A6');
is_deeply $query, { alpha => [$omega] }, 'Unicode should decode correctly';

diag $omega;
diag $query->{alpha}[0];

done_testing;

And the output of the test:

query.t .. 
not ok 1 - Unicode should decode correctly
#   Failed test 'Unicode should decode correctly'
#   at query.t line 23.
#     Structures begin differing at:
#          $got->{alpha}[0] = 'â¦'
#     $expected->{alpha}[0] = '?'
# ?
# â¦
1..1
# Looks like you failed 1 test of 1.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests 

Test Summary Report
-------------------
query.t (Wstat: 256 Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
Files=1, Tests=1,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.05 cusr  0.00 csys =  0.09 CPU)
Result: FAIL

It looks to me like URI::Encode may be broken here, but switching to URI::Escape and using its uri_unescape function reports the same error. What am I missing?

miy*_*awa 7

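URI-encoded characters simply represent a UTF-8 byte sequence. URI::Encode and URI::Escape only turn the escapes back into that UTF-8 byte string; neither of them decodes the byte string into characters (which is the correct behavior for a generic URI decoding library).

In other words, your code is basically doing this: is "\N{U+2126}", "\xe2\x84\xa6". That fails because, for the comparison, perl upgrades the latter to a Latin-1 string that is 3 characters long.

You have to decode the input values yourself with Encode::decode_utf8 after calling uri_decode, or compare against the encoded UTF-8 byte sequence instead.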
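To make the byte-string versus character-string difference concrete, here is a minimal sketch that uses only the core Encode module. The variable names are mine, and the literal "\xe2\x84\xa6" stands in for what uri_decode returns for %E2%84%A6; it shows both ways out mentioned above.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# What the percent-decoder hands back: three raw octets, not one character.
my $bytes = "\xe2\x84\xa6";

# What the test expects: a single character, U+2126.
my $omega = "\N{U+2126}";

printf "bytes: length %d, character: length %d\n",
    length $bytes, length $omega;

# Option 1: decode the octets into characters before comparing.
my $decoded = decode_utf8($bytes);
print "decoded octets match the character\n" if $decoded eq $omega;

# Option 2: encode the expected character and compare octets instead.
my $encoded = encode_utf8($omega);
print "encoded character matches the octets\n" if $encoded eq $bytes;
```

The two strings never compare equal as they stand, because one has length 3 and the other length 1; either conversion puts both sides in the same form.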


ilm*_*ari 5

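URI escapes represent octets and know nothing about character encodings, so you have to decode from UTF-8 octets to characters yourself, for example: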

$_ = decode_utf8(uri_decode($_)) for $key, $value;
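Dropped into the parser from the question, that one-line change might look like the sketch below. To keep it free of CPAN dependencies, percent_decode is a hypothetical regex-based stand-in for URI::Encode's uri_decode (it is not part of any module); with URI::Encode installed you would call uri_decode there instead. The guard for a missing value is also my addition.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical stand-in for uri_decode: turn each %XX escape into the
# single octet it names. The result is a byte string, not characters.
sub percent_decode {
    my $string = shift;
    $string =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $string;
}

sub parse_query_string {
    my $query_string = shift;

    my %values_for;
    foreach my $pair ( split /[&;]/, $query_string ) {
        my ( $key, $value ) = split /=/, $pair, 2;
        $value = '' unless defined $value;    # tolerate "key" with no "="

        # Unescape to octets first, then decode those octets as UTF-8.
        $_ = decode_utf8( percent_decode($_) ) for $key, $value;

        push @{ $values_for{$key} }, $value;
    }
    return \%values_for;
}

my $query = parse_query_string('alpha=%E2%84%A6');
print "alpha decoded to U+2126\n" if $query->{alpha}[0] eq "\N{U+2126}";
```

The order matters: the escapes have to be turned into octets before those octets can be interpreted as UTF-8.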