Fro*_*y Z 5 php utf-8 character-encoding mbstring
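I would like to replace invalid UTF-8 characters with question marks (PHP 5.3.5).
So far I have this solution, but the invalid characters are removed instead of being replaced by '?'.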
function replace_invalid_utf8($str)
{
    return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}
echo mb_substitute_character()."\n";
echo replace_invalid_utf8('éééaaaàààeeé')."\n";
echo replace_invalid_utf8('eeeaaaaaaeeé')."\n";
It should output:
63 // ASCII code for '?' character
???aaa???eé // or ??aa??eé
eeeaaaaaaeeé
But it currently outputs:
63
aaaee // removed invalid characters
eeeaaaaaaeeé
Any suggestions?
Would you do it another way (using preg_replace(), for example)?
Thanks.
mas*_*tic 32
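You can use mb_convert_encoding(), or, since PHP 5.4, the ENT_SUBSTITUTE option of htmlspecialchars(). Of course, you can also use preg_match(). If you use intl, UConverter is available since PHP 5.5.
The recommended substitute character for invalid byte sequences is U+FFFD. For details, see "3.1.2 Substituting for Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations.
When using mb_convert_encoding(), you can specify the substitute character by passing a Unicode code point to mb_substitute_character() or by setting the mbstring.substitute_character directive. The default substitute character is ? (QUESTION MARK, U+003F).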
// REPLACEMENT CHARACTER (U+FFFD)
mb_substitute_character(0xFFFD);
function replace_invalid_byte_sequence($str)
{
    return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}
function replace_invalid_byte_sequence2($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}
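Note that mb_substitute_character() changes a process-wide setting, so other code that calls mbstring functions afterwards is affected too. The following is a minimal sketch of one way to limit that side effect. The helper name convert_with_substitute() is hypothetical (it is not part of mbstring), and try/finally assumes PHP 5.5 or later.

```php
<?php
// Hypothetical helper, not part of the original answer: runs
// mb_convert_encoding() with a temporary substitute character and
// restores the previous setting afterwards.
function convert_with_substitute($str, $substitute = 0xFFFD)
{
    // Returns an int code point, or one of "none" / "long" / "entity".
    $previous = mb_substitute_character();
    mb_substitute_character($substitute);
    try {
        return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    } finally {
        // Undo the global change even if the conversion throws.
        mb_substitute_character($previous);
    }
}

// A lone 0x80 byte is never valid UTF-8, so it is replaced by U+FFFD.
echo bin2hex(convert_with_substitute("a\x80b")), PHP_EOL;
```

The design choice here is simply to keep the global change as short-lived as possible; setting the mbstring.substitute_character directive once is the alternative when the whole application wants U+FFFD.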
UConverter offers both a procedural-style API (the static UConverter::transcode()) and an object-oriented API.
function replace_invalid_byte_sequence3($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}
function replace_invalid_byte_sequence4($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}
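When using preg_match(), you need to pay attention to the byte ranges in order to avoid the UTF-8 non-shortest form vulnerability. The range of the trail bytes changes depending on the range of the lead byte.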
lead byte: 0x00 - 0x7F, 0xC2 - 0xF4
trail byte: 0x80(or 0x90 or 0xA0) - 0xBF(or 0x8F)
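You can check the byte ranges against reference material such as the table of well-formed UTF-8 byte sequences in the Unicode Standard.
The byte range table is as follows.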
Code Points           First Byte   Second Byte   Third Byte   Fourth Byte
U+0000 - U+007F       00 - 7F
U+0080 - U+07FF       C2 - DF      80 - BF
U+0800 - U+0FFF       E0           A0 - BF       80 - BF
U+1000 - U+CFFF       E1 - EC      80 - BF       80 - BF
U+D000 - U+D7FF       ED           80 - 9F       80 - BF
U+E000 - U+FFFF       EE - EF      80 - BF       80 - BF
U+10000 - U+3FFFF     F0           90 - BF       80 - BF      80 - BF
U+40000 - U+FFFFF     F1 - F3      80 - BF       80 - BF      80 - BF
U+100000 - U+10FFFF   F4           80 - 8F       80 - BF      80 - BF
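To make the table concrete, here is a minimal sketch that turns each row into one regex alternative and only checks whether a string is well formed; it does not replace anything. The function name is_well_formed_utf8() is hypothetical. The pattern runs in byte mode (no /u modifier), so the \x escapes match raw bytes.

```php
<?php
// Hypothetical helper: returns true only if $str consists entirely of
// byte sequences that appear in the byte range table.
function is_well_formed_utf8($str)
{
    $regex = '/\A(?:
          [\x00-\x7F]                     # U+0000   - U+007F
        | [\xC2-\xDF][\x80-\xBF]          # U+0080   - U+07FF
        | \xE0[\xA0-\xBF][\x80-\xBF]      # U+0800   - U+0FFF
        | [\xE1-\xEC][\x80-\xBF]{2}       # U+1000   - U+CFFF
        | \xED[\x80-\x9F][\x80-\xBF]      # U+D000   - U+D7FF
        | [\xEE\xEF][\x80-\xBF]{2}        # U+E000   - U+FFFF
        | \xF0[\x90-\xBF][\x80-\xBF]{2}   # U+10000  - U+3FFFF
        | [\xF1-\xF3][\x80-\xBF]{3}       # U+40000  - U+FFFFF
        | \xF4[\x80-\x8F][\x80-\xBF]{2}   # U+100000 - U+10FFFF
    )*+\z/xs';
    return preg_match($regex, $str) === 1;
}

var_dump(is_well_formed_utf8("\xC3\xA9"));      // "é" as two bytes
var_dump(is_well_formed_utf8("\xC0\xAF"));      // non-shortest form of "/"
var_dump(is_well_formed_utf8("\xED\xA0\x80"));  // encoded surrogate U+D800
```

The second and third calls are rejected because 0xC0 is not an allowed lead byte and because 0xED only permits 0x80 - 0x9F as its second byte, which is exactly what the restricted ranges in the table are for.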
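For how to replace invalid byte sequences without breaking valid characters, see "3.1.1 Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations and "Table 3-8. Use of U+FFFD in UTF-8 Conversion" in the Unicode Standard.
The Unicode Standard shows an example: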
before: <61 F1 80 80 E1 80 C2 62 80 63 80 BF 64 >
after: <0061 FFFD FFFD FFFD 0062 FFFD 0063 FFFD FFFD 0064>
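Strings containing invalid bytes are hard to read with echo or var_dump(), so a small byte dump in the same notation as the "before" row can help when comparing results. This is only a sketch; hex_bytes() is a hypothetical helper name, and $after is the "after" row written out by hand as UTF-8 (U+FFFD is the three bytes EF BF BD).

```php
<?php
// Hypothetical helper: formats a string as space-separated hex bytes,
// e.g. "<61 F1 80>", matching the notation of the example above.
function hex_bytes($str)
{
    if ($str === '') {
        return '<>';
    }
    return '<' . strtoupper(implode(' ', str_split(bin2hex($str), 2))) . '>';
}

// Input row of the example.
$before = "\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64";

// Expected result row, encoded as UTF-8.
$fffd  = "\xEF\xBF\xBD";
$after = 'a' . str_repeat($fffd, 3) . 'b' . $fffd . 'c' . $fffd . $fffd . 'd';

echo hex_bytes($before), PHP_EOL;
echo hex_bytes($after), PHP_EOL;
```

Comparing hex_bytes() of a function's return value with hex_bytes($after) shows at a glance whether it follows the replacement policy of the example.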
The following is an implementation with preg_replace_callback() that follows the rules above.
function replace_invalid_byte_sequence5($str)
{
    // REPLACEMENT CHARACTER (U+FFFD)
    $substitute = "\xEF\xBF\xBD";
    $regex = '/
      ([\x00-\x7F]                       #   U+0000 -   U+007F
      |[\xC2-\xDF][\x80-\xBF]            #   U+0080 -   U+07FF
      | \xE0[\xA0-\xBF][\x80-\xBF]       #   U+0800 -   U+0FFF
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} #   U+1000 -   U+CFFF, U+E000 - U+FFFF
      | \xED[\x80-\x9F][\x80-\xBF]       #   U+D000 -   U+D7FF
      | \xF0[\x90-\xBF][\x80-\xBF]{2}    #  U+10000 -  U+3FFFF
      |[\xF1-\xF3][\x80-\xBF]{3}         #  U+40000 -  U+FFFFF
      | \xF4[\x80-\x8F][\x80-\xBF]{2})   # U+100000 - U+10FFFF
      |(\xE0[\xA0-\xBF]                  #   U+0800 -   U+0FFF (invalid, truncated)
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]    #   U+1000 -   U+CFFF, U+E000 - U+FFFF (invalid, truncated)
      | \xED[\x80-\x9F]                  #   U+D000 -   U+D7FF (invalid, truncated)
      | \xF0[\x90-\xBF][\x80-\xBF]?      #  U+10000 -  U+3FFFF (invalid, truncated)
      |[\xF1-\xF3][\x80-\xBF]{1,2}       #  U+40000 -  U+FFFFF (invalid, truncated)
      | \xF4[\x80-\x8F][\x80-\xBF]?)     # U+100000 - U+10FFFF (invalid, truncated)
      |(.)                               # invalid 1-byte
    /xs';
    // $matches[1]: valid character
    // $matches[2]: invalid (truncated) 3-byte or 4-byte character
    // $matches[3]: invalid 1-byte
    $ret = preg_replace_callback($regex, function($matches) use($substitute) {
        if (isset($matches[2]) || isset($matches[3])) {
            return $substitute;
        }
        return $matches[1];
    }, $str);
    return $ret;
}
You can also compare the bytes directly, which avoids the byte size limitation of preg_match.
function replace_invalid_byte_sequence6($str) {
    $size = strlen($str);
    $substitute = "\xEF\xBF\xBD";
    $ret = '';
    $pos = 0;
    // Filled in by reference by utf8_get_next_char().
    $char = null;
    $char_size = null;
    $valid = null;
    while (utf8_get_next_char($str, $size, $pos, $char, $char_size, $valid)) {
        $ret .= $valid ? $char : $substitute;
    }
    return $ret;
}
function utf8_get_next_char($str, $str_size, &$pos, &$char, &$char_size, &$valid)
{
    $valid = false;
    if ($str_size <= $pos) {
        return false;
    }
    if ($str[$pos] < "\x80") {
        $valid = true;
        $char_size = 1;
    } elseif ($str[$pos] < "\xC2") {
        $char_size = 1;
    } elseif ($str[$pos] < "\xE0") {
        if (!isset($str[$pos+1]) || $str[$pos+1] < "\x80" || "\xBF" < $str[$pos+1]) {
            $char_size = 1;
        } else {
            $valid = true;
            $char_size = 2;
        }
    } elseif ($str[$pos] < "\xF0") {
        $left = "\xE0" === $str[$pos] ? "\xA0" : "\x80";
        $right = "\xED" === $str[$pos] ? "\x9F" : "\xBF";
        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
            $char_size = 1;
        } elseif (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
            $char_size = 2;
        } else {
            $valid = true;
            $char_size = 3;
        }
    } elseif ($str[$pos] < "\xF5") {
        $left = "\xF0" === $str[$pos] ? "\x90" : "\x80";
        $right = "\xF4" === $str[$pos] ? "\x8F" : "\xBF";
        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
            $char_size = 1;
        } elseif (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
            $char_size = 2;
        } elseif (!isset($str[$pos+3]) || $str[$pos+3] < "\x80" || "\xBF" < $str[$pos+3]) {
            $char_size = 3;
        } else {
            $valid = true;
            $char_size = 4;
        }
    } else {
        $char_size = 1;
    }
    $char = substr($str, $pos, $char_size);
    $pos += $char_size;
    return true;
}
The test cases are here.
function run(array $callables, array $arguments)
{
    return array_map(function($callable) use($arguments) {
        return array_map($callable, $arguments);
    }, $callables);
}
$data = [
    // Table 3-8. Use of U+FFFD in UTF-8 Conversion
    // (http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf)
    "\x61"."\xF1\x80\x80"."\xE1\x80"."\xC2"."\x62"."\x80"."\x63"
    ."\x80"."\xBF"."\x64",
    // 'FULL MOON SYMBOL' (U+1F315) and invalid byte sequence
    "\xF0\x9F\x8C\x95"."\xF0\x9F\x8C"."\xF0\x9F\x8C"
];
var_dump(run([
    'replace_invalid_byte_sequence',
    'replace_invalid_byte_sequence2',
    'replace_invalid_byte_sequence3',
    'replace_invalid_byte_sequence4',
    'replace_invalid_byte_sequence5',
    'replace_invalid_byte_sequence6'
], $data));
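Note that, at the time of writing, mb_convert_encoding has a bug: it breaks a valid character that follows an invalid byte sequence, or it removes an invalid byte sequence that follows valid characters without adding U+FFFD.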
$data = [
    // U+20AC
    "\xE2\x82\xAC"."\xE2\x82\xAC"."\xE2\x82\xAC",
    "\xE2\x82"    ."\xE2\x82\xAC"."\xE2\x82\xAC",
    // U+24B62
    "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xF0\xA4\xAD"    ."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    // 'FULL MOON SYMBOL' (U+1F315)
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C",
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C" . "\xF0\x9F\x8C"
];
Although preg_match() can be used instead of preg_replace_callback(), that function has a limitation on the byte size of its input. See bug report #36463 for details. You can confirm it with the following test case.
str_repeat('a', 10000)
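Whatever limit is hit, the preg functions report failure through their return value: preg_replace_callback() returns null on error and preg_last_error() holds the error code. The sketch below is one defensive wrapper under that assumption; checked_preg_replace_callback() is a hypothetical name, and the invalid-UTF-8 case is used only because it is an easy way to trigger a PCRE error.

```php
<?php
// Hypothetical wrapper: turns a silent null result into an exception so
// that a failed replacement is not mistaken for an empty string.
function checked_preg_replace_callback($regex, $callback, $str)
{
    $ret = preg_replace_callback($regex, $callback, $str);
    if ($ret === null) {
        throw new RuntimeException(
            'PCRE failed, preg_last_error() = ' . preg_last_error()
        );
    }
    return $ret;
}

// Succeeds: a simple pattern over the 10000-byte test case.
$upper = checked_preg_replace_callback('/[a-z]/', function ($m) {
    return strtoupper($m[0]);
}, str_repeat('a', 10000));
echo strlen($upper), PHP_EOL;

// Fails: with the /u modifier, an invalid UTF-8 subject is a PCRE error.
try {
    checked_preg_replace_callback('/./u', function ($m) {
        return $m[0];
    }, "\xFF");
} catch (RuntimeException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

Wrapping the call this way does not lift any limit; it only makes the failure visible so that the caller can fall back to another approach, such as the direct byte comparison.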
Finally, my benchmark results are as follows (elapsed time in seconds).
mb_convert_encoding()
0.19628190994263
htmlspecialchars()
0.082863092422485
UConverter::transcode()
0.15999984741211
UConverter::convert()
0.29843020439148
preg_replace_callback()
0.63967490196228
direct comparison
0.71933102607727
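The raw timings are easier to compare as ratios to the fastest entry. The sketch below only does that arithmetic on the numbers listed above; the absolute values and even the ordering depend on the machine and the PHP version, so treat it as a way to read the results, not as a general ranking.

```php
<?php
// Timings in seconds, copied from the results above.
$results = [
    'mb_convert_encoding()'   => 0.19628190994263,
    'htmlspecialchars()'      => 0.082863092422485,
    'UConverter::transcode()' => 0.15999984741211,
    'UConverter::convert()'   => 0.29843020439148,
    'preg_replace_callback()' => 0.63967490196228,
    'direct comparison'       => 0.71933102607727,
];

// Sort ascending by elapsed time, keeping the labels as keys.
asort($results);
$fastest = reset($results);

// Print each entry as a multiple of the fastest one.
foreach ($results as $label => $seconds) {
    printf("%-26s %.2fx%s", $label, $seconds / $fastest, PHP_EOL);
}
```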
The benchmark code is here.
function timer(array $callables, array $arguments, $repeat = 10000)
{
    $ret = [];
    $save = $repeat;
    foreach ($callables as $key => $callable) {
        $start = microtime(true);
        do {
            array_map($callable, $arguments);
        } while($repeat -= 1);
        $stop = microtime(true);
        $ret[$key] = $stop - $start;
        $repeat = $save;
    }
    return $ret;
}
$functions = [
    'mb_convert_encoding()' => 'replace_invalid_byte_sequence',
    'htmlspecialchars()' => 'replace_invalid_byte_sequence2',
    'UConverter::transcode()' => 'replace_invalid_byte_sequence3',
    'UConverter::convert()' => 'replace_invalid_byte_sequence4',
    'preg_replace_callback()' => 'replace_invalid_byte_sequence5',
    'direct comparison' => 'replace_invalid_byte_sequence6'
];
foreach (timer($functions, $data) as $description => $time) {
    echo $description, PHP_EOL,
         $time, PHP_EOL;
}