JW.*_*JW. 33 php unicode pcre utf-8
I am trying to search a UTF-8 encoded string using preg_match.
preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];
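This should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems the subject is not being treated as a UTF-8 encoded string, even though I am passing the "u" modifier in the regular expression.
I have the following settings in my php.ini, and other UTF-8 functions are working: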
mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off
Any ideas?
Gum*_*mbo 39
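Although the u modifier makes both the pattern and the subject be interpreted as UTF-8, the captured offsets are still counted in bytes.
You can use mb_strlen to get the length in UTF-8 characters rather than bytes: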
$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
// Take the bytes before the match, then count them as UTF-8 characters.
echo mb_strlen(substr($str, 0, $a_matches[0][1]), 'UTF-8');
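The same conversion can be wrapped in a small helper. This is only a sketch: the function name byteOffsetToCharOffset is made up for illustration, and it assumes the subject is valid UTF-8. It uses mb_strcut, which always slices by bytes, instead of substr, because the question's php.ini sets mbstring.func_overload = 7, and under that setting substr is replaced by its character-based mbstring counterpart.

```php
<?php
// Hypothetical helper (not part of PHP): converts a byte offset, as returned
// with PREG_OFFSET_CAPTURE, into a character offset. Assumes valid UTF-8.
function byteOffsetToCharOffset(string $subject, int $byteOffset): int
{
    // mb_strcut() slices by bytes, so overloading of substr() does not affect it.
    $prefix = mb_strcut($subject, 0, $byteOffset, 'UTF-8');
    // Count the characters that precede the match.
    return mb_strlen($prefix, 'UTF-8');
}

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $m, PREG_OFFSET_CAPTURE);
echo byteOffsetToCharOffset($str, $m[0][1]), "\n"; // prints 1
```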
Nat*_*xet 25
Try adding (*UTF8) at the start of the regex, inside the delimiters:
preg_match('/(*UTF8)H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
Magic, thanks to a comment at http://www.php.net/manual/es/function.preg-match.php#95828.
use*_*291 20
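It looks like this is a "feature"; see http://bugs.php.net/bug.php?id=37391
The 'u' switch only makes sense to PCRE; PHP itself is unaware of it.
From PHP's point of view, strings are byte sequences, and returning a byte offset seems logical (I don't say "correct").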
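To make the byte-versus-character distinction concrete, here is a short sketch using the string from the question. It assumes the plain string functions are not overloaded (that is, mbstring.func_overload is not in effect), so strlen and strpos count bytes.

```php
<?php
$str = "\xC2\xA1Hola!"; // "¡Hola!"; the "¡" occupies two bytes in UTF-8

echo bin2hex($str), "\n";                    // c2a1486f6c6121
echo strlen($str), "\n";                     // 7 (bytes)
echo mb_strlen($str, 'UTF-8'), "\n";         // 6 (characters)
echo strpos($str, 'H'), "\n";                // 2 (byte offset, same as preg_match)
echo mb_strpos($str, 'H', 0, 'UTF-8'), "\n"; // 1 (character offset)
```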
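Forgive me, but someone may find this useful: the code below can serve as a replacement for both preg_match and preg_match_all, and returns correct matches with correct offsets for UTF-8 encoded strings.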
mb_internal_encoding('UTF-8');

/**
 * Returns array of matches in same format as preg_match or preg_match_all
 * @param bool   $matchAll If true, execute preg_match_all, otherwise preg_match
 * @param string $pattern  The pattern to search for, as a string.
 * @param string $subject  The input string.
 * @param int    $offset   The place from which to start the search (in bytes).
 * @return array
 */
function pregMatchCapture($matchAll, $pattern, $subject, $offset = 0)
{
    $matchInfo = array();
    $method    = 'preg_match';
    $flag      = PREG_OFFSET_CAPTURE;
    if ($matchAll) {
        $method .= '_all';
    }
    $n = $method($pattern, $subject, $matchInfo, $flag, $offset);
    $result = array();
    if ($n !== 0 && !empty($matchInfo)) {
        // Wrap the preg_match result so both cases share one loop.
        if (!$matchAll) {
            $matchInfo = array($matchInfo);
        }
        foreach ($matchInfo as $matches) {
            $positions = array();
            foreach ($matches as $match) {
                $matchedText = $match[0];
                $byteOffset  = $match[1]; // offset in bytes, as reported by PCRE
                $positions[] = array(
                    $matchedText,
                    // Cut the prefix by bytes, then count its characters.
                    mb_strlen(mb_strcut($subject, 0, $byteOffset))
                );
            }
            $result[] = $positions;
        }
        // Unwrap again for the single-match case.
        if (!$matchAll) {
            $result = $result[0];
        }
    }
    return $result;
}

$s1 = 'Попробуем русскую строку для теста';
$s2 = 'Try english string for test';
var_dump(pregMatchCapture(true, '/обу/', $s1));
var_dump(pregMatchCapture(false, '/обу/', $s1));
var_dump(pregMatchCapture(true, '/lish/', $s2));
var_dump(pregMatchCapture(false, '/lish/', $s2));
My example outputs:
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "обу"
      [1]=>
      int(4)
    }
  }
}
array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(6) "обу"
    [1]=>
    int(4)
  }
}
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(4) "lish"
      [1]=>
      int(7)
    }
  }
}
array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(4) "lish"
    [1]=>
    int(7)
  }
}
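Since the $offset parameter above is documented as being in bytes, a caller that thinks in characters needs the reverse conversion as well. The following is a rough sketch under that assumption; charOffsetToByteOffset is a hypothetical name, and valid UTF-8 input is assumed.

```php
<?php
// Hypothetical helper (not part of PHP): converts a character offset into the
// byte offset expected by the $offset argument of preg_match. Assumes valid
// UTF-8 and that strlen() is not overloaded by mbstring.func_overload.
function charOffsetToByteOffset(string $subject, int $charOffset): int
{
    // Take the first $charOffset characters, then measure them in bytes.
    return strlen(mb_substr($subject, 0, $charOffset, 'UTF-8'));
}

$str = "\xC2\xA1Hola!";
$byteOffset = charOffsetToByteOffset($str, 1); // skip the leading "¡"
preg_match('/./u', $str, $m, PREG_OFFSET_CAPTURE, $byteOffset);
echo $m[0][0], ' at byte ', $m[0][1], "\n"; // prints: H at byte 2
```

Starting the search at a byte offset that falls in the middle of a multi-byte character is likely to make a u-mode match fail, which is why the conversion goes through mb_substr rather than guessing.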