如何在PHP中迭代UTF-8字符串?

czu*_*zuk 41 php utf-8

如何使用索引逐字符迭代UTF-8字符串?

当您使用括号运算符访问UTF-8字符串时,$str[0]utf编码的字符由2个或更多元素组成.

例如:

$str = "K?t";
$str[0] = "K";
$str[1] = "?";
$str[2] = "?";
$str[3] = "t";
Run Code Online (Sandbox Code Playgroud)

但我希望:

$str[0] = "K";
$str[1] = "?";
$str[2] = "t";
Run Code Online (Sandbox Code Playgroud)

这是可能的,mb_substr但这是非常缓慢的,即.

mb_substr($str, 0, 1) = "K"
mb_substr($str, 1, 1) = "?"
mb_substr($str, 2, 1) = "t"
Run Code Online (Sandbox Code Playgroud)

是否有另一种方法来逐字符串字符串而不使用mb_substr

var*_*tec 58

使用preg_split.使用"u"修饰符,它支持UTF-8 unicode.

$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
Run Code Online (Sandbox Code Playgroud)

  • 更重要的是,我已经测试了超过1000个"长"文档,速度提高了40倍:-)(参见我的回答). (6认同)
  • @Pekka可能是.使用`mb_substr`是字符串长度的二次方; 即使存在构建数组的开销,这也是线性的.当然,它比你的方法需要更多的内存. (4认同)
  • 我刚试过它.对于长度为100个字符的字符串,preg_split为50%*更快*. (4认同)
  • 这非常优雅,但我很难想象*比`mb_substr()`更快*. (3认同)

Laj*_*ros 34

Preg split会因为内存异常失败超大字符串而mb_substr确实很慢,所以这里有一个简单而有效的代码,我敢肯定,你可以使用:

function nextchar($string, &$pointer){
    if(!isset($string[$pointer])) return false;
    $char = ord($string[$pointer]);
    if($char < 128){
        return $string[$pointer++];
    }else{
        if($char < 224){
            $bytes = 2;
        }elseif($char < 240){
            $bytes = 3;
        }else{
            $bytes = 4;
        }
        $str =  substr($string, $pointer, $bytes);
        $pointer += $bytes;
        return $str;
    }
}
Run Code Online (Sandbox Code Playgroud)

这用于循环遍历char的多字节字符串char,如果我将其更改为下面的代码,性能差异很大:

function nextchar($string, &$pointer){
    if(!isset($string[$pointer])) return false;
    return mb_substr($string, $pointer++, 1, 'UTF-8');
}
Run Code Online (Sandbox Code Playgroud)

使用它使用下面的代码循环一个字符串10000次,第一个代码产生3秒的运行时间,第二个代码产生13秒的运行时间:

function microtime_float(){
    list($usec, $sec) = explode(' ', microtime());
    return ((float)$usec + (float)$sec);
}

$source = 'árvízt?r? tükörfúrógépárvízt?r? tükörfúrógépárvízt?r? tükörfúrógépárvízt?r? tükörfúrógépárvízt?r? tükörfúrógép';

$t = Array(
    0 => microtime_float()
);

for($i = 0; $i < 10000; $i++){
    $pointer = 0;
    while(($chr = nextchar($source, $pointer)) !== false){
        //echo $chr;
    }
}

$t[] = microtime_float();

echo $t[1] - $t[0].PHP_EOL.PHP_EOL;
Run Code Online (Sandbox Code Playgroud)

  • elseif($ char = 252){应该是elseif($ char == 252){ (3认同)

czu*_*zuk 22

回答@Pekla和@Col发表的评论.我preg_split与之相比的弹片mb_substr.

替代文字

图像显示,preg_split耗时1.2 mb_substr秒,而差不多25秒.

这是函数的代码:

function split_preg($str){
    return preg_split('//u', $str, -1);     
}

function split_mb($str){
    $length = mb_strlen($str);
    $chars = array();
    for ($i=0; $i<$length; $i++){
        $chars[] = mb_substr($str, $i, 1);
    }
    $chars[] = "";
    return $chars;
}
Run Code Online (Sandbox Code Playgroud)


Chr*_*asr 9

使用Lajos Meszaros的精彩功能作为灵感,我创建了一个多字节字符串迭代器类.

// Multi-Byte String iterator class
class MbStrIterator implements Iterator
{
    private $iPos   = 0;
    private $iSize  = 0;
    private $sStr   = null;

    // Constructor
    public function __construct(/*string*/ $str)
    {
        // Save the string
        $this->sStr     = $str;

        // Calculate the size of the current character
        $this->calculateSize();
    }

    // Calculate size
    private function calculateSize() {

        // If we're done already
        if(!isset($this->sStr[$this->iPos])) {
            return;
        }

        // Get the character at the current position
        $iChar  = ord($this->sStr[$this->iPos]);

        // If it's a single byte, set it to one
        if($iChar < 128) {
            $this->iSize    = 1;
        }

        // Else, it's multi-byte
        else {

            // Figure out how long it is
            if($iChar < 224) {
                $this->iSize = 2;
            } else if($iChar < 240){
                $this->iSize = 3;
            } else if($iChar < 248){
                $this->iSize = 4;
            } else if($iChar == 252){
                $this->iSize = 5;
            } else {
                $this->iSize = 6;
            }
        }
    }

    // Current
    public function current() {

        // If we're done
        if(!isset($this->sStr[$this->iPos])) {
            return false;
        }

        // Else if we have one byte
        else if($this->iSize == 1) {
            return $this->sStr[$this->iPos];
        }

        // Else, it's multi-byte
        else {
            return substr($this->sStr, $this->iPos, $this->iSize);
        }
    }

    // Key
    public function key()
    {
        // Return the current position
        return $this->iPos;
    }

    // Next
    public function next()
    {
        // Increment the position by the current size and then recalculate
        $this->iPos += $this->iSize;
        $this->calculateSize();
    }

    // Rewind
    public function rewind()
    {
        // Reset the position and size
        $this->iPos     = 0;
        $this->calculateSize();
    }

    // Valid
    public function valid()
    {
        // Return if the current position is valid
        return isset($this->sStr[$this->iPos]);
    }
}
Run Code Online (Sandbox Code Playgroud)

它可以像这样使用

foreach(new MbStrIterator("K?t") as $c) {
    echo "{$c}\n";
}
Run Code Online (Sandbox Code Playgroud)

哪个会输出

K
?
t
Run Code Online (Sandbox Code Playgroud)

或者如果你真的想知道起始字节的位置

foreach(new MbStrIterator("K?t") as $i => $c) {
    echo "{$i}: {$c}\n";
}
Run Code Online (Sandbox Code Playgroud)

哪个会输出

0: K
1: ?
3: t
Run Code Online (Sandbox Code Playgroud)


Pek*_*ica 6

您可以解析字符串的每个字节并确定它是单个 (ASCII) 字符还是多字节字符的开头

\n\n
\n

UTF-8 编码是可变宽度的,每个字符由 1 到 4 个字节表示。每个字节都有 0\xe2\x80\x934 开头的连续“1”位,后跟一个“0”位来指示其类型。2 个或更多“1”位表示那么多字节序列中的第一个字节。

\n
\n\n

您将遍历该字符串,而不是将位置增加 1,而是完整读取当前字符,然后将位置增加该字符的长度。

\n\n

维基百科文章有每个字符的解释表[检索于2010-10-01]

\n\n
   0-127 Single-byte encoding (compatible with US-ASCII)\n 128-191 Second, third, or fourth byte of a multi-byte sequence\n 192-193 Overlong encoding: start of 2-byte sequence, \n         but would encode a code point \xe2\x89\xa4 127\n  ........\n
Run Code Online (Sandbox Code Playgroud)\n