如何使用索引逐字符迭代UTF-8字符串?
当您使用括号运算符访问UTF-8字符串时,$str[0]
utf编码的字符由2个或更多元素组成.
例如:
$str = "K?t";
$str[0] = "K";
$str[1] = "?";
$str[2] = "?";
$str[3] = "t";
Run Code Online (Sandbox Code Playgroud)
但我希望:
$str[0] = "K";
$str[1] = "?";
$str[2] = "t";
Run Code Online (Sandbox Code Playgroud)
这是可能的,mb_substr
但这是非常缓慢的,即.
mb_substr($str, 0, 1) = "K"
mb_substr($str, 1, 1) = "?"
mb_substr($str, 2, 1) = "t"
Run Code Online (Sandbox Code Playgroud)
是否有另一种方法来逐字符串字符串而不使用mb_substr
?
var*_*tec 58
使用preg_split.使用"u"修饰符,它支持UTF-8 unicode.
$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
Run Code Online (Sandbox Code Playgroud)
Laj*_*ros 34
Preg split会因为内存异常而失败超大字符串而mb_substr确实很慢,所以这里有一个简单而有效的代码,我敢肯定,你可以使用:
function nextchar($string, &$pointer){
if(!isset($string[$pointer])) return false;
$char = ord($string[$pointer]);
if($char < 128){
return $string[$pointer++];
}else{
if($char < 224){
$bytes = 2;
}elseif($char < 240){
$bytes = 3;
}else{
$bytes = 4;
}
$str = substr($string, $pointer, $bytes);
$pointer += $bytes;
return $str;
}
}
Run Code Online (Sandbox Code Playgroud)
这用于循环遍历char的多字节字符串char,如果我将其更改为下面的代码,性能差异很大:
function nextchar($string, &$pointer){
if(!isset($string[$pointer])) return false;
return mb_substr($string, $pointer++, 1, 'UTF-8');
}
Run Code Online (Sandbox Code Playgroud)
使用它使用下面的代码循环一个字符串10000次,第一个代码产生3秒的运行时间,第二个代码产生13秒的运行时间:
function microtime_float(){
list($usec, $sec) = explode(' ', microtime());
return ((float)$usec + (float)$sec);
}
$source = 'árvízt?r? tükörfúrógépárvízt?r? tükörfúrógépárvízt?r? tükörfúrógépárvízt?r? tükörfúrógépárvízt?r? tükörfúrógép';
$t = Array(
0 => microtime_float()
);
for($i = 0; $i < 10000; $i++){
$pointer = 0;
while(($chr = nextchar($source, $pointer)) !== false){
//echo $chr;
}
}
$t[] = microtime_float();
echo $t[1] - $t[0].PHP_EOL.PHP_EOL;
Run Code Online (Sandbox Code Playgroud)
czu*_*zuk 22
回答@Pekla和@Col发表的评论.我preg_split
与之相比的弹片mb_substr
.
图像显示,preg_split
耗时1.2 mb_substr
秒,而差不多25秒.
这是函数的代码:
function split_preg($str){
return preg_split('//u', $str, -1);
}
function split_mb($str){
$length = mb_strlen($str);
$chars = array();
for ($i=0; $i<$length; $i++){
$chars[] = mb_substr($str, $i, 1);
}
$chars[] = "";
return $chars;
}
Run Code Online (Sandbox Code Playgroud)
使用Lajos Meszaros的精彩功能作为灵感,我创建了一个多字节字符串迭代器类.
// Multi-Byte String iterator class
class MbStrIterator implements Iterator
{
private $iPos = 0;
private $iSize = 0;
private $sStr = null;
// Constructor
public function __construct(/*string*/ $str)
{
// Save the string
$this->sStr = $str;
// Calculate the size of the current character
$this->calculateSize();
}
// Calculate size
private function calculateSize() {
// If we're done already
if(!isset($this->sStr[$this->iPos])) {
return;
}
// Get the character at the current position
$iChar = ord($this->sStr[$this->iPos]);
// If it's a single byte, set it to one
if($iChar < 128) {
$this->iSize = 1;
}
// Else, it's multi-byte
else {
// Figure out how long it is
if($iChar < 224) {
$this->iSize = 2;
} else if($iChar < 240){
$this->iSize = 3;
} else if($iChar < 248){
$this->iSize = 4;
} else if($iChar == 252){
$this->iSize = 5;
} else {
$this->iSize = 6;
}
}
}
// Current
public function current() {
// If we're done
if(!isset($this->sStr[$this->iPos])) {
return false;
}
// Else if we have one byte
else if($this->iSize == 1) {
return $this->sStr[$this->iPos];
}
// Else, it's multi-byte
else {
return substr($this->sStr, $this->iPos, $this->iSize);
}
}
// Key
public function key()
{
// Return the current position
return $this->iPos;
}
// Next
public function next()
{
// Increment the position by the current size and then recalculate
$this->iPos += $this->iSize;
$this->calculateSize();
}
// Rewind
public function rewind()
{
// Reset the position and size
$this->iPos = 0;
$this->calculateSize();
}
// Valid
public function valid()
{
// Return if the current position is valid
return isset($this->sStr[$this->iPos]);
}
}
Run Code Online (Sandbox Code Playgroud)
它可以像这样使用
foreach(new MbStrIterator("K?t") as $c) {
echo "{$c}\n";
}
Run Code Online (Sandbox Code Playgroud)
哪个会输出
K
?
t
Run Code Online (Sandbox Code Playgroud)
或者如果你真的想知道起始字节的位置
foreach(new MbStrIterator("K?t") as $i => $c) {
echo "{$i}: {$c}\n";
}
Run Code Online (Sandbox Code Playgroud)
哪个会输出
0: K
1: ?
3: t
Run Code Online (Sandbox Code Playgroud)
您可以解析字符串的每个字节并确定它是单个 (ASCII) 字符还是多字节字符的开头:
\n\n\n\n\nUTF-8 编码是可变宽度的,每个字符由 1 到 4 个字节表示。每个字节都有 0\xe2\x80\x934 开头的连续“1”位,后跟一个“0”位来指示其类型。2 个或更多“1”位表示那么多字节序列中的第一个字节。
\n
您将遍历该字符串,而不是将位置增加 1,而是完整读取当前字符,然后将位置增加该字符的长度。
\n\n维基百科文章有每个字符的解释表[检索于2010-10-01]:
\n\n 0-127 Single-byte encoding (compatible with US-ASCII)\n 128-191 Second, third, or fourth byte of a multi-byte sequence\n 192-193 Overlong encoding: start of 2-byte sequence, \n but would encode a code point \xe2\x89\xa4 127\n ........\n
Run Code Online (Sandbox Code Playgroud)\n