PHP character encoding hell: reading a CSV file with fgets

Geo*_*rge 4 php encoding fgets

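I have a website that receives a CSV file via FTP once a month. For years it was an ASCII file. Now I get UTF-8 one month, UTF-16BE the next, and UTF-16LE the month after that. Maybe next month I'll get UTF-32. fgets returns the byte order mark (BOM) at the start of the UTF files. How can I get PHP to recognize the character encoding automatically? I tried mb_detect_encoding, and it returns ASCII regardless of the file type. I changed my code to read the BOM and pass the character encoding explicitly to mb_convert_encoding. That worked until the latest file, which is UTF-16LE. In this file, the first line is read correctly and all subsequent lines show up as question marks ("?"). What am I doing wrong?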

$fhandle = fopen( $file_in, "r" );
if ( $fhandle === false )
    {
    echo "<p class=redbold>Error opening file $file_in.</p>";
    die();
    }

$i = 0;
while( ( $line = fgets( $fhandle ) ) !== false )
{
$i++;

// Detect encoding on first line. Actual text always begins with string "Document"
if ( $i == 1 )
    {
    $line_start = substr( $line, 0, 4 );
    $line_start_hex = bin2hex( $line_start );
    $utf16_start = 'fffe4400';
    $utf8_start = 'efbbbf44';
    if ( strcmp( $line_start, 'Docu' ) == 0 )
        { $char_encoding = 'ASCII'; }
    elseif ( strcmp( $line_start_hex, 'efbbbf44' ) == 0 )
        {
        $char_encoding = 'UTF-8';
        $line = substr( $line, 3 );
        }
    elseif ( strcmp( $line_start_hex, 'fffe4400' ) == 0 )
        {
        $char_encoding = 'UTF-16LE';
        $line = substr( $line, 2 );
        }
    elseif ( strcmp( $line_start_hex, 'feff4400' ) == 0 )
        {
        $char_encoding = 'UTF-16BE';
        $line = substr( $line, 2 );
        }
    else
        {
        echo "<p class=redbold>Error, unknown character encoding. Line =<br>", $line_start_hex, '</p>';
        require( '../footer.php' );
        die();
        }
    echo "<p>char_encoding = $char_encoding</p>";
    }

// Convert UTF
if ( $char_encoding != 'ASCII' )
    {
    $line = mb_convert_encoding( $line, 'ASCII', $char_encoding);
    }

echo '<p>'; var_dump( $line ); echo '</p>';
}

Output:

    char_encoding = UTF-16LE

string(101) "DocumentNumber,RecordedTS,Title,PageCount,City,TransTaxAccountCode,TotalTransferTax,Description,Name
"

string(83) "???????????????????????????????????????????????????????????????????????????????????"

string(88) "????????????????????????????????????????????????????????????????????????????????????????"

string(84) "????????????????????????????????????????????????????????????????????????????????????"

string(80) "????????????????????????????????????????????????????????????????????????????????"

Esa*_*ija 5

Pass the detection order and the candidate encodings explicitly, and use the strict parameter. Also use file_get_contents: if the file is UTF-16LE, fgets will mangle it.

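<?php
header( "Content-Type: text/html; charset=utf-8" );

// Read the whole file at once; fgets() would split on single 0x0A bytes.
$input = file_get_contents( $file_in );

// Candidates run from most to least restrictive; TRUE enables strict mode.
$encoding = mb_detect_encoding( $input, array(
    "UTF-8",
    "UTF-32",
    "UTF-32BE",
    "UTF-32LE",
    "UTF-16",
    "UTF-16BE",
    "UTF-16LE"
), TRUE );

// mb_detect_encoding() returns false when no candidate matches.
if( $encoding === false ) {
    die( "<p>Could not detect the character encoding.</p>" );
}

if( $encoding !== "UTF-8" ) {
    $input = mb_convert_encoding( $input, "UTF-8", $encoding );
}
echo "<p>$encoding</p>";

// PHP_EOL is the line ending of the platform PHP runs on.
foreach( explode( PHP_EOL, $input ) as $line ) {
    var_dump( $line );
}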
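
To see why fgets breaks on UTF-16LE, here is a minimal sketch that reads two short lines from an in-memory stream. The sample text and variable names are made up for illustration. fgets stops right after the first 0x0A byte, but in UTF-16LE a newline is the two bytes 0A 00, so the trailing 00 ends up at the front of the next line and every following code unit is shifted by one byte.

```php
<?php
// Two lines, "Doc" and "A1", encoded as UTF-16LE (no BOM, for brevity).
$utf16le = mb_convert_encoding("Doc\nA1\n", 'UTF-16LE', 'UTF-8');

// Put the bytes in an in-memory stream so fgets() can read them.
$stream = fopen('php://memory', 'w+b');
fwrite($stream, $utf16le);
rewind($stream);

$first  = fgets($stream);  // ends at the 0A byte, before its paired 00
$second = fgets($stream);  // starts with the orphaned 00 byte
fclose($stream);

echo bin2hex($first), "\n";   // 44006f0063000a
echo bin2hex($second), "\n";  // 00410031000a
```

Decoded as UTF-16LE, the second line's byte pairs become U+4100, U+3100 and so on, none of which exist in ASCII, which would explain the rows of question marks in the question's output.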

Order matters because UTF-8 and UTF-32 are more restrictive, while UTF-16 is extremely permissive: almost any random even-length byte sequence is valid UTF-16.
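
Since the files in the question start with a byte order mark, checking the BOM on the whole input is another option, and order matters there too. The following is only a sketch: detect_bom is a hypothetical helper, not part of mbstring, and treating BOM-less input as UTF-8 is an assumption that happens to cover plain ASCII.

```php
<?php
/**
 * Hypothetical helper: guess the encoding from a leading byte order mark.
 * Returns [encoding name, BOM length in bytes].
 */
function detect_bom(string $bytes): array
{
    // Longer marks first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return [$encoding, strlen($bom)];
        }
    }
    // Assumption: input without a BOM is ASCII or UTF-8.
    return ['UTF-8', 0];
}

// Usage: strip the BOM, then convert the remaining bytes to UTF-8.
$raw = "\xFF\xFE" . "D\x00o\x00c\x00";
[$encoding, $bomLength] = detect_bom($raw);
$text = mb_convert_encoding(substr($raw, $bomLength), 'UTF-8', $encoding);
```

A UTF-16LE file whose first character is U+0000 would be mistaken for UTF-32LE by this check, which should not matter for a CSV that starts with a header row.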

The only way to preserve all of the information is to convert to a Unicode encoding, not ASCII.
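
As a small illustration of that point, the sketch below converts the same UTF-16LE bytes to ASCII and to UTF-8. The sample string is invented, and the ASCII result assumes mbstring's default substitute character has not been changed.

```php
<?php
// "Zoë,Åre" written with \u{} escapes so the source file's own encoding
// does not matter.
$original = "Zo\u{EB},\u{C5}re\n";
$utf16le  = mb_convert_encoding($original, 'UTF-16LE', 'UTF-8');

// Converting to ASCII replaces every non-ASCII character.
$asAscii = mb_convert_encoding($utf16le, 'ASCII', 'UTF-16LE');

// Converting to UTF-8 round-trips without loss.
$asUtf8 = mb_convert_encoding($utf16le, 'UTF-8', 'UTF-16LE');

var_dump($asAscii === $original); // false: ë and Å were substituted
var_dump($asUtf8 === $original);  // true
```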