How to replace Microsoft-encoded quotes in PHP

Mis*_*a M 68 php string encoding character-encoding

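Because of encoding issues in my application, I need to replace the Microsoft Word versions of single and double quotation marks (“ ” ‘ ’) with regular quotes (' and "). I don't need them to be HTML entities, and I can't change my database schema.

I have two options: use a regular expression or an associative array.

Is there a better way?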

Jus*_*nic 102

I found an answer to this question. You need to use PHP's iconv() function; it takes just one line of code:

// replace Microsoft Word version of single  and double quotations marks (“ ” ‘ ’) with  regular quotes (' and ")
$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);     

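  • Thanks, but in my case I needed to pick the correct character encoding (CP1252 rather than UTF-8): `$output = iconv('CP1252', 'ASCII//TRANSLIT', $input);` (9 upvotes)
  • Yes, this worked for me. I recommend this over the accepted answer :) (3 upvotes)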
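
Since iconv() returns false on failure, the one-liner can be wrapped defensively. The following is a minimal sketch, not part of the answer above: the helper name `to_ascii_quotes` and the choice to fall back to the original string are assumptions for illustration. Transliteration results also depend on the iconv implementation PHP was built against, so no specific output is shown for non-ASCII input.

```php
<?php
// Hypothetical helper (name and fallback behaviour are assumptions).
function to_ascii_quotes(string $input, string $fromEncoding = 'UTF-8'): string
{
    // iconv() returns false when the conversion fails, for example when
    // $input is not valid in $fromEncoding. The @ suppresses the notice.
    $output = @iconv($fromEncoding, 'ASCII//TRANSLIT', $input);

    // Fall back to the untouched input instead of returning false.
    return $output === false ? $input : $output;
}

// Pass 'CP1252' as the second argument when the data comes from a
// Windows-1252 source, as suggested in the comment above.
echo to_ascii_quotes('plain "text"'), "\n";
```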

Pas*_*TIN 86

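Since you only want to replace a few specific, well-identified characters, I would go with str_replace and an array: you clearly don't need the heavy artillery that regular expressions would bring ;-)

And if you run into other special characters (damn copy-paste from Word...), you can always add them to that array as they are identified.


Edit: the best answer I can give to your comment is probably this link: Convert Smart Quotes with PHP

And the associated code (quoting that page):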

function convert_smart_quotes($string) 
{ 
    $search = array(chr(145), 
                    chr(146), 
                    chr(147), 
                    chr(148), 
                    chr(151)); 

    $replace = array("'", 
                     "'", 
                     '"', 
                     '"', 
                     '-'); 

    return str_replace($search, $replace, $string); 
} 

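(I don't have MS Word on this computer, so I can't test it myself.)

I don't remember exactly what we used at work (I wasn't the one who had to deal with that kind of input), but it was the same kind of thing...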


Gum*_*mbo 36

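Your Microsoft-encoded quotes are probably typographic quotation marks. If you know the encoding of the string in which you want to replace them, you can simply replace them with str_replace.

Here is an example for UTF-8, but using a single mapping array with strtr: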

$quotes = array(
    "\xC2\xAB"     => '"', // « (U+00AB) in UTF-8
    "\xC2\xBB"     => '"', // » (U+00BB) in UTF-8
    "\xE2\x80\x98" => "'", // ‘ (U+2018) in UTF-8
    "\xE2\x80\x99" => "'", // ’ (U+2019) in UTF-8
    "\xE2\x80\x9A" => "'", // ‚ (U+201A) in UTF-8
    "\xE2\x80\x9B" => "'", // ‛ (U+201B) in UTF-8
    "\xE2\x80\x9C" => '"', // “ (U+201C) in UTF-8
    "\xE2\x80\x9D" => '"', // ” (U+201D) in UTF-8
    "\xE2\x80\x9E" => '"', // „ (U+201E) in UTF-8
    "\xE2\x80\x9F" => '"', // ‟ (U+201F) in UTF-8
    "\xE2\x80\xB9" => "'", // ‹ (U+2039) in UTF-8
    "\xE2\x80\xBA" => "'", // › (U+203A) in UTF-8
);
$str = strtr($str, $quotes);

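If you need another encoding, you can use mb_convert_encoding to convert the keys.

  • @R..: That is exactly the problem: many people know little about character encodings and/or which character encoding they are using. (3 upvotes)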
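
As a rough sketch of converting the keys with mb_convert_encoding, the snippet below re-encodes a UTF-8 quote map into Windows-1252. The helper name `convert_quote_map` and the round-trip check are assumptions added for illustration. The check matters because some characters (U+201B, for example) do not exist in Windows-1252, and mbstring would otherwise substitute them with a placeholder that must not end up as a key.

```php
<?php
// Hypothetical helper: re-encode the keys of a UTF-8 quote map.
function convert_quote_map(array $utf8Map, string $targetEncoding): array
{
    $converted = [];
    foreach ($utf8Map as $utf8Key => $replacement) {
        $key = mb_convert_encoding($utf8Key, $targetEncoding, 'UTF-8');

        // Skip characters the target encoding cannot represent: they come
        // back as a substitute character and fail this round trip.
        if (mb_convert_encoding($key, 'UTF-8', $targetEncoding) !== $utf8Key) {
            continue;
        }
        $converted[$key] = $replacement;
    }
    return $converted;
}

$utf8Quotes = [
    "\xE2\x80\x98" => "'", // U+2018
    "\xE2\x80\x99" => "'", // U+2019
    "\xE2\x80\x9B" => "'", // U+201B, not available in Windows-1252
    "\xE2\x80\x9C" => '"', // U+201C
    "\xE2\x80\x9D" => '"', // U+201D
];

$cp1252Quotes = convert_quote_map($utf8Quotes, 'Windows-1252');

// Windows-1252 input: 0x93/0x94 are the double quotes, 0x92 the apostrophe.
echo strtr("\x93Hello\x94, it\x92s", $cp1252Quotes), "\n"; // prints "Hello", it's
```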

the*_*dow 11

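If, like me, you arrived here with a mass of broken ASCII/MS Word characters that are doing strange things to your CMS or RTE, and iconv doesn't work, then this crazy function might be just the thing for you.

When saving this function to a file, make sure the file is encoded as UTF-8.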

<?php
    /**
     * fixMSWord
     *
     * Replace ASCII chars with UTF-8. Note there are ASCII characters that don't
     * correctly map and will be replaced by spaces.
     *
     * @author      Robin Cafolla
     * @date        2013-03-22
     */
    function fixMSWord($string) {
        $map = Array(
            '33' => '!', '34' => '"', '35' => '#', '36' => '$', '37' => '%', '38' => '&', '39' => "'", '40' => '(', '41' => ')', '42' => '*',
            '43' => '+', '44' => ',', '45' => '-', '46' => '.', '47' => '/', '48' => '0', '49' => '1', '50' => '2', '51' => '3', '52' => '4',
            '53' => '5', '54' => '6', '55' => '7', '56' => '8', '57' => '9', '58' => ':', '59' => ';', '60' => '<', '61' => '=', '62' => '>',
            '63' => '?', '64' => '@', '65' => 'A', '66' => 'B', '67' => 'C', '68' => 'D', '69' => 'E', '70' => 'F', '71' => 'G', '72' => 'H',
            '73' => 'I', '74' => 'J', '75' => 'K', '76' => 'L', '77' => 'M', '78' => 'N', '79' => 'O', '80' => 'P', '81' => 'Q', '82' => 'R',
            '83' => 'S', '84' => 'T', '85' => 'U', '86' => 'V', '87' => 'W', '88' => 'X', '89' => 'Y', '90' => 'Z', '91' => '[', '92' => '\\',
            '93' => ']', '94' => '^', '95' => '_', '96' => '`', '97' => 'a', '98' => 'b', '99' => 'c', '100'=> 'd', '101'=> 'e', '102'=> 'f',
            '103'=> 'g', '104'=> 'h', '105'=> 'i', '106'=> 'j', '107'=> 'k', '108'=> 'l', '109'=> 'm', '110'=> 'n', '111'=> 'o', '112'=> 'p',
            '113'=> 'q', '114'=> 'r', '115'=> 's', '116'=> 't', '117'=> 'u', '118'=> 'v', '119'=> 'w', '120'=> 'x', '121'=> 'y', '122'=> 'z',
            '123'=> '{', '124'=> '|', '125'=> '}', '126'=> '~', '127'=> ' ', '128'=> '&#8364;', '129'=> ' ', '130'=> ',', '131'=> ' ', '132'=> '"',
            '133'=> '.', '134'=> ' ', '135'=> ' ', '136'=> '^', '137'=> ' ', '138'=> ' ', '139'=> '<', '140'=> ' ', '141'=> ' ', '142'=> ' ',
            '143'=> ' ', '144'=> ' ', '145'=> "'", '146'=> "'", '147'=> '"', '148'=> '"', '149'=> '.', '150'=> '-', '151'=> '-', '152'=> '~',
            '153'=> ' ', '154'=> ' ', '155'=> '>', '156'=> ' ', '157'=> ' ', '158'=> ' ', '159'=> ' ', '160'=> ' ', '161'=> '¡', '162'=> '¢',
            '163'=> '£', '164'=> '¤', '165'=> '¥', '166'=> '¦', '167'=> '§', '168'=> '¨', '169'=> '©', '170'=> 'ª', '171'=> '«', '172'=> '¬',
            '173'=> '­', '174'=> '®', '175'=> '¯', '176'=> '°', '177'=> '±', '178'=> '²', '179'=> '³', '180'=> '´', '181'=> 'µ', '182'=> '¶',
            '183'=> '·', '184'=> '¸', '185'=> '¹', '186'=> 'º', '187'=> '»', '188'=> '¼', '189'=> '½', '190'=> '¾', '191'=> '¿', '192'=> 'À',
            '193'=> 'Á', '194'=> 'Â', '195'=> 'Ã', '196'=> 'Ä', '197'=> 'Å', '198'=> 'Æ', '199'=> 'Ç', '200'=> 'È', '201'=> 'É', '202'=> 'Ê',
            '203'=> 'Ë', '204'=> 'Ì', '205'=> 'Í', '206'=> 'Î', '207'=> 'Ï', '208'=> 'Ð', '209'=> 'Ñ', '210'=> 'Ò', '211'=> 'Ó', '212'=> 'Ô',
            '213'=> 'Õ', '214'=> 'Ö', '215'=> '×', '216'=> 'Ø', '217'=> 'Ù', '218'=> 'Ú', '219'=> 'Û', '220'=> 'Ü', '221'=> 'Ý', '222'=> 'Þ',
            '223'=> 'ß', '224'=> 'à', '225'=> 'á', '226'=> 'â', '227'=> 'ã', '228'=> 'ä', '229'=> 'å', '230'=> 'æ', '231'=> 'ç', '232'=> 'è',
            '233'=> 'é', '234'=> 'ê', '235'=> 'ë', '236'=> 'ì', '237'=> 'í', '238'=> 'î', '239'=> 'ï', '240'=> 'ð', '241'=> 'ñ', '242'=> 'ò',
            '243'=> 'ó', '244'=> 'ô', '245'=> 'õ', '246'=> 'ö', '247'=> '÷', '248'=> 'ø', '249'=> 'ù', '250'=> 'ú', '251'=> 'û', '252'=> 'ü',
            '253'=> 'ý', '254'=> 'þ', '255'=> 'ÿ'
        );

        $search = Array();
        $replace = Array();

        foreach ($map as $s => $r) {
            $search[] = chr((int)$s);
            $replace[] = $r;
        }

        return str_replace($search, $replace, $string);
    }

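  • You decided to put a license on something that basically amounts to... an array? (3 upvotes)
  • The license you put in your answer doesn't matter; all user content is licensed under **cc by-sa 3.0 with attribution required**. You can see this in the footer. This code is no longer MIT-licensed. (3 upvotes)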

Nob*_*ift 6

Every previous answer except Gumbo's will mangle Unicode strings:

echo convert_smart_quotes("This is Yi: \xea\x91\x91. Point \xe2\x92\x92 this breaks Yi. Yi broke\xe2\x80\x93why? I need a longer\xe2\x80\x93\xe2\x80\x93point. This makes Han \xe5\x97\x97 mad.");

This results in:

This is Yi: ?''. Point ?'' this breaks Yi. Yi broke?"why? I need a longer?"?"point. This makes Han ?-- mad.

iconv:

$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);

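This results in:

PHP Notice: iconv(): Detected an illegal character in input string in php shell code on line 1

You can change it to //IGNORE, which drops the characters but does not translate them.

If the quotes are in Unicode and you need to replace them, use Gumbo's answer. For Microsoft quotes encoded in CP1252, this is the best way to replace them: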

function convert_cp1252_to_ascii($input, $default = '') {
    if ($input === null || $input == '') {
        return $default;
    }

    // https://en.wikipedia.org/wiki/UTF-8
    // https://en.wikipedia.org/wiki/ISO/IEC_8859-1
    // https://en.wikipedia.org/wiki/Windows-1252
    // http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
    $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
    if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
        /*
         * Use the search/replace arrays if a character needs to be replaced with
         * something other than its Unicode equivalent.
         */

        $replace = array(
            128 => "E",    // http://www.fileformat.info/info/unicode/char/20AC/index.htm EURO SIGN
            129 => "",     // UNDEFINED
            130 => ",",    // http://www.fileformat.info/info/unicode/char/201A/index.htm SINGLE LOW-9 QUOTATION MARK
            131 => "f",    // http://www.fileformat.info/info/unicode/char/0192/index.htm LATIN SMALL LETTER F WITH HOOK
            132 => ",,",   // http://www.fileformat.info/info/unicode/char/201e/index.htm DOUBLE LOW-9 QUOTATION MARK
            133 => "...",  // http://www.fileformat.info/info/unicode/char/2026/index.htm HORIZONTAL ELLIPSIS
            134 => "t",    // http://www.fileformat.info/info/unicode/char/2020/index.htm DAGGER
            135 => "T",    // http://www.fileformat.info/info/unicode/char/2021/index.htm DOUBLE DAGGER
            136 => "^",    // http://www.fileformat.info/info/unicode/char/02c6/index.htm MODIFIER LETTER CIRCUMFLEX ACCENT
            137 => "%",    // http://www.fileformat.info/info/unicode/char/2030/index.htm PER MILLE SIGN
            138 => "S",    // http://www.fileformat.info/info/unicode/char/0160/index.htm LATIN CAPITAL LETTER S WITH CARON
            139 => "<",    // http://www.fileformat.info/info/unicode/char/2039/index.htm SINGLE LEFT-POINTING ANGLE QUOTATION MARK
            140 => "OE",   // http://www.fileformat.info/info/unicode/char/0152/index.htm LATIN CAPITAL LIGATURE OE
            141 => "",     // UNDEFINED
            142 => "Z",    // http://www.fileformat.info/info/unicode/char/017d/index.htm LATIN CAPITAL LETTER Z WITH CARON
            143 => "",     // UNDEFINED
            144 => "",     // UNDEFINED
            145 => "'",    // http://www.fileformat.info/info/unicode/char/2018/index.htm LEFT SINGLE QUOTATION MARK
            146 => "'",    // http://www.fileformat.info/info/unicode/char/2019/index.htm RIGHT SINGLE QUOTATION MARK
            147 => "\"",   // http://www.fileformat.info/info/unicode/char/201c/index.htm LEFT DOUBLE QUOTATION MARK
            148 => "\"",   // http://www.fileformat.info/info/unicode/char/201d/index.htm RIGHT DOUBLE QUOTATION MARK
            149 => "*",    // http://www.fileformat.info/info/unicode/char/2022/index.htm BULLET
            150 => "-",    // http://www.fileformat.info/info/unicode/char/2013/index.htm EN DASH
            151 => "--",   // http://www.fileformat.info/info/unicode/char/2014/index.htm EM DASH
            152 => "~",    // http://www.fileformat.info/info/unicode/char/02DC/index.htm SMALL TILDE
            153 => "TM",   // http://www.fileformat.info/info/unicode/char/2122/index.htm TRADE MARK SIGN
            154 => "s",    // http://www.fileformat.info/info/unicode/char/0161/index.htm LATIN SMALL LETTER S WITH CARON
            155 => ">",    // http://www.fileformat.info/info/unicode/char/203A/index.htm SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
            156 => "oe",   // http://www.fileformat.info/info/unicode/char/0153/index.htm LATIN SMALL LIGATURE OE
            157 => "",     // UNDEFINED
            158 => "z",    // http://www.fileformat.info/info/unicode/char/017E/index.htm LATIN SMALL LETTER Z WITH CARON
            159 => "Y",    // http://www.fileformat.info/info/unicode/char/0178/index.htm LATIN CAPITAL LETTER Y WITH DIAERESIS
        );

        $find = array();
        foreach (array_keys($replace) as $key) {
            $find[] = chr($key);
        }

        $input = str_replace($find, array_values($replace), $input);
        /*
         * Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
         * and control characters, always convert from Windows-1252 to UTF-8.
         */
        $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
    }
    return $input;
}

Taken from this answer, with some modifications. Use this function if you want control over what gets found and replaced.
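
The distinction drawn in this answer (byte-wise replacement is only safe on CP1252 input, while UTF-8 input needs multi-byte keys) could be combined into a small dispatcher. This is only a sketch: the name `normalize_quotes` is hypothetical, only the four most common quote characters are handled, and it assumes that any input that is not valid UTF-8 is Windows-1252, which may not hold for your data.

```php
<?php
// Hypothetical dispatcher; the "not UTF-8 means CP1252" rule is an assumption.
function normalize_quotes(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        // Valid UTF-8: match whole multi-byte sequences so that other
        // characters (Yi, Han, ...) are left untouched.
        return strtr($input, [
            "\xE2\x80\x98" => "'", // U+2018
            "\xE2\x80\x99" => "'", // U+2019
            "\xE2\x80\x9C" => '"', // U+201C
            "\xE2\x80\x9D" => '"', // U+201D
        ]);
    }

    // Otherwise treat the input as Windows-1252 and replace single bytes.
    return strtr($input, [
        "\x91" => "'",
        "\x92" => "'",
        "\x93" => '"',
        "\x94" => '"',
    ]);
}

echo normalize_quotes("\x93quoted\x94"), "\n";                 // CP1252 input
echo normalize_quotes("\xE2\x80\x9Cquoted\xE2\x80\x9D"), "\n"; // UTF-8 input
```

Both calls print `"quoted"`, and a UTF-8 string such as "Yi: \xEA\x91\x91" passes through unchanged because its bytes are never matched individually.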



cee*_*yoz 5

We used the following. It handles a few special characters.

$text = str_replace(chr(130), ',', $text);    // Baseline single quote
$text = str_replace(chr(132), '"', $text);    // Baseline double quote
$text = str_replace(chr(133), '...', $text);  // Ellipsis
$text = str_replace(chr(145), "'", $text);    // Left single quote
$text = str_replace(chr(146), "'", $text);    // Right single quote
$text = str_replace(chr(147), '"', $text);    // Left double quote
$text = str_replace(chr(148), '"', $text);    // Right double quote

$text = mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8');