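DOMDocument seems to convert Chinese characters into codes. For instance,
你的乱发 becomes ä½ çš„ä¹±å‘
How can I keep Chinese or other foreign-language text as it is, instead of having it converted into codes?
Here is my simple test: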
$dom = new DOMDocument();
$dom->loadHTML($html);
If I add the following line before loadHTML(),
$html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8");
I get
&#20320;&#30340;&#20081;&#21457;
Even though the converted codes will be displayed as Chinese characters, &#20320;&#30340;&#20081;&#21457; is still not the 你的乱发 I am after...
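A minimal sketch of one commonly suggested workaround, assuming the input string is already UTF-8: declare the charset inside the markup so that libxml does not fall back to its default encoding. The helper name `load_utf8_html` and the sample markup are hypothetical, and behaviour can differ between PHP and libxml versions. Separately, the `HTML-ENTITIES` pseudo-encoding of mbstring is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper (not from the original post).
// Assumption: $html is a UTF-8 encoded HTML fragment or document.
function load_utf8_html(string $html): DOMDocument
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    // Prepending a charset declaration tells libxml how to decode the bytes.
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom  = load_utf8_html('<p>你的乱发</p>');
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

// DOM string properties such as textContent are returned as UTF-8.
echo $text, PHP_EOL;
```

The extra `<meta>` element ends up in the parsed document, so it may need to be removed again before serialising the result.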
How do I convert a Unicode string to HTML entities? (Hexadecimal, not decimal.)
For example, convert Français to Fran&#xE7;ais.
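One possible way to do this, sketched under the assumption that the input is valid UTF-8, is to replace every non-ASCII code point with its hexadecimal numeric character reference. The function name `to_hex_entities` is hypothetical; `preg_replace_callback`, `mb_ord` and `dechex` are standard PHP functions (`mb_ord` requires PHP 7.2 or later, arrow functions PHP 7.4 or later).

```php
<?php
// Hypothetical helper: encode all non-ASCII characters as hex entities.
// Assumption: $text is valid UTF-8.
function to_hex_entities(string $text): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u', // one non-ASCII code point; /u enables UTF-8 mode
        static fn(array $m): string =>
            '&#x' . strtoupper(dechex(mb_ord($m[0], 'UTF-8'))) . ';',
        $text
    );

    // preg_replace_callback returns null on error (e.g. invalid UTF-8);
    // in that case hand back the input unchanged.
    return $result ?? $text;
}

$encoded = to_hex_entities('Français');
echo $encoded, PHP_EOL; // prints Fran&#xE7;ais
```

ASCII characters, including `&` and `<`, are left untouched, so this is not a substitute for `htmlspecialchars` when escaping is needed.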
As I understand it, loadHTML treats the content as Latin-1 by default, and I want it to be handled as UTF-8 instead. The code is as follows:
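// Get data from a website
function get_url_contents($url){
    $crl = curl_init();
    $timeout = 5; // connection timeout in seconds
    // Note: CURLOPT_ENCODING sets the Accept-Encoding request header
    // (e.g. 'gzip'); it does not select a character set.
    curl_setopt($crl, CURLOPT_ENCODING, 'UTF-8');
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}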
// Now here is the DOMDocument part
function get_all_meta_tags($html){
    // On entry $html holds a URL; it is replaced by the fetched markup
    $html = get_url_contents($html);
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->encoding = 'UTF-8';
    // Assumed step: the question discusses loadHTML, but the call is
    // absent from the snippet as posted
    $doc->loadHTML($html);
    $nodes = $doc->getElementsByTagName('title');
    $title = $nodes->item(0)->nodeValue;
    $arr['title'] = $title;
    $nodes = $doc->getElementsByTagName('h1');
    $h1 = $nodes->item(0)->nodeValue;
    $arr['h1'] = $h1;
    $metas = $doc->getElementsByTagName('meta');
    for ($i …
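As a rough sketch of one way around the Latin-1 default, assuming the fetched markup is UTF-8: encode every non-ASCII character as a numeric entity before parsing, so that the input handed to loadHTML is pure ASCII and the parser's encoding guess no longer matters. The function `extract_tags_from_html`, the shape of its result array, and the sample markup are all hypothetical; the example takes a markup string rather than a URL so that it runs without network access, and its loop over the `meta` elements is not a reconstruction of the truncated loop above.

```php
<?php
// Hypothetical variant of get_all_meta_tags() that takes markup, not a URL.
// Assumption: $html is UTF-8.
function extract_tags_from_html(string $html): array
{
    // Turn every code point from U+0080 upward into a numeric entity.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $arr = ['title' => null, 'h1' => null, 'meta' => []];

    // item(0) is null when the element is missing, so check before reading.
    $title = $doc->getElementsByTagName('title')->item(0);
    if ($title !== null) {
        $arr['title'] = $title->textContent;
    }

    $h1 = $doc->getElementsByTagName('h1')->item(0);
    if ($h1 !== null) {
        $arr['h1'] = $h1->textContent;
    }

    // Collect <meta name="..." content="..."> pairs.
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = $meta->getAttribute('name');
        if ($name !== '') {
            $arr['meta'][$name] = $meta->getAttribute('content');
        }
    }

    return $arr;
}

$sample = '<html><head><title>Français</title>'
        . '<meta name="description" content="你的乱发"></head>'
        . '<body><h1>Café</h1></body></html>';

$result = extract_tags_from_html($sample);
print_r($result);
```

If the fetched page is not UTF-8, it would first have to be converted, for example with `mb_convert_encoding`, using the charset reported by the server or the page itself.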