jev*_*von 97
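Use html2text (example HTML to text), licensed under the Eclipse Public License. It uses PHP's DOM methods to load the HTML, then iterates over the resulting DOM to extract plain text. Usage: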
// when installed using the Composer package
$text = Html2Text\Html2Text::convert($html);
// usage when installed using html2text.php
require('html2text.php');
$text = convert_html_to_text($html);
It is not complete, but it is open source and contributions are welcome.
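The load-then-walk approach described above can be sketched roughly as follows. This is not the html2text library's actual code; `dom_to_text`, `walk_node`, and the list of block-level tags are hypothetical names and choices made for illustration, and the XML prolog is a commonly used workaround to make `loadHTML` treat the input as UTF-8.

```php
<?php
// Hypothetical sketch: load HTML into a DOM, then walk it to collect plain text.
function dom_to_text(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: the input is UTF-8; the prolog hints that to libxml.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $text = walk_node($doc);
    // Drop spaces around line breaks, then cap runs of blank lines.
    $text = preg_replace('/[ \t]*\n[ \t]*/', "\n", $text);
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    return trim($text);
}

function walk_node(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        // Collapse whitespace inside text nodes, as a browser would.
        return preg_replace('/\s+/', ' ', $node->nodeValue);
    }
    if (!($node instanceof DOMElement) && !($node instanceof DOMDocument)) {
        return ''; // comments, doctype, processing instructions
    }
    $name = strtolower($node->nodeName);
    if ($name === 'script' || $name === 'style') {
        return '';
    }
    if ($name === 'br') {
        return "\n";
    }
    $text = '';
    foreach ($node->childNodes as $child) {
        $text .= walk_node($child);
    }
    // Illustrative choice of tags that start and end on their own line.
    $blocks = array('p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li', 'tr');
    if (in_array($name, $blocks, true)) {
        $text = "\n" . trim($text) . "\n";
    }
    return $text;
}
```

Walking the tree rather than pattern-matching tags means nested or malformed markup is normalised by the parser first, which is the main reason to prefer this over regular expressions.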
Issues with other conversion scripts:
T.T*_*dua 17
Here is another solution:
$cleaner_input = strip_tags($text);
For other variations of sanitizing functions, see:
https://github.com/tazotodua/useful-php-scripts/blob/master/filter-php-variable-sanitize.php
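Note that `strip_tags` on its own leaves HTML entities such as `&amp;` in the result. A minimal sketch that also decodes them, assuming UTF-8 input (the function name `strip_to_text` is made up for this example):

```php
<?php
// Hypothetical helper: strip tags first, then decode entities.
function strip_to_text(string $html): string
{
    // Remove tags before decoding, so an escaped "&lt;b&gt;" is not
    // mistaken for a real tag and stripped.
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of spaces and tabs left behind by removed tags.
    return trim(preg_replace('/[ \t]+/', ' ', $text));
}

echo strip_to_text('<p>Tom &amp; Jerry &lt;3</p>'); // Tom & Jerry <3
```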
lke*_*ler 13
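Converting from HTML to text using DOMDocument is a viable solution. Consider HTML2Text, which requires PHP5:
Regarding UTF-8, the note on the "howto" page states:
PHP's own support for unicode is quite poor, and it does not always handle utf-8 correctly. Although the html2text script uses unicode-safe methods (without needing the mbstring module), it cannot always cope with PHP's own handling of encodings. PHP does not really understand unicode or encodings like utf-8, and uses the base encoding of the system, which tends to be one of the ISO-8859 family. As a result, what may look to you like a valid character in your text editor, in either utf-8 or single-byte, may well be misinterpreted by PHP. So even though you think you are feeding a valid character into html2text, you may well not be.
The author provides several workarounds and states that version 2 of HTML2Text (using DOMDocument) has UTF-8 support.
Note the restrictions on commercial use.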
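One way to reduce the encoding surprises described above is to normalise the input to UTF-8 before handing it to any converter. The sketch below is not one of the author's documented workarounds; it assumes the mbstring extension is installed and that anything that is not valid UTF-8 is ISO-8859-1, and `ensure_utf8` is a hypothetical name.

```php
<?php
// Hypothetical helper: return $html as UTF-8.
// Assumption: input that fails UTF-8 validation is in $fallback (ISO-8859-1 by default).
function ensure_utf8(string $html, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($html, 'UTF-8', $fallback);
}
```

The fallback encoding is a guess; if the source of the HTML declares a charset (an HTTP header or a meta tag), that value is a better choice than the default.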
pes*_*669 11
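There is the trusty strip_tags function. It is not pretty, though: it only sanitizes. You can combine it with a string replace to get the underscores you want.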
<?php
// to strip all tags and wrap italics with underscore
strip_tags(str_replace(array("<i>", "</i>"), array("_", "_"), $text));
// to preserve anchors...
str_replace("|a", "<a", strip_tags(str_replace("<a", "|a", $text)));
?>
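A variation on the same idea, as a rough sketch: `strip_tags` accepts a second argument listing tags to keep, which avoids the placeholder swap for anchors, and a regular expression can catch `<i>` and `<em>` regardless of case or attributes. The function name `italics_to_underscores` is made up for this example.

```php
<?php
// Hypothetical helper: underscores for italics, anchors preserved, other tags removed.
function italics_to_underscores(string $html): string
{
    // Replace opening and closing <i>/<em> tags (any case, any attributes) with "_".
    // The \b keeps tags such as <img> from matching.
    $marked = preg_replace('#</?(?:i|em)\b[^>]*>#i', '_', $html);
    // The second argument lists the tags strip_tags should leave in place.
    return strip_tags($marked, '<a>');
}

echo italics_to_underscores('<p>An <i>important</i> <a href="/x">link</a></p>');
// An _important_ <a href="/x">link</a>
```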
You can do this with lynx, using its -stdin and -dump options:
<?php
$descriptorspec = array(
0 => array("pipe", "r"), // stdin is a pipe that the child will read from
1 => array("pipe", "w"), // stdout is a pipe that the child will write to
2 => array("file", "/tmp/htmp2txt.log", "a") // stderr is a file to write to
);
$process = proc_open('lynx -stdin -dump 2>&1', $descriptorspec, $pipes, '/tmp', NULL);
if (is_resource($process)) {
// $pipes now looks like this:
// 0 => writeable handle connected to child stdin
// 1 => readable handle connected to child stdout
// Any error output will be appended to htmp2txt.log
$stdin = $pipes[0];
fwrite($stdin, <<<'EOT'
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>TEST</title>
</head>
<body>
<h1><span>Lorem Ipsum</span></h1>
<h4>"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit..."</h4>
<h5>"There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain..."</h5>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque et sapien ut erat porttitor suscipit id nec dui. Nam rhoncus mauris ac dui tristique bibendum. Aliquam molestie placerat gravida. Duis vitae tortor gravida libero semper cursus eu ut tortor. Nunc id orci orci. Suspendisse potenti. Phasellus vehicula leo sed erat rutrum sed blandit purus convallis.
</p>
<p>
Aliquam feugiat, neque a tempus rhoncus, neque dolor vulputate eros, non pellentesque elit lacus ut nunc. Pellentesque vel purus libero, ultrices condimentum lorem. Nam dictum faucibus mollis. Praesent adipiscing nunc sed dui ultricies molestie. Quisque facilisis purus quis felis molestie ut accumsan felis ultricies. Curabitur euismod est id est pretium accumsan. Praesent a mi in dolor feugiat vehicula quis at elit. Mauris lacus mauris, laoreet non molestie nec, adipiscing a nulla. Nullam rutrum, libero id pellentesque tempus, erat nibh ornare dolor, id accumsan est risus at leo. In convallis felis at eros condimentum adipiscing aliquam nisi faucibus. Integer arcu ligula, porttitor in fermentum vitae, lacinia nec dui.
</p>
</body>
</html>
EOT
);
fclose($stdin);
echo stream_get_contents($pipes[1]);
fclose($pipes[1]);
// It is important that you close any pipes before calling
// proc_close in order to avoid a deadlock
$return_value = proc_close($process);
echo "command returned $return_value\n";
}
Anonymous 7
You can try this function:
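function html2text($Document) {
    // Each pattern in $Rules is paired with the entry at the same index in $Replace.
    $Rules = array ('@<script[^>]*?>.*?</script>@si', // script blocks
                    '@<[\/\!]*?[^<>]*?>@si',          // any remaining tag
                    '@([\r\n])[\s]+@',                // whitespace following a line break
                    '@&(quot|#34);@i',
                    '@&(amp|#38);@i',
                    '@&(lt|#60);@i',
                    '@&(gt|#62);@i',
                    '@&(nbsp|#160);@i',
                    '@&(iexcl|#161);@i',
                    '@&(cent|#162);@i',
                    '@&(pound|#163);@i',
                    '@&(copy|#169);@i',
                    '@&(reg|#174);@i'
    );
    $Replace = array ('',
                      '',
                      '\1',
                      '"',
                      '&',
                      '<',
                      '>',
                      ' ',
                      chr(161),
                      chr(162),
                      chr(163),
                      chr(169),
                      chr(174)
    );
    $Document = preg_replace($Rules, $Replace, $Document);
    // The /e modifier was removed in PHP 7, so remaining numeric entities are
    // decoded with a callback. chr() yields single bytes (ISO-8859-1), not UTF-8.
    return preg_replace_callback('@&#(\d+);@', function ($m) {
        return chr((int) $m[1]);
    }, $Document);
}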
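I could not find an existing solution that fit my case: converting simple HTML emails to simple plain-text files.
I have opened up this repository in the hope that it helps someone. MIT licensed, by the way :)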
https://github.com/RobQuistNL/SimpleHtmlToText
Example:
$myHtml = '<b>This is HTML</b><h1>Header</h1><br/><br/>Newlines';
echo (new Parser())->parseString($myHtml);
Returns:
**This is HTML**
### Header ###
Newlines
function plainText($text)
{
$text = strip_tags($text, '<br><p><li>');
$text = preg_replace ('/<[^>]*>/', PHP_EOL, $text);
return $text;
}
$text = "string 1<br>string 2<br/><ul><li>string 3</li><li>string 4</li></ul><p>string 5</p>";
echo plainText($text);
Output
string 1
string 2
string 3
string 4
string 5