Sam*_*mWM 36 html php string markup
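I want to truncate some text (loaded from a database or a text file), but it contains HTML, so the tags are counted too and less visible text is returned. Truncating can also leave tags unclosed or only partially closed (so Tidy may not work properly, and there is still less content). How can I truncate based on the text alone (probably stopping when a table is reached, since that could cause more complex problems)?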
substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26)."..."
Would result in:
Hello, my <strong>name</st...
What I want is:
Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...
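How can I do this?
My question is about doing this in PHP, but it would be good to know how to do it in C# as well. Either is fine, since I think I could port the method over (unless it is a built-in method).
Also note that I have included an HTML entity, &acute;, which has to be counted as a single character (rather than the 7 characters it takes up in this example).
strip_tags is a fallback, but I would lose formatting and links, and it would still have the problem with HTML entities.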
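For comparison, the strip_tags fallback mentioned above could be sketched roughly as follows. This is only an illustration, not the solution being asked for: truncatePlainText is a hypothetical name, the mbstring extension is assumed to be loaded, and the input is assumed to be UTF-8. Decoding entities before counting makes &acute; count as one character, but all formatting and links are lost.

```php
<?php
// Hypothetical helper (not from the original post): plain-text fallback.
// Assumes UTF-8 input and the mbstring extension.
function truncatePlainText(string $html, int $maxLength, string $end = '...'): string
{
    // Drop all markup, then turn entities such as &acute; into single characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    if (mb_strlen($text, 'UTF-8') <= $maxLength) {
        // Short enough: just re-escape it for output as HTML.
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    }

    // Cut by characters (not bytes), re-escape, and append the end marker.
    return htmlspecialchars(mb_substr($text, 0, $maxLength, 'UTF-8'), ENT_QUOTES, 'UTF-8') . $end;
}

// Prints: Hello, my name is Sam. I´m...
echo truncatePlainText('Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.', 26), "\n";
```

The character count matches the desired output, but the <strong> and <em> markup is gone, which is why this is only a fallback.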
Sør*_*org 47
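Assuming you are using valid XHTML, it is simple to parse the HTML and make sure the tags are handled properly. You only need to keep track of which tags have been opened so far and make sure to close them again "on your way out".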
<?php
header('Content-type: text/plain; charset=utf-8');

function printTruncated($maxLength, $html, $isUtf8=true)
{
    $printedLength = 0;
    $position = 0;
    $tags = array();

    // For UTF-8, we need to count multibyte sequences as one character.
    $re = $isUtf8
        ? '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}'
        : '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}';

    while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = substr($html, $position, $tagPosition - $position);
        if ($printedLength + strlen($str) > $maxLength)
        {
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }

        print($str);
        $printedLength += strlen($str);
        if ($printedLength >= $maxLength) break;

        if ($tag[0] == '&' || ord($tag) >= 0x80)
        {
            // Pass the entity or UTF-8 multibyte sequence through unchanged.
            print($tag);
            $printedLength++;
        }
        else
        {
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/')
            {
                // This is a closing tag.
                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.
                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/')
            {
                // Self-closing tag.
                print($tag);
            }
            else
            {
                // Opening tag.
                print($tag);
                $tags[] = $tagName;
            }
        }

        // Continue after the tag.
        $position = $tagPosition + strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));

    // Close any open tags.
    while (!empty($tags))
        printf('</%s>', array_pop($tags));
}

printTruncated(10, '<b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> world!'); print("\n");
printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); print("\n");
printTruncated(10, "<em><b>Hello</b>w\xC3\xB8rld!</em>"); print("\n");
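Encoding note: The code above assumes the XHTML is UTF-8 encoded. ASCII-compatible single-byte encodings (such as Latin-1) are also supported; just pass false as the third argument. Other multibyte encodings are not supported, though you may hack in support by using mb_convert_encoding to convert to UTF-8 before calling the function, then converting back again in every print statement.
(You should always be using UTF-8, though.)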
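The conversion workaround from the encoding note might look roughly like this. It is a sketch under assumptions rather than part of the answer: runInEncoding is a hypothetical name, the mbstring extension is assumed to be loaded, and output buffering is swapped in so that the conversion back happens once instead of in every print statement.

```php
<?php
// Hypothetical wrapper (not part of the answer above): runs a UTF-8-only
// printing callback on HTML stored in some other encoding and returns the
// callback's output converted back to that encoding.
function runInEncoding(callable $printer, string $html, string $encoding): string
{
    // Convert the input to UTF-8 before handing it to the callback.
    $utf8 = mb_convert_encoding($html, 'UTF-8', $encoding);

    // Capture everything the callback prints.
    ob_start();
    $printer($utf8);
    $output = ob_get_clean();

    // Convert the captured output back to the original encoding in one step.
    return mb_convert_encoding($output, $encoding, 'UTF-8');
}

// Demo with a stand-in callback that prints the first three characters.
$utf16 = mb_convert_encoding('wørld', 'UTF-16LE', 'UTF-8');
$result = runInEncoding(
    function (string $h): void { print(mb_substr($h, 0, 3, 'UTF-8')); },
    $utf16,
    'UTF-16LE'
);
// $result now holds "wør" encoded as UTF-16LE.
```

With the function from this answer, the callback would be something like function ($h) { printTruncated(10, $h); }.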
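Edit: Updated to handle character entities and UTF-8. Fixed a bug where the function would print one character too many if that character was a character entity.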
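I have written a function that truncates HTML as you suggest, but instead of printing the result it keeps it all in a string variable. It handles HTML entities as well.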
/**
 * Truncates HTML and then cleans up the end of it.
 * Truncates by counting characters outside of HTML tags.
 *
 * @author alex lockwood, alex dot lockwood at websightdesign
 *
 * @param string $str the string to truncate
 * @param int $len the number of characters
 * @param string $end the end string for truncation
 * @return string $truncated_html
 */
public static function truncateHTML($str, $len, $end = '…'){
    // Find all tags and entities.
    $tagPattern = '/(<\/?)([\w]*)(\s*[^>]*)>?|&[\w#]+;/i'; // match HTML tags and entities
    preg_match_all($tagPattern, $str, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);
    $i = 0;
    $openTags = array();
    $warnings = array();
    // Loop through each match that starts within $len, add its length to $len,
    // and track open and closed tags.
    // $matches[$i][0] = the whole tag string -- the only applicable field for HTML entities
    // If it is not matching an &htmlentity; the following apply:
    // $matches[$i][1] = the start of the tag, either '<' or '</'
    // $matches[$i][2] = the tag name
    // $matches[$i][3] = the end of the tag
    // $matches[$i][$j][0] = the string
    // $matches[$i][$j][1] = the string offset
    // Check that the match exists before reading its offset.
    while(isset($matches[$i]) && $matches[$i][0][1] < $len){
        $len = $len + strlen($matches[$i][0][0]);
        // An entity displays as a single character, so count it as one.
        if(substr($matches[$i][0][0],0,1) == '&')
            $len = $len - 1;
        // If $matches[$i][2] is undefined then it is an HTML entity; ignore those for tag counting.
        // Also ignore empty/singleton tags for tag counting.
        if(!empty($matches[$i][2][0]) && !in_array($matches[$i][2][0], array('br','img','hr','input','param','link'))){
            // Double check: neither self-closing nor a closing tag.
            if(substr($matches[$i][3][0],-1) != '/' && substr($matches[$i][1][0],-1) != '/')
                $openTags[] = $matches[$i][2][0];
            elseif(end($openTags) == $matches[$i][2][0]){
                array_pop($openTags);
            }else{
                $warnings[] = "html has some tags mismatched in it: $str";
            }
        }
        $i++;
    }
    // Build the closing tags for everything still open, innermost first.
    $closeTagString = '';
    if (!empty($openTags)){
        $openTags = array_reverse($openTags);
        foreach ($openTags as $t){
            $closeTagString .= "</" . $t . ">";
        }
    }
    if(strlen($str) > $len){
        // Find the first space at or after the adjusted length.
        $lastWord = strpos($str, ' ', $len);
        if ($lastWord !== false) {
            // Truncate at that word boundary.
            $str = substr($str, 0, $lastWord);
            // Find the last character.
            $last_character = substr($str, -1, 1);
            // Add the end text, unless the text ends with a full stop; drop a trailing comma.
            $truncated_html = ($last_character == '.' ? $str : ($last_character == ',' ? substr($str, 0, -1) : $str) . $end);
        } else {
            // No space after the cut point: cut at the adjusted length instead.
            $truncated_html = substr($str, 0, $len) . $end;
        }
        // Close any tags that are still open.
        $truncated_html .= $closeTagString;
    }else
        $truncated_html = $str;
    return $truncated_html;
}