截断包含HTML的文本,忽略标记

Question

截断包含HTML的文本,忽略标记

我想截断一些文本(从数据库或文本文件加载),但它包含HTML,因此包含标记,将返回更少的文本.这可能导致标签未被关闭或部分关闭(因此整洁可能无法正常工作且内容仍然较少).如何根据文本进行截断(当你到达表时可能会停止,因为这可能会导致更复杂的问题).

substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26)."..."

Run Code Online (Sandbox Code Playgroud)

会导致:

Hello, my <strong>name</st...

Run Code Online (Sandbox Code Playgroud)

我想要的是:

Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...

Run Code Online (Sandbox Code Playgroud)

我怎样才能做到这一点？

虽然我的问题是如何在PHP中完成它,但是知道如何在C#中执行它会很好...或者应该没问题,因为我认为我可以将方法移植过来(除非它是内置的方法).

另请注意,我已经包含了一个HTML实体´- 必须将其视为单个字符(而不是本示例中的7个字符).

strip_tags 是一个后备,但我会失去格式和链接,它仍然会有HTML实体的问题.

Answer 1

Sør*_*org 47

假设您使用的是有效的XHTML,则解析HTML并确保正确处理标记很简单.您只需跟踪到目前为止已打开的标签,并确保在"出路"时再次关闭它们.

<?php
header('Content-type: text/plain; charset=utf-8');

function printTruncated($maxLength, $html, $isUtf8=true)
{
    $printedLength = 0;
    $position = 0;
    $tags = array();

    // For UTF-8, we need to count multibyte sequences as one character.
    $re = $isUtf8
        ? '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}'
        : '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}';

    while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = substr($html, $position, $tagPosition - $position);
        if ($printedLength + strlen($str) > $maxLength)
        {
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }

        print($str);
        $printedLength += strlen($str);
        if ($printedLength >= $maxLength) break;

        if ($tag[0] == '&' || ord($tag) >= 0x80)
        {
            // Pass the entity or UTF-8 multibyte sequence through unchanged.
            print($tag);
            $printedLength++;
        }
        else
        {
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/')
            {
                // This is a closing tag.

                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.

                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/')
            {
                // Self-closing tag.
                print($tag);
            }
            else
            {
                // Opening tag.
                print($tag);
                $tags[] = $tagName;
            }
        }

        // Continue after the tag.
        $position = $tagPosition + strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));

    // Close any open tags.
    while (!empty($tags))
        printf('</%s>', array_pop($tags));
}


printTruncated(10, '<b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> world!'); print("\n");

printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); print("\n");

printTruncated(10, "<em><b>Hello</b>&#20;w\xC3\xB8rld!</em>"); print("\n");

Run Code Online (Sandbox Code Playgroud)

编码注释:上面的代码假设XHTML是UTF-8编码的.也支持ASCII兼容的单字节编码(例如Latin-1),只false作为第三个参数传递.不支持其他多字节编码,但您可能会mb_convert_encoding在调用函数之前使用转换为UTF-8,然后在每个print语句中再次转换回来支持.

(不过你应该总是使用UTF-8.)

编辑:更新以处理字符实体和UTF-8.修复了如果该字符是字符实体,函数将打印一个字符太多的错误.

Answer 2

alo*_*d05 5

我已经编写了一个按照你的建议截断HTML的函数,但不是将其打印出来,而是将它全部保存在字符串变量中.处理HTML实体.

 /**
     *  function to truncate and then clean up end of the HTML,
     *  truncates by counting characters outside of HTML tags
     *  
     *  @author alex lockwood, alex dot lockwood at websightdesign
     *  
     *  @param string $str the string to truncate
     *  @param int $len the number of characters
     *  @param string $end the end string for truncation
     *  @return string $truncated_html
     *  
     *  **/
        public static function truncateHTML($str, $len, $end = '&hellip;'){
            //find all tags
            $tagPattern = '/(<\/?)([\w]*)(\s*[^>]*)>?|&[\w#]+;/i';  //match html tags and entities
            preg_match_all($tagPattern, $str, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER );
            //WSDDebug::dump($matches); exit; 
            $i =0;
            //loop through each found tag that is within the $len, add those characters to the len,
            //also track open and closed tags
            // $matches[$i][0] = the whole tag string  --the only applicable field for html enitities  
            // IF its not matching an &htmlentity; the following apply
            // $matches[$i][1] = the start of the tag either '<' or '</'  
            // $matches[$i][2] = the tag name
            // $matches[$i][3] = the end of the tag
            //$matces[$i][$j][0] = the string
            //$matces[$i][$j][1] = the str offest

            while($matches[$i][0][1] < $len && !empty($matches[$i])){

                $len = $len + strlen($matches[$i][0][0]);
                if(substr($matches[$i][0][0],0,1) == '&' )
                    $len = $len-1;


                //if $matches[$i][2] is undefined then its an html entity, want to ignore those for tag counting
                //ignore empty/singleton tags for tag counting
                if(!empty($matches[$i][2][0]) && !in_array($matches[$i][2][0],array('br','img','hr', 'input', 'param', 'link'))){
                    //double check 
                    if(substr($matches[$i][3][0],-1) !='/' && substr($matches[$i][1][0],-1) !='/')
                        $openTags[] = $matches[$i][2][0];
                    elseif(end($openTags) == $matches[$i][2][0]){
                        array_pop($openTags);
                    }else{
                        $warnings[] = "html has some tags mismatched in it:  $str";
                    }
                }


                $i++;

            }

            $closeTags = '';

            if (!empty($openTags)){
                $openTags = array_reverse($openTags);
                foreach ($openTags as $t){
                    $closeTagString .="</".$t . ">"; 
                }
            }

            if(strlen($str)>$len){
                // Finds the last space from the string new length
                $lastWord = strpos($str, ' ', $len);
                if ($lastWord) {
                    //truncate with new len last word
                    $str = substr($str, 0, $lastWord);
                    //finds last character
                    $last_character = (substr($str, -1, 1));
                    //add the end text
                    $truncated_html = ($last_character == '.' ? $str : ($last_character == ',' ? substr($str, 0, -1) : $str) . $end);
                }
                //restore any open tags
                $truncated_html .= $closeTagString;


            }else
            $truncated_html = $str;


            return $truncated_html; 
        }

Run Code Online (Sandbox Code Playgroud)

归档时间：	16 年，6 月前
查看次数：	29038 次
最近记录：	7 年，12 月前