Search for 2 phrases in HTML (ignoring all tags) and strip everything else

Question

I have HTML code stored in a string, for example:

$html = '
        <html>
        <body>
        <p>Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>.</p>
        </body>
        </html>
        ';

Then I have two sentences stored in variables:

$begin = 'Hello 進撃の巨人!';
$end = 'Lorem ipsum.';

I want to search $html for these two sentences and strip everything before and after them, so that $html becomes:

$html = 'Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>.';

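How can I achieve this? Note that the $begin and $end variables contain no HTML tags, but the corresponding sentences in $html most likely do, as shown above.

Maybe a regex approach?

What I have tried so far

A strpos() approach. The problem is that $html contains tags inside the sentences, so the $begin and $end sentences do not match. I could run strip_tags($html) before strpos(), but then I would obviously end up with $html without its tags.
Searching for only part of a variable, such as Hello, but that is never safe and would give many matches.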
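
The first attempt can be reproduced with a minimal sketch. The sample string below is a simplified ASCII stand-in for the question's data, chosen only for illustration:

```php
<?php
// Simplified stand-in for the question's data (an assumption for illustration).
$html  = '<p>Hello <em>World</em>!</p>';
$begin = 'Hello World!';

// The phrase is interrupted by <em> tags, so a direct search fails.
$rawPos = strpos($html, $begin);

// After stripping tags the phrase is found, but the offset refers to the
// tag-free text and cannot be used directly to cut the original HTML.
$plain    = strip_tags($html);
$plainPos = strpos($plain, $begin);

var_dump($rawPos);   // bool(false)
var_dump($plainPos); // int(0)
```

This is why the answers below either make the pattern tolerate tags or translate offsets from the stripped text back to the original HTML.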

Answer 1

Wik*_*żew 12

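Here is a short and, I believe, working solution based on a lazy dot matching regex (it can be improved by writing a longer, unrolled regex, but it should be sufficient unless you have really large chunks of text).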

$html = "<html>\n<body>\n<p><p>H<div>ello</div><script></script> <em>進&nbsp;&nbsp;&nbsp;撃の巨人</em>!</p>\nrandom code\nrandom code\n<p>Lorem <span>ipsum<span>.</p>\n</body>\n </html>";
$begin = 'Hello     進撃の巨人!';
$end = 'Lorem ipsum.';
$begin = preg_replace_callback('~\s++(?!\z)|(\s++\z)~u', function ($m) { return !empty($m[1]) ? '' : ' '; }, $begin);
$end = preg_replace_callback('~\s++(?!\z)|(\s++\z)~u', function ($m) { return !empty($m[1]) ? '' : ' '; }, $end);
$begin_arr = preg_split('~(?=\X)~u', $begin, -1, PREG_SPLIT_NO_EMPTY);
$end_arr = preg_split('~(?=\X)~u', $end, -1, PREG_SPLIT_NO_EMPTY);
$reg = "(?s)(?:<[^<>]+>)?(?:&#?\\w+;)*\\s*" .  implode("", array_map(function($x, $k) use ($begin_arr) { return ($k < count($begin_arr) - 1 ? preg_quote($x, "~") . "(?:\s*(?:<[^<>]+>|&#?\\w+;))*" : preg_quote($x, "~"));}, $begin_arr, array_keys($begin_arr)))
        . "(.*?)" . 
        implode("", array_map(function($x, $k) use ($end_arr) { return ($k < count($end_arr) - 1 ? preg_quote($x, "~") . "(?:\s*(?:<[^<>]+>|&#?\\w+;))*" : preg_quote($x, "~"));}, $end_arr, array_keys($end_arr))); 
echo $reg .PHP_EOL;
preg_match('~' . $reg . '~u', $html, $m);
print_r($m[0]);

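See the IDEONE demo

Algorithm:

Create a dynamic regex pattern by splitting the delimiter strings into single graphemes (since these can be Unicode characters, I suggest using preg_split('~(?<!^)(?=\X)~u', $end)) and imploding them back with an optional tag-matching pattern, (?:<[^<>]+>)?, added in between.
Then, (?s) enables DOTALL mode, in which . matches any character including a newline, and .*? matches 0+ characters between the leading and the trailing delimiter.

Regex details:

'~(?<!^)(?=\X)~u' matches every position before each grapheme, except at the start of the string
(sample final regex) (?s)(?:<[^<>]+>)?(?:&#?\w+;)*\s*H(?:\s*(?:<[^<>]+>|&#?\w+;))*e(?:\s*(?:<[^<>]+>|&#?\w+;))*l(?:\s*(?:<[^<>]+>|&#?\w+;))*l(?:\s*(?:<[^<>]+>|&#?\w+;))*o(?:\s*(?:<[^<>]+>|&#?\w+;))* (?:\s*(?:<[^<>]+>|&#?\w+;))*進(?:\s*(?:<[^<>]+>|&#?\w+;))*撃(?:\s*(?:<[^<>]+>|&#?\w+;))*の(?:\s*(?:<[^<>]+>|&#?\w+;))*巨(?:\s*(?:<[^<>]+>|&#?\w+;))*人(?:\s*(?:<[^<>]+>|&#?\w+;))*\!(?:\s*(?:<[^<>]+>|&#?\w+;))* + (.*?) + L(?:\s*(?:<[^<>]+>|&#?\w+;))*o(?:\s*(?:<[^<>]+>|&#?\w+;))*r(?:\s*(?:<[^<>]+>|&#?\w+;))*e(?:\s*(?:<[^<>]+>|&#?\w+;))*m(?:\s*(?:<[^<>]+>|&#?\w+;))* (?:\s*(?:<[^<>]+>|&#?\w+;))*i(?:\s*(?:<[^<>]+>|&#?\w+;))*p(?:\s*(?:<[^<>]+>|&#?\w+;))*s(?:\s*(?:<[^<>]+>|&#?\w+;))*u(?:\s*(?:<[^<>]+>|&#?\w+;))*m(?:\s*(?:<[^<>]+>|&#?\w+;))*\. - the leading and trailing delimiters, with optional subpatterns for tag matching, and (.*?) in between (capturing may not be necessary).
~u - this modifier is required because Unicode strings are being processed.
Update: to account for 1+ spaces, any space in the begin and end patterns can be replaced with a \s+ subpattern, which matches 1+ whitespace characters of any kind in the input string.
Update 2: the auxiliary $begin = preg_replace('~\s+~u', ' ', $begin); and $end = preg_replace('~\s+~u', ' ', $end); calls are necessary to account for 1+ spaces in the delimiter strings.
To account for HTML entities, another optional subpattern is added: &#?\\w+;, which also matches &nbsp; and &#123;-like entities. It is preceded by \s* to match optional whitespace, and the whole group is quantified with * (zero or more occurrences).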
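
The pattern-building step can be condensed into a small helper. The following is a simplified sketch, not the answer's exact code: the name phrasePattern is hypothetical, the phrase is split by code point rather than by grapheme (\X), and the sample HTML is a reduced ASCII version of the question's input.

```php
<?php
// Hypothetical helper (not part of the answer): builds a tag-tolerant
// pattern fragment for a single phrase.
function phrasePattern(string $phrase): string
{
    // Split into characters; the u flag makes the split UTF-8 aware.
    $chars = preg_split('//u', $phrase, -1, PREG_SPLIT_NO_EMPTY);
    // Between two characters, allow any number of tags or entities,
    // each optionally preceded by whitespace.
    $gap = '(?:\s*(?:<[^<>]+>|&#?\w+;))*';
    $parts = array_map(function ($c) {
        // A whitespace character in the phrase matches 1+ whitespace characters.
        return preg_match('/^\s$/u', $c) ? '\s+' : preg_quote($c, '~');
    }, $chars);
    return implode($gap, $parts);
}

$html = "<p>Hello <em>World</em>!</p>\nrandom code\n<p>Lorem <span>ipsum</span>.</p>";

// s = DOTALL so that .*? can cross newlines, u = Unicode mode.
$re = '~' . phrasePattern('Hello World!') . '(.*?)' . phrasePattern('Lorem ipsum.') . '~su';
preg_match($re, $html, $m);
echo $m[0], PHP_EOL;
```

With this input the match starts at "Hello" and ends at the period after "ipsum", with all tags in between preserved.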

Answer 2

Dáv*_*áth 8

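I really wanted to write a regex solution, but I was preceded by some nice and complex ones. So here is a non-regex solution.

Short explanation: the main problem is keeping the HTML tags. If the HTML tags were stripped, we could search the text easily. So: strip them! We can easily search the stripped content and determine the substring we want to cut. Then, try to cut this substring out of the HTML while keeping the tags.

Advantages:

Searching is easy and independent of the HTML; you can also search with a regex if needed
The requirements are extensible: you can easily add full multibyte support, entity support, whitespace collapsing, and so on
Relatively fast (though a direct regex may be faster)
Does not touch the original HTML, and can be adapted to other markup languages

A static utility class for this approach: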

class HtmlExtractUtil
{

    const FAKE_MARKUP = '<>';
    const MARKUP_PATTERN = '#<[^>]+>#u';

    static public function extractBetween($html, $startTextToFind, $endTextToFind)
    {
        $strippedHtml = preg_replace(self::MARKUP_PATTERN, '', $html);
        $startPos = strpos($strippedHtml, $startTextToFind);
        $lastPos = strrpos($strippedHtml, $endTextToFind);

        if ($startPos === false || $lastPos === false) {
            return "";
        }

        $endPos = $lastPos + strlen($endTextToFind);
        if ($endPos <= $startPos) {
            return "";
        }

        return self::extractSubstring($html, $startPos, $endPos);
    }

    static public function extractSubstring($html, $startPos, $endPos)
    {
        preg_match_all(self::MARKUP_PATTERN, $html, $matches, PREG_OFFSET_CAPTURE);
        $start = -1;
        $end = -1;
        $previousEnd = 0;
        $stripPos = 0;
        $matchArray = $matches[0];
        $matchArray[] = [self::FAKE_MARKUP, strlen($html)];
        foreach ($matchArray as $match) {
            $diff = $previousEnd - $stripPos;
            $textLength = $match[1] - $previousEnd;
            if ($start == (-1)) {
                if ($startPos >= $stripPos && $startPos < $stripPos + $textLength) {
                    $start = $startPos + $diff;
                }
            }
            if ($end == (-1)) {
                if ($endPos > $stripPos && $endPos <= $stripPos + $textLength) {
                    $end = $endPos + $diff;
                    break;
                }
            }
            $tagLength = strlen($match[0]);
            $previousEnd = $match[1] + $tagLength;
            $stripPos += $textLength;
        }

        if ($start == (-1)) {
            return "";
        } elseif ($end == (-1)) {
            return substr($html, $start);
        } else {
            return substr($html, $start, $end - $start);
        }
    }

}

Usage:

$html = '
<html>
<body>
<p>Any string before</p>
<p>Hello <em>進撃の巨人</em>!</p>
random code
random code
<p>Lorem <span>ipsum<span>.</p>
<p>Any string after</p>
</body>
</html>
';
$startTextToFind = 'Hello 進撃の巨人!';
$endTextToFind = 'Lorem ipsum.';

$extractedText = HtmlExtractUtil::extractBetween($html, $startTextToFind, $endTextToFind);

header("Content-type: text/plain; charset=utf-8");
echo $extractedText . "\n";

Answer 3

tri*_*cot 7

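Regular expressions have their limitations when it comes to parsing HTML. Like many have done before me, I will refer to this famous answer.

Potential problems when relying on regular expressions

For example, imagine this tag appearing in the HTML before the part that must be extracted:

<p attr="Hello 進撃の巨人!">This comes before the match</p>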

Many of the regex solutions will stumble over this and return a string that starts in the middle of this opening p tag.
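
A minimal sketch of this failure mode follows. The pattern is a deliberately simplified tag-tolerant regex written for illustration (it is not taken from any of the other answers), and the sample uses ASCII text in place of the question's phrase:

```php
<?php
// The phrase appears first inside an attribute value, then in real content.
$html = '<p attr="Hello World!">This comes before the match</p>'
      . '<p>Hello <em>World</em>!</p>';

// Simplified tag-tolerant pattern (an assumption for illustration).
$re = '~Hello\s*(?:<[^<>]+>)*World(?:<[^<>]+>)*!~';

preg_match($re, $html, $m, PREG_OFFSET_CAPTURE);
$matchStart  = $m[0][1];            // byte offset of the leftmost match
$firstTagEnd = strpos($html, '>');  // end of the opening <p ...> tag

var_dump($matchStart);                // int(9)
var_dump($matchStart < $firstTagEnd); // bool(true): the match begins inside the tag
```

Because the regex engine reports the leftmost match, the attribute value wins over the intended text in the second paragraph.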

Or consider a comment inside the HTML section that must be matched:

<!-- Next paragraph will display "Lorem ipsum." -->

Or some loose less-than and greater-than signs appear (let's say in a comment or an attribute value):

<!-- Next paragraph will display >-> << Lorem ipsum. >> -->
<p data-attr="->->->" class="myclass">

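What will those regexes do with that?

These are just examples... there are countless other situations that pose problems for regex-based solutions.

There are more reliable ways to parse HTML.

Load the HTML into a DOM

I will suggest here a solution based on the DOMDocument interface, using this algorithm:

Get the text content of the HTML document and identify the two offsets where the two substrings (begin/end) are located.
Then go through the DOM text nodes, keeping track of the offsets these nodes span. In the nodes where either of the two boundary offsets is crossed, a predefined delimiter (|) is inserted. That delimiter should not occur in the HTML string, so it is doubled (||, ||||, ...) until that condition is met.
Finally, split the HTML representation by this delimiter and extract the middle part as the result.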
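
The delimiter and split steps on their own can be sketched as follows. This is a simplified illustration that works on a plain string rather than on DOM text nodes; the helper name findUnusedDelimiter and the sample data are assumptions made for the example.

```php
<?php
// Hypothetical helper: doubles the delimiter until it no longer occurs.
function findUnusedDelimiter(string $html, string $seed = '|'): string
{
    $delim = $seed;
    while (strpos($html, $delim) !== false) {
        $delim .= $delim; // | -> || -> |||| ...
    }
    return $delim;
}

$html  = '<p>skip Hello | world. skip</p>';
$delim = findUnusedDelimiter($html); // "|" occurs in $html, so "||" is used

// Boundary offsets of the part to keep.
$from = strpos($html, 'Hello');
$to   = strpos($html, 'world.') + strlen('world.');

// Insert the end marker first so that the start offset stays valid.
$marked = substr_replace($html, $delim, $to, 0);
$marked = substr_replace($marked, $delim, $from, 0);

// The middle piece is the text between the two markers.
$middle = explode($delim, $marked)[1];
echo $middle, PHP_EOL; // Hello | world.
```

In the DOM-based version the markers are inserted into text nodes instead, so the serialized HTML between them keeps its tags.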

Here is the code:

function extractBetween($html, $begin, $end) {
    $dom = new DOMDocument();
    // Load HTML in DOM, making sure it supports UTF-8; double HTML tags are no problem
    $dom->loadHTML('<html><head>
            <meta http-equiv="content-type" content="text/html; charset=utf-8">
        </head></html>' . $html);
    // Get complete text content
    $text = $dom->textContent;
    // Get positions of the beginning/ending text; exit if not found.
    if (($from = strpos($text, $begin)) === false) return false;
    if (($to = strpos($text, $end, $from + strlen($begin))) === false) return false;
    $to += strlen($end);
    // Define a non-occurring delimiter by repeating `|` enough times:
    for ($delim = '|'; strpos($html, $delim) !== false; $delim .= $delim);
    // Use XPath to traverse the DOM
    $xpath = new DOMXPath($dom);
    // Go through the text nodes keeping track of total text length.
    // When exceeding one of the two offsets, inject a delimiter at that position.
    $pos = 0;
    foreach($xpath->evaluate("//text()") as $node) {
        // Add length of node's text content to total length
        $newpos = $pos + strlen($node->nodeValue);
        while ($newpos > $from || ($from === $to && $newpos === $from)) {
            // The beginning/ending text starts/ends somewhere in this text node.
            // Inject the delimiter at that position:
            $node->nodeValue = substr_replace($node->nodeValue, $delim, $from - $pos, 0);
            // If a delimiter was inserted at both beginning and ending texts,
            // then get the HTML and return the part between the delimiters
            if ($from === $to) return explode($delim, $dom->saveHTML())[1];
            // Delimiter was inserted at beginning text. Now search for ending text
            $from = $to;
        }
        $pos = $newpos;
    }
}

You would call it like this:

// Sample input data
$html = '
        <html>
        <body>
        <p>This comes before the match</p>
        <p>Hey! Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>. la la la</p>
        <p>This comes after the match</p>
        </body>
        </html>
        ';

$begin = 'Hello 進撃の巨人!';
$end = 'Lorem ipsum.';

// Call
$html = extractBetween($html, $begin, $end);

// Output result
echo $html;

Output:

Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>.

You will probably find this code easier to maintain than the regex alternatives.

See it run on eval.in.

Answer 4

Pau*_*aul 5

This is probably far from the best solution, but I love racking my brain over "riddles" like this, so here is my approach.

<?php
$subject = ' <html> 
<body> 
<p>He<i>l</i>lo <em>Lydia</em>!</p> 
random code 
random code 
<p>Lorem <span>ipsum</span>.</p> 
</body> 
</html>';

$begin = 'Hello Lydia!';
$end = 'Lorem ipsum.';

$begin_chars = str_split($begin);
$end_chars = str_split($end);

$begin_re = '';
$end_re = '';

foreach ($begin_chars as $c) {
    if ($c == ' ') {
        $begin_re .= '(\s|(<[a-z/]+>))+';
    }
    else {
        $begin_re .= $c . '(<[a-z/]+>)?';
    }
}
foreach ($end_chars as $c) {
    if ($c == ' ') {
        $end_re .= '(\s|(<[a-z/]+>))+';
    }
    else {
        $end_re .= $c . '(<[a-z/]+>)?';
    }
}

$re = '~(.*)((' . $begin_re . ')(.*)(' . $end_re . '))(.*)~ms';

$result = preg_match( $re, $subject , $matches );
$start_tag = preg_match( '~(<[a-z/]+>)$~', $matches[1] , $stmatches );

echo $stmatches[1] . $matches[2];

This outputs:

<p>He<i>l</i>lo <em>Lydia</em>!</p> 
random code 
random code 
<p>Lorem <span>ipsum</span>.</p>

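This works for this case, but I think it would need some more logic to escape regex special characters such as periods.

In general, what this snippet does:

Split the strings into arrays, with each array value representing a single character. This needs to be done because Hello needs to match Hel<i>l</i>o as well.
For this, in the regex part, (<[a-z/]+>)? is inserted after every character, with a special case for space characters.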
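
The escaping concern can be addressed with preg_quote(). The following is a minimal sketch of that idea, not the answer's code: the function name buildPhraseRegex is hypothetical, and non-capturing groups are used for brevity. Like the original, it relies on str_split(), which is byte-based, so multibyte phrases would need a different split.

```php
<?php
// Hypothetical variant of the loops above that escapes regex metacharacters.
function buildPhraseRegex(string $phrase): string
{
    $re = '';
    foreach (str_split($phrase) as $c) {
        if ($c === ' ') {
            // A space may be any mix of whitespace and simple tags.
            $re .= '(?:\s|<[a-z/]+>)+';
        } else {
            // preg_quote() escapes characters such as "." so they match literally.
            $re .= preg_quote($c, '~') . '(?:<[a-z/]+>)?';
        }
    }
    return $re;
}

$re = '~' . buildPhraseRegex('Lorem ipsum.') . '~';

$withPeriod    = preg_match($re, '<p>Lorem <span>ipsum</span>.</p>');
$withoutPeriod = preg_match($re, '<p>Lorem <span>ipsum</span>x</p>');

var_dump($withPeriod);    // int(1)
var_dump($withoutPeriod); // int(0): an unescaped "." would have matched the "x"
```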

Answer 5

Kas*_*Lee 4

You can try this regex:

(.*?)  # Data before sentences (to be removed)
(      # Capture Both sentences and text in between
  H.*?e.*?l.*?l.*?o.*?\s    # Hello[space]
  (<.*?>)*                  # Optional Opening Tag(s)
  進.*?撃.*?の.*?巨.*?人.*?   # 進撃の巨人
  (<\/.*?>)*                # Optional Closing Tag(s)
  (.*?)                     # Optional Data in between sentences
  (<.*?>)*                  # Optional Opening Tag(s)
  L.*?o.*?r.*?e.*?m.*?\s    # Lorem[space]
  (<.*?>)*                  # Optional Opening Tag(s)
  i.*?p.*?s.*?u.*?m.*?      # ipsum
)
(.*)   # Data after sentences (to be removed)

Replace with the 2nd capture group.

Live Demo on Regex101

The regex can be shortened to:

(.*?)(H.*?e.*?l.*?l.*?o.*?\s(<.*?>)*進.*?撃.*?の.*?巨.*?人.*?(<\/.*?>)*(.*?)(<.*?>)*L.*?o.*?r.*?e.*?m.*?\s(<.*?>)*i.*?p.*?s.*?u.*?m.*?)(.*)
