Split all HTML tags into an array

Question

Suppose I have the following code:

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Title of the document</title>
</head>    
<body>
<div id="x">Hello</div>
<p>world</p>
<h1>my name</h1>
</body>
</html>

I need to extract all the HTML tags and put them into an array, like this:

'0' => '<!DOCTYPE html>',
'1' => '<html>',
'2' => '<head>',
'3' => '<meta charset="UTF-8">',
'4' => '<title>Title of the document</title>',
'5' => '</head>',
'6' => '<body>',
'7' => '<div id="x">Hello</div>',
'8' => '<p>world</p>',
'9' => '<h1>my name</h1>',
....

In my case, I don't need everything that sits inside each tag; capturing just the opening of each tag would already be good enough.

How can I do this?

Answer 1

Rom*_*est 5

Use the following solution based on the preg_match_all function:

$html_content = '<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Title of the document</title>
</head>    
<body>
<div id="x">Hello</div>
<p>world</p>
<h1>my name</h1>
</body>
</html>';

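// First alternative: an opening tag, optionally followed by plain text and a closing tag.
// Second alternative: a standalone closing tag.
// <!DOCTYPE html> is a document type declaration, not a tag, so it is not matched.
preg_match_all("/\<\w[^<>]*?\>([^<>]+?\<\/\w+?\>)?|\<\/\w+?\>/i", $html_content, $matches);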

print_r($matches[0]);

Output:

Array
(
    [0] => <html>
    [1] => <head>
    [2] => <meta charset="UTF-8">
    [3] => <title>Title of the document</title>
    [4] => </head>
    [5] => <body>
    [6] => <div id="x">Hello</div>
    [7] => <p>world</p>
    [8] => <h1>my name</h1>
    [9] => </body>
    [10] => </html>
)
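
The question notes that capturing only the start of each tag would be enough, and its desired array includes <!DOCTYPE html>. A minimal sketch under that reading follows. It is not part of the original answer: the simplified pattern and the $tags and $opening_tags names are illustrative assumptions. It matches each tag on its own, without the text between tags, and then drops the closing tags.

```php
<?php
$html_content = '<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Title of the document</title>
</head>
<body>
<div id="x">Hello</div>
<p>world</p>
<h1>my name</h1>
</body>
</html>';

// Assumed pattern: "<", an optional "/" or "!", a letter, then anything up to the next ">".
// Known limitation: a ">" inside an attribute value would end the match too early.
preg_match_all('/<[\/!]?[a-z][^<>]*>/i', $html_content, $matches);
$tags = $matches[0];

// Keep only the DOCTYPE and opening tags by dropping entries that start with "</".
$opening_tags = array_values(array_filter($tags, function ($tag) {
    return strpos($tag, '</') !== 0;
}));

print_r($opening_tags);
```

Here $tags holds every tag, opening and closing, in document order, while $opening_tags holds only the DOCTYPE and the opening tags. For markup that is not this regular (comments, scripts, attribute values containing angle brackets), a real HTML parser such as PHP's DOMDocument is likely a safer choice than any regular expression.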
