How to extract img src, title and alt from HTML using PHP?

Sam*_*Sam 143 html php regex html-parsing html-content-extraction

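I would like to create a page where all the images residing on my website are listed with their title and alternative representation.

I have already written a little program to find and load all the HTML files, but now I am stuck on how to extract src, title and alt from this HTML: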

<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />

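I guess this should be done with some regex, but since the order of the attributes may vary, and I need all of them, I don't really know how to parse this in an elegant way (I could do it char by char, but that's painful).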

小智 245

$url="http://example.com";

$html = file_get_contents($url);

$doc = new DOMDocument();
@$doc->loadHTML($html);

$tags = $doc->getElementsByTagName('img');

foreach ($tags as $tag) {
    echo $tag->getAttribute('src');
}
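Since the question asks for title and alt as well as src, here is a minimal sketch of the same DOMDocument approach collecting all three. The sample markup is the tag from the question, the `$images` array is just an illustrative name, and `libxml_use_internal_errors()` is used here as an alternative to the `@` operator for silencing parser warnings.

```php
<?php
// Sample markup taken from the question; in practice $html would come from your own files.
$html = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$images = [];
foreach ($doc->getElementsByTagName('img') as $tag) {
    // getAttribute() returns an empty string when the attribute is missing
    $images[] = [
        'src'   => $tag->getAttribute('src'),
        'title' => $tag->getAttribute('title'),
        'alt'   => $tag->getAttribute('alt'),
    ];
}

print_r($images);
```

Because the parser handles the attributes, their order inside the tag does not matter.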

  • I love how easy this is to read! xpath and regex work too, but they are never quite as easy to read 18 months later. (5 upvotes)

e-s*_*tis 192

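EDIT: now that I know better

Using regexp to solve this kind of problem is a bad idea and is likely to lead to unmaintainable and unreliable code. Better to use an HTML parser.

Solution with regexp

In that case it's better to split the process into two parts:

  • get all the img tags
  • extract their metadata

I will assume your doc is not xHTML strict, so you can't use an XML parser. E.g. with this web page's source code: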

/* preg_match_all match the regexp in all the $html string and output everything as 
an array in $result. "i" option is used to make it case insensitive */

preg_match_all('/<img[^>]+>/i',$html, $result); 

print_r($result);
Array
(
    [0] => Array
        (
            [0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
            [1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
            [2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
            [3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
            [4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />

[...]
        )

)

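Then we get all the img tag attributes with a loop:

$img = array();
// $result[0] holds the full <img ...> matches; $result itself is an array of arrays
foreach ($result[0] as $img_tag)
{
    preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
}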

print_r($img);

Array
(
    [<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
        (
            [0] => Array
                (
                    [0] => src="/Content/Img/stackoverflow-logo-250.png"
                    [1] => alt="logo link to homepage"
                )

            [1] => Array
                (
                    [0] => src
                    [1] => alt
                )

            [2] => Array
                (
                    [0] => "/Content/Img/stackoverflow-logo-250.png"
                    [1] => "logo link to homepage"
                )

        )

    [<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
        (
            [0] => Array
                (
                    [0] => src="/content/img/vote-arrow-up.png"
                    [1] => alt="vote up"
                    [2] => title="This was helpful (click again to undo)"
                )

            [1] => Array
                (
                    [0] => src
                    [1] => alt
                    [2] => title
                )

            [2] => Array
                (
                    [0] => "/content/img/vote-arrow-up.png"
                    [1] => "vote up"
                    [2] => "This was helpful (click again to undo)"
                )

        )

    [<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
        (
            [0] => Array
                (
                    [0] => src="/content/img/vote-arrow-down.png"
                    [1] => alt="vote down"
                    [2] => title="This was not helpful (click again to undo)"
                )

            [1] => Array
                (
                    [0] => src
                    [1] => alt
                    [2] => title
                )

            [2] => Array
                (
                    [0] => "/content/img/vote-arrow-down.png"
                    [1] => "vote down"
                    [2] => "This was not helpful (click again to undo)"
                )

        )

    [<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />] => Array
        (
            [0] => Array
                (
                    [0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                    [1] => alt="gravatar image"
                )

            [1] => Array
                (
                    [0] => src
                    [1] => alt
                )

            [2] => Array
                (
                    [0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                    [1] => "gravatar image"
                )

        )

   [..]
        )

)

Regexps are CPU intensive, so you may want to cache this page. If you have no cache system, you can roll your own by using ob_start and loading/saving from a text file.
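A rough sketch of what such a home-made cache could look like. Everything named here is hypothetical: `render_image_list()` stands in for the regexp work, and the cache file path and the one-hour lifetime are arbitrary choices.

```php
<?php
// Hypothetical stand-in for the expensive regexp work described above.
function render_image_list(): void
{
    echo "<ul><li>/image/fluffybunny.jpg</li></ul>";
}

// Hypothetical helper: returns the cached page if it is fresh, otherwise rebuilds it.
function cached_output(string $cacheFile, int $maxAge, callable $render): string
{
    // Serve the stored copy while it is still younger than $maxAge seconds.
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAge) {
        return file_get_contents($cacheFile);
    }

    ob_start();               // start capturing everything that is echoed
    $render();
    $output = ob_get_clean(); // stop capturing and fetch the buffer

    file_put_contents($cacheFile, $output, LOCK_EX);
    return $output;
}

$cacheFile = sys_get_temp_dir() . '/image-list.cache.html';
echo cached_output($cacheFile, 3600, 'render_image_list');
```

Deleting the cache file forces a rebuild on the next request.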

How does this stuff work?

First, we use preg_match_all, a function that gets every string matching the pattern and outputs it in its third parameter.

The regexps:

<img[^>]+>

We apply it to the whole HTML page. It can be read as: every string that starts with "<img", contains non-">" characters, and ends with a ">".

(alt|title|src)=("[^"]*")

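We apply it successively to each img tag. It can be read as: every string starting with "alt", "title" or "src", then a "=", then a '"', a bunch of stuff that is not '"', and ending with a '"'. The sub-strings between () are isolated as separate groups.

Finally, every time you want to deal with regexps, it is handy to have good tools to test them quickly. Check this online regexp tester.

EDIT: answer to the first comment.

It's true that I did not think about the (hopefully few) people using single quotes.

Well, if you use only ', just replace every " by '.

If you mix both, first you should slap yourself :-), then try using ("|') in place of " and [^ø] in place of [^"].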
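One alternative way to cope with mixed quoting, using a backreference rather than the [^ø] substitution suggested above. This is only a sketch: it assumes every attribute value is quoted with either " or ', the sample tag is made up, and an attribute such as data-src would still be picked up by the \b boundary.

```php
<?php
// Made-up sample tag mixing single and double quotes.
$tag = '<img src=\'/image/fluffybunny.jpg\' title="Harvey the bunny" alt=\'a cute little fluffy bunny\' />';

// Group 1 is the attribute name, group 2 the opening quote, group 3 the value.
// \2 is a backreference that forces the closing quote to match the opening one.
$pattern = '/\b(alt|title|src)\s*=\s*(["\'])(.*?)\2/is';

preg_match_all($pattern, $tag, $matches, PREG_SET_ORDER);

$attributes = [];
foreach ($matches as $m) {
    $attributes[strtolower($m[1])] = $m[3];
}

print_r($attributes);
```

The values come back without their surrounding quotes, which saves a trim step afterwards.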

  • I suggest scrolling down to karim's answer (no regex, clear code) (6 upvotes)

Ste*_*rig 64

Just to give a small example of using PHP's XML functionality for the task:

$doc=new DOMDocument();
$doc->loadHTML("<html><body>Test<br><img src=\"myimage.jpg\" title=\"title\" alt=\"alt\"></body></html>");
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$images=$xml->xpath('//img');
foreach ($images as $img) {
    echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
}

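I used the DOMDocument::loadHTML() method because it can cope with HTML syntax and does not force the input document to be XHTML. Strictly speaking, the conversion to a SimpleXMLElement is not necessary - it just makes using xpath and the xpath results simpler.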
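For comparison, a sketch of the same query without the SimpleXMLElement conversion, using DOMXPath directly on the document. The `$lines` array is only an illustrative name.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br><img src="myimage.jpg" title="title" alt="alt"></body></html>');

$xpath = new DOMXPath($doc);
$lines = [];
// query() returns a DOMNodeList; each matched node here is a DOMElement
foreach ($xpath->query('//img') as $img) {
    $lines[] = $img->getAttribute('src') . ' ' . $img->getAttribute('alt') . ' ' . $img->getAttribute('title');
}

echo implode("\n", $lines), "\n";
```

The trade-off is the more verbose getAttribute() calls in place of the array-style access that SimpleXML offers.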


Dre*_*erx 8

If it's XHTML, which your example is, you need only simpleXML.

<?php
$input = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"/>';
$sx = simplexml_load_string($input);
var_dump($sx);
?>

Output:

object(SimpleXMLElement)#1 (1) {
  ["@attributes"]=>
  array(3) {
    ["src"]=>
    string(22) "/image/fluffybunny.jpg"
    ["title"]=>
    string(16) "Harvey the bunny"
    ["alt"]=>
    string(26) "a cute little fluffy bunny"
  }
}
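To get at the individual values rather than dumping the whole object, array-style access on the element should do. A small sketch, where the variable names `$src`, `$title` and `$alt` are illustrative:

```php
<?php
$input = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"/>';
$sx = simplexml_load_string($input);

// Array-style access returns a SimpleXMLElement; cast it to get a plain string.
$src   = (string) $sx['src'];
$title = (string) $sx['title'];
$alt   = (string) $sx['alt'];

echo "$src | $title | $alt\n";
```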


Bak*_*dan 5

The script's loop has to be written like this:

foreach( $result[0] as $img_tag)

because preg_match_all returns an array of arrays.


WNR*_*erg 5

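I used preg_match to do it.

In my case, I had a string containing exactly one <img> tag (and no other markup) that I got from Wordpress, and I was trying to get the src attribute so I could run it through timthumb.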

// get the featured image
$image = get_the_post_thumbnail($photos[$i]->ID);

// get the src for that image
$pattern = '/src="([^"]*)"/';
preg_match($pattern, $image, $matches);
$src = $matches[1];
unset($matches);

To grab the title or the alt instead, simply use $pattern = '/title="([^"]*)"/'; for the title or $pattern = '/alt="([^"]*)"/'; for the alt. Sadly, my regex skills aren't good enough to grab all three (alt/title/src) in one pass.
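For what it's worth, a single preg_match_all call can probably collect all three in one pass. This is a sketch under assumptions: `$image` below is a made-up stand-in for the string returned by get_the_post_thumbnail(), and the values are assumed to be double-quoted.

```php
<?php
// Hypothetical sample markup standing in for the Wordpress thumbnail string.
$image = '<img width="150" height="150" src="/uploads/bunny.jpg" alt="a bunny" title="Harvey" />';

// The lookbehind (?<![\w-]) keeps names such as "data-src" from matching.
preg_match_all('/(?<![\w-])(src|alt|title)="([^"]*)"/i', $image, $m);

// $m[1] holds the attribute names, $m[2] the values; pair them up.
$attrs = array_combine(array_map('strtolower', $m[1]), $m[2]);

$src   = $attrs['src']   ?? null;
$alt   = $attrs['alt']   ?? null;
$title = $attrs['title'] ?? null;
```

Any attribute that is absent from the tag simply ends up as null.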

  • Does not work if the img tag attributes are in single quotes; `<img src='image.png'>` (2 upvotes)

Nau*_*hal 5

You can use simplehtmldom. Most jQuery selectors are supported in simplehtmldom. An example is given below:

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';