Sam*_*Sam 143 html php regex html-parsing html-content-extraction
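I would like to create a page that lists every image on my website together with its title and alternative text.
I have already written a small program that finds and loads all the HTML files, but now I am stuck on how to extract src, title and alt from this HTML:
<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
I guess this should be done with a regular expression, but since the order of the attributes can vary and I need all three, I don't know how to parse this elegantly (I could do it character by character, but that's painful).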
小智 245
$url="http://example.com";
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
echo $tag->getAttribute('src');
}
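The question asks for title and alt as well as src. Here is a minimal sketch of how the same DOMDocument loop might collect all three. The `$html` sample is made up for illustration, and `libxml_use_internal_errors()` is used here in place of the `@` operator to keep parser warnings quiet:

```php
<?php
// Sample markup standing in for one of your pages (an assumption for illustration).
$html = '<p><img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />'
      . '<img alt="no title here" src="/image/other.png"></p>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of silencing them with @.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$images = [];
foreach ($doc->getElementsByTagName('img') as $tag) {
    // getAttribute() returns an empty string when the attribute is absent.
    $images[] = [
        'src'   => $tag->getAttribute('src'),
        'title' => $tag->getAttribute('title'),
        'alt'   => $tag->getAttribute('alt'),
    ];
}

print_r($images);
```

The attribute order in the markup does not matter here, which is the part the question found awkward with a regexp.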
e-s*_*tis 192
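Using a regexp for this kind of problem is a bad idea and will likely lead to unmaintainable, unreliable code. Better to use an HTML parser.
If you go the regexp route anyway, it is best to split the process into two parts: first collect all the img tags, then extract the attributes from each one.
I will assume your document is not strict XHTML, so you can't use an XML parser. For example, with this web page's source code: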
/* preg_match_all matches the regexp against the whole $html string and outputs
everything as an array in $result. The "i" option makes it case insensitive. */
preg_match_all('/<img[^>]+>/i',$html, $result);
print_r($result);
Array
(
[0] => Array
(
[0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
[1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
[3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
[4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[...]
)
)
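Then we get all the img tag attributes with a loop:
$img = array();
// $result[0] holds the full matches, i.e. the img tags themselves
foreach ($result[0] as $img_tag)
{
    preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
}
print_r($img);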
Array
(
[<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
(
[0] => Array
(
[0] => src="/Content/Img/stackoverflow-logo-250.png"
[1] => alt="logo link to homepage"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "/Content/Img/stackoverflow-logo-250.png"
[1] => "logo link to homepage"
)
)
[<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-up.png"
[1] => alt="vote up"
[2] => title="This was helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-up.png"
[1] => "vote up"
[2] => "This was helpful (click again to undo)"
)
)
[<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-down.png"
[1] => alt="vote down"
[2] => title="This was not helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-down.png"
[1] => "vote down"
[2] => "This was not helpful (click again to undo)"
)
)
[<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />] => Array
(
[0] => Array
(
[0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => alt="gravatar image"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => "gravatar image"
)
)
[..]
)
)
Regexps are CPU intensive, so you may want to cache this page. If you have no cache system, you can roll your own by using ob_start and loading/saving from a text file.
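As a rough sketch of that home-made cache: the helper name `cached_output()`, the cache file path and the one-hour lifetime below are all assumptions made for illustration, not part of any library.

```php
<?php
/**
 * Hypothetical helper: returns the cached output if the cache file is fresh,
 * otherwise runs $render, captures what it echoes, and saves it to the file.
 */
function cached_output(string $cacheFile, int $ttlSeconds, callable $render): string
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttlSeconds) {
        return file_get_contents($cacheFile);
    }

    ob_start();                 // start capturing everything echoed by $render
    $render();
    $output = ob_get_clean();   // stop capturing and fetch the buffer

    file_put_contents($cacheFile, $output, LOCK_EX);
    return $output;
}

// Usage: the expensive regexp work runs only when the cache is missing or stale.
$cacheFile = sys_get_temp_dir() . '/image-list.cache';
if (is_file($cacheFile)) {
    unlink($cacheFile);         // start the demo from an empty cache
}

$runs = 0;
$render = function () use (&$runs) {
    $runs++;                    // counts how often the page is really built
    echo '<ul><li>/image/fluffybunny.jpg</li></ul>';
};

$first  = cached_output($cacheFile, 3600, $render); // builds and stores the page
$second = cached_output($cacheFile, 3600, $render); // served from the text file
unlink($cacheFile);
```

In a real page you would keep the cache file between requests rather than deleting it; the `unlink()` calls are only there to keep the demo repeatable.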
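First, we use preg_match_all, a function that finds every string matching the pattern and outputs the matches in its third parameter.
The regexps:
<img[^>]+>
We apply it to the whole HTML page. It can be read as: every string that starts with "<img", contains one or more non-">" characters, and ends with ">".
(alt|title|src)=("[^"]*")
We apply it successively to each img tag. It can be read as: every string that starts with "alt", "title" or "src", then an "=", then a '"', a run of characters that are not '"', and ends with a '"'. The parentheses isolate the substrings as capture groups.
Finally, whenever you work with regexps, it is handy to have a good tool to test them quickly. Check this online regexp tester.
EDIT: answer to the first comment.
It's true that I did not think about the (hopefully few) people who use single quotes.
Well, if you use only ', just replace every " with ' in the pattern.
If you mix both, first you should slap yourself :-), then try using ("|') in place of " and [^ø] in place of [^"].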
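One other way to cope with mixed quoting, offered as a sketch rather than as the answer's own method, is to capture the opening quote and use a backreference so that the closing quote must match it. The sample tag below is made up, and unquoted values such as height=32 are still not handled:

```php
<?php
// Sample tag mixing quote styles (made up for illustration).
$tags = [
    '<img src="/a.png" alt=\'single "quoted" alt\' title="It\'s a title">',
];

// Group 1: attribute name. Group 2: the opening quote. Group 3: the value.
// \2 requires the closing quote to be the same character as the opening one.
$pattern = '/\s(alt|title|src)\s*=\s*(["\'])(.*?)\2/is';

$attributes = [];
foreach ($tags as $tag) {
    preg_match_all($pattern, $tag, $m, PREG_SET_ORDER);
    $found = [];
    foreach ($m as $match) {
        $found[strtolower($match[1])] = $match[3];
    }
    $attributes[] = $found;
}

print_r($attributes);
```

Because the value is matched lazily up to the same quote character that opened it, a double quote inside a single-quoted value (or the reverse) no longer cuts the match short.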
Ste*_*rig 64
Just to give a small example of using PHP's XML functionality for the task:
$doc=new DOMDocument();
$doc->loadHTML("<html><body>Test<br><img src=\"myimage.jpg\" title=\"title\" alt=\"alt\"></body></html>");
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$images=$xml->xpath('//img');
foreach ($images as $img) {
echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
}
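I used the DOMDocument::loadHTML() method because it can handle HTML syntax and does not force the input document to be XHTML. Strictly speaking, the conversion to a SimpleXMLElement is not necessary; it just makes using XPath and handling the XPath results simpler.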
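For comparison, here is a sketch of the same query without the SimpleXML conversion, using DOMXPath on the DOMDocument directly. The `$lines` array is only there for illustration:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br><img src="myimage.jpg" title="title" alt="alt"></body></html>');

// Query the DOM tree directly; no SimpleXML conversion involved.
$xpath = new DOMXPath($doc);

$lines = [];
foreach ($xpath->query('//img') as $img) {
    // Each result is a DOMElement, so attributes are read with getAttribute().
    $lines[] = $img->getAttribute('src') . ' ' . $img->getAttribute('alt') . ' ' . $img->getAttribute('title');
}

echo implode("\n", $lines), "\n";
```

The trade-off is slightly more verbose attribute access (`getAttribute('src')` instead of `$img['src']`) in exchange for skipping one conversion step.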
If it's XHTML, as your example is, you only need SimpleXML.
<?php
$input = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"/>';
$sx = simplexml_load_string($input);
var_dump($sx);
?>
Output:
object(SimpleXMLElement)#1 (1) {
["@attributes"]=>
array(3) {
["src"]=>
string(22) "/image/fluffybunny.jpg"
["title"]=>
string(16) "Harvey the bunny"
["alt"]=>
string(26) "a cute little fluffy bunny"
}
}
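The dump shows the attributes are there; to actually use them as strings, something like the following sketch should work. The variable names `$src`, `$title`, `$alt` and `$all` are just illustrative:

```php
<?php
$input = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"/>';
$sx = simplexml_load_string($input);

// Array-style access returns a SimpleXMLElement; cast it to get a plain string.
$src   = (string) $sx['src'];
$title = (string) $sx['title'];
$alt   = (string) $sx['alt'];

// attributes() iterates over every attribute as name => value.
$all = [];
foreach ($sx->attributes() as $name => $value) {
    $all[$name] = (string) $value;
}

print_r($all);
```

Keep in mind that simplexml_load_string() only accepts well-formed XML, so this route fails on ordinary HTML such as an unclosed `<img>` tag.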
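I used preg_match to do it.
In my case, I had a string containing a single <img> tag (and no other markup) that I got from WordPress, and I was trying to get the src attribute so I could run it through timthumb.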
// get the featured image
$image = get_the_post_thumbnail($photos[$i]->ID);
// get the src for that image
$pattern = '/src="([^"]*)"/';
preg_match($pattern, $image, $matches);
$src = $matches[1];
unset($matches);
To grab the title or the alt instead, you can simply use $pattern = '/title="([^"]*)"/'; for the title or $pattern = '/alt="([^"]*)"/'; for the alt. Sadly, my regex skills aren't good enough to grab all three (alt/title/src) in one pass.
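For what it's worth, a single pass is possible with preg_match_all and an alternation. The sketch below assumes double-quoted attribute values, and the `$image` string is a made-up stand-in for what get_the_post_thumbnail() might return:

```php
<?php
// Stand-in for the single <img> tag returned by get_the_post_thumbnail() (made-up markup).
$image = '<img width="150" height="150" src="/wp-content/uploads/bunny.jpg" class="wp-post-image" alt="a bunny" title="Harvey" />';

// One pattern for all three attributes; group 1 is the name, group 2 the value.
// The leading \s keeps attributes such as data-src from matching.
preg_match_all('/\s(src|alt|title)="([^"]*)"/i', $image, $matches);

// Pair each captured name with its value: ['src' => ..., 'alt' => ..., 'title' => ...].
$attrs = array_combine(array_map('strtolower', $matches[1]), $matches[2]);

// Missing attributes come back as null instead of raising a notice.
$src   = $attrs['src']   ?? null;
$alt   = $attrs['alt']   ?? null;
$title = $attrs['title'] ?? null;
```

Since the names are captured along with the values, the order of the attributes in the tag does not matter.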
You can use simplehtmldom. It supports most jQuery selectors. An example is given below:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';