Remotely scrape a page with XPath and get the most relevant title or description for images

Question

stw*_*ite 5 php xpath facebook html-parsing scrape

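What I'm doing is basically the same as what the Tweet button or the Facebook Share/Like button does: scrape a page for a piece of data and its most relevant title. The best example I can think of is when you are on the front page of a site with many articles and click a Facebook Like button. It then picks up the correct information for the post relative to (nearest to) that Like button. Some sites have Open Graph tags and some do not, yet it still works.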

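Since this is done remotely, the only thing I control is which data I target. In this case, the data is images. Rather than only retrieving the contents of the page's <title>, I want to traverse the DOM backward from each image and find the nearest "title". The problem is that not every title appears before its image. Still, the chance of an image appearing after its title seems fairly high in this case. That said, I'd like this to work on almost any site.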
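One way to sketch the backward traversal is XPath's preceding:: axis, which walks back through the document from a context node and skips its ancestors. This is only a minimal sketch: the function name nearest_heading_for_images is hypothetical, and it assumes "nearest" means the closest h1 to h4 that ends before the image starts.

```php
<?php
// Hypothetical helper: map each image's src to the nearest heading that
// precedes it in document order (an assumption about what "nearest" means).
function nearest_heading_for_images(string $html): array
{
    $previous = libxml_use_internal_errors(true); // tolerate malformed real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];

    foreach ($xpath->query('//img[@src]') as $img) {
        // preceding:: is a reverse axis, so [1] is the closest matching
        // heading before the image; ancestors of the image are not included.
        $heading = $xpath->evaluate(
            'string(preceding::*[self::h1 or self::h2 or self::h3 or self::h4][1])',
            $img
        );
        $result[$img->getAttribute('src')] = trim($heading);
    }

    return $result;
}
```

An image with no heading before it maps to an empty string, so the caller still needs a fallback for the case where the title comes after the image.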

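Thoughts:

Find the image's "container", then use the first block of text in it.
Look for blocks of text in elements that have certain classes ("description", "title") or are certain elements (h1, h2, h3, h4).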
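The two ideas above could be combined roughly as follows. This is a sketch under assumptions, not a tested recipe for arbitrary sites: the function name caption_from_container and the $maxLevels limit are hypothetical, and "container" is taken to mean the image's nearest ancestors.

```php
<?php
// Hypothetical helper: climb from the image through a few ancestor
// "containers" and return the first heading or title/description-classed
// text found inside one; otherwise fall back to the parent's first text.
function caption_from_container(DOMXPath $xpath, DOMElement $img, int $maxLevels = 3): string
{
    // Headings, or any element whose class list contains "title" or "description".
    $candidates = './/*[self::h1 or self::h2 or self::h3 or self::h4'
        . ' or contains(concat(" ", normalize-space(@class), " "), " title ")'
        . ' or contains(concat(" ", normalize-space(@class), " "), " description ")]';

    $container = $img->parentNode;

    for ($level = 0; $level < $maxLevels && $container instanceof DOMElement; $level++) {
        foreach ($xpath->query($candidates, $container) as $node) {
            $text = trim($node->textContent);
            if ($text !== '') {
                return $text;
            }
        }
        $container = $container->parentNode; // widen the search by one level
    }

    // No heading or classed element found: use the first non-empty text
    // node inside the image's immediate parent.
    $firstText = $xpath->evaluate('string((.//text()[normalize-space()])[1])', $img->parentNode);

    return trim($firstText);
}
```

Keeping $maxLevels small matters: once the search climbs as far as body, it simply returns the first heading on the page, which is usually not the one that belongs to the image.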

Title fallbacks:

Use Open Graph tags
Use only the <title>
Use only the ALT attribute
Use META tags
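The fallbacks listed above might be chained like this. It is a sketch only: the function name fallback_title is hypothetical, and the order (most specific first) is an assumption rather than something the list prescribes.

```php
<?php
// Hypothetical helper: try the image- and page-level fallbacks in a fixed
// order (the order is an assumption) and return the first non-empty value.
function fallback_title(DOMXPath $xpath, DOMElement $img): string
{
    $expressions = [
        'string(@alt)',                                   // the image's own alt text
        'string(//meta[@property="og:title"]/@content)',  // Open Graph title
        'string(//title)',                                // page <title>
        'string(//meta[@name="description"]/@content)',   // classic META description
    ];

    foreach ($expressions as $expression) {
        // Relative expressions (@alt) are evaluated against the image;
        // the // expressions search the whole document.
        $value = trim($xpath->evaluate($expression, $img));
        if ($value !== '') {
            return $value;
        }
    }

    return '';
}
```

Because every step is just an XPath string, reordering the fallbacks or adding another one only means editing the array.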

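Summary: extracting the images is not the problem; getting relevant titles for them is.

Question: how would you get a relevant title for each image? Perhaps using DOMDocument or XPath?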

Answer 1

Ali*_*xel 1

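Your approach seems good enough. I would just give certain tags/attributes a weight and loop through them with XPath queries until you find one that exists and is not empty. Something like: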

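// $html is assumed to hold the fetched page markup.
libxml_use_internal_errors(true); // real-world HTML is rarely valid
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Candidates in descending weight; XPath positions start at 1, so p[1] is the first <p>.
$candidates = array('@alt', '@title', '../p[1]', '//title');

foreach ($xpath->query('//img[@src]') as $img)
{
    $caption = '';

    foreach ($candidates as $candidate)
    {
        $caption = trim($xpath->evaluate('string(' . $candidate . ')', $img));

        if ($caption !== '')
        {
            break; // first non-empty candidate wins
        }
    }

    echo $img->getAttribute('src'), ': ', $caption, "\n";
}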

A simple XPath example (a function ported from my framework):

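function ph_DOM($html, $xpath = null)
{
    // Already parsed: run the query (if any) against the SimpleXMLElement.
    if (is_object($html) === true)
    {
        if (isset($xpath) === true)
        {
            $html = $html->xpath($xpath);
        }

        return $html;
    }

    else if (is_string($html) === true)
    {
        $dom = new DOMDocument();

        // Collect libxml errors internally instead of emitting warnings on malformed HTML.
        if (libxml_use_internal_errors(true) === true)
        {
            libxml_clear_errors();
        }

        // Stand-in (an assumption) for the framework helper ph()->Text->Unicode->mb_html_entities():
        // encode non-ASCII characters as numeric entities so loadHTML() does not misread UTF-8.
        $html = mb_encode_numericentity($html, array(0x80, 0x10FFFF, 0, 0x1FFFFF), 'UTF-8');

        if ($dom->loadHTML($html) === true)
        {
            return ph_DOM(simplexml_import_dom($dom), $xpath);
        }
    }

    return false;
}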

And the actual usage:

$html = file_get_contents('http://en.wikipedia.org/wiki/Photography');

print_r(ph_DOM($html, '//img')); // all images
print_r(ph_DOM($html, '//img[@src]')); // all images that have a src attribute
print_r(ph_DOM($html, '//img[@src]/..')); // the parent elements of images that have a src
print_r(ph_DOM($html, '//img[@src]/../..')); // their grandparents, and so on...
print_r(ph_DOM($html, '//title')); // the page title

Archived: 13 years, 9 months ago
Views: 1040
Last activity: 13 years, 9 months ago