我一直在尝试使用网络爬虫将站点表文本及其从给定表到(在site1.com中)的链接提取到我的php页面.
但不幸的是,由于php代码中的Array索引输入不正确,因此输出错误.
site1.com
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
<td width="1%" valign="top" class="Title2"> </td>
<td width="65%" valign="top" class="Title2">Subject</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="8%" valign="top" align="Center" class="Title2">Replies</td>
<td width="1%" valign="top" class="Title2"> </td>
<td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
<td width="1%" height="25"> </td>
<td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837110.php" target="_top" class="Links2">Serious dedicated study partner for U World</a> - step12013</td>
<td width="1%" height="25"> </td>
<td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
<td width="1%" height="25"> </td>
<td width="8%" height="25" align="Center" class="FootNotes2">10</td>
<td width="1%" height="25"> </td>
<td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>
Run Code Online (Sandbox Code Playgroud)
PHP.网络爬虫为::
<?php
function get_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
curl_close($ch);
return $result;
}
$returned_content = get_data('http://www.usmleforum.com/forum/index.php?forum=1');
$first_step = explode( '<table class="Table2">' , $returned_content );
$second_step = explode('</table>', $first_step[0]);
$third_step = explode('<tr>', $second_step[1]);
// print_r($third_step);
foreach ($third_step as $key=>$element) {
$child_first = explode( '<td class="FootNotes2"' , $element );
$child_second = explode( '</td>' , $child_first[1] );
$child_third = explode( '<a href=' , $child_second[0] );
$child_fourth = explode( '</a>' , $child_third[0] );
$final = "<a href=".$child_fourth[0]."</a></br>";
?>
<li target="_blank" class="itemtitle">
<?php echo $final?>
</li>
<?php
if($key==10){
break;
}
}
?>
Run Code Online (Sandbox Code Playgroud)
现在上面的PHP代码上的数组索引可能是罪魁祸首.(我猜)如果是这样,有人可以解释一下如何使这项工作.
但是,我对此代码的最终要求是::在第二个文本中使用与之关联的链接获取上述文本.
任何帮助都是赞赏的..
jkr*_*nak 10
您可以使用Symfony的DomCrawler组件之类的现有解决方案,而不是编写自己的解析器解决方案:http://symfony.com/doc/current/components/dom_crawler.html
$crawler = new Crawler($returned_content);
$linkTexts = $crawler->filterXPath('//a')->each(function (Crawler $node, $i) {
return $node->text();
});
Run Code Online (Sandbox Code Playgroud)
或者,如果你想自己遍历DOM树,你可以使用DOMDocument的loadHTML
http://php.net/manual/en/domdocument.loadhtml.php
$document = new DOMDocument();
$document->loadHTML($returned_content);
foreach ($document->getElementsByTagName('a') as $link) {
$text = $link->nodeValue;
}
Run Code Online (Sandbox Code Playgroud)
编辑:
要获取所需的链接,代码假定您有一个$returned_content包含要解析的HTML 的变量.
// creating a new instance of DOMDocument (DOM = Document Object Model)
$domDocument = new DOMDocument();
// save previous libxml error reporting and set error reporting to internal
// to be able to parse not well formed HTML doc
$previousErrorReporting = libxml_use_internal_errors(true);
$domDocument->loadHTML($returned_content);
libxml_use_internal_errors($previousErrorReporting);
$links = [];
/** @var DOMElement $node */
// getting all <a> element from the HTML
foreach ($domDocument->getElementsByTagName('a') as $node) {
$parentNode = $node->parentNode;
// checking if the <a> is under a <td> that has class="FootNotes2"
$isChildOfAFootNotesTd = $parentNode->nodeName === 'td' && $parentNode->getAttribute('class') === 'FootNotes2';
// checking if the <a> has class="Links2"
$isLinkOfLink2Class = $node->getAttribute('class') == 'Links2';
// as I assumed you wanted links from the <td> this check makes sure that both of the above conditions are fulfilled
if ($isChildOfAFootNotesTd && $isLinkOfLink2Class) {
$links[] = [
'href' => $node->getAttribute('href'),
'text' => $parentNode->textContent,
];
}
}
print_r($links);
Run Code Online (Sandbox Code Playgroud)
这将创建一个类似于的数组:
Array
(
[0] => Array
(
[href] => /files/forum/2017/1/837242.php
[text] => Q@Q Drill Time ? - cardio69
)
[1] => Array
(
[href] => /files/forum/2017/1/837356.php
[text] => study partner in Houston - lacy
)
[2] => Array
(
[href] => /files/forum/2017/1/837110.php
[text] => Serious dedicated study partner for U World - step12013
)
...
Run Code Online (Sandbox Code Playgroud)
使用简单 HTML DOM 解析器库,您可以使用以下代码:
<?php
require('simple_html_dom.php'); // you might need to change this, depending on where you saved the library file.
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
foreach($html->find('td.FootNotes2 a') as $element) { // find all <a>-elements inside a <td class="FootNotes2">-element
$element->href = "http://www.usmleforum.com" . $element->href; // you can also access only certain attributes of the elements (e.g. the url).
echo $element.'</br>'; // do something with the elements.
}
?>
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
548 次 |
| 最近记录: |