Tag: domdocument

DOMDocument PHP web scraping

I'd like to know whether there is a way to use the DOM to select elements whose ids are dynamic. All of the ids start with link_(some id).

Example:

<tr id="link_111111">something in here...</tr>

<tr id="link_222222">something in here...</tr>

<tr id="link_333333">something in here...</tr>

<tr id="link_444444">something in here...</tr>

<tr id="link_555555">something in here...</tr>

I'd like to know whether I can get every tr whose id starts with link_, since I don't know the specific ids; they are random.
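
One way to approach this, sketched here rather than taken from an accepted answer, is the XPath 1.0 function starts-with() applied to the id attribute. The helper name findRowIdsByPrefix is hypothetical, and the sketch assumes the prefix contains no single quote.

```php
<?php
// Hypothetical helper: returns the id of every <tr> whose id starts with $prefix.
// Assumes $prefix contains no single quote, since it is embedded in the XPath string.
function findRowIdsByPrefix(string $html, string $prefix = 'link_'): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $ids = [];
    // starts-with() is a standard XPath 1.0 string function
    foreach ($xpath->query("//tr[starts-with(@id, '$prefix')]") as $row) {
        $ids[] = $row->getAttribute('id');
    }
    return $ids;
}
```

Each matched node is a DOMElement, so its cells can be read from it in the same loop.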

php domdocument web-scraping

Use*_*ame

lucky-day

Score: 0, answers: 1, views: 772

XPath query returns no results

I'm trying to return some results from an XPath query, but it doesn't select the elements correctly. I'm using the following code:

public function getTrustPilotReviews($amount)
{
    $trustPilotUrl = 'https://www.trustpilot.co.uk/review/purplegriffon.com';
    $html5 = new HTML5;
    $document = $html5->loadHtml(file_get_contents($trustPilotUrl));
    $document->validateOnParse = true;
    $xpath = new DOMXpath($document);
    $reviewsDomNodeList = $xpath->query('//div[@id="reviews-container"]//div[@itemprop="review"]');
    $reviews = new Collection;

    foreach ($reviewsDomNodeList as $key => $reviewDomElement)
    {
        $xpath = new DOMXpath($reviewDomElement->ownerDocument);

        if ((int) $xpath->query('//*[@itemprop="ratingValue"]')->item($key)->getAttribute('content') >= 4)
        {
            $review = [
                'title'     => 'Test',
                'author'    => $xpath->query('//*[@itemprop="author"]')->item($key)->nodeValue,
                'date'      => $xpath->query('//*[@class="ndate"]')->item($key)->nodeValue,
                'rating'    => $xpath->query('//*[@itemprop="ratingValue"]')->item($key)->nodeValue,
                'body'      => $xpath->query('//*[@itemprop="reviewBody"]')->item($key)->nodeValue,
            ];

            $reviews->add((object) $review);
        }
    }

    return $reviews->take($amount);
}

This query returns nothing:

//div[@id="reviews-container"]//div[@itemprop="review"]

But if I change it to:

//*[@id="reviews-container"]//*[@itemprop="review"]

it partially works, but it does not return the correct results.
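
A possible explanation, assuming HTML5 here is the Masterminds html5-php parser (the question does not say): that parser places elements in the XHTML namespace by default, so an unprefixed //div matches nothing while //* still matches. Separately, a query starting with // always searches the whole document, even when run per review. The sketch below uses a small stand-in document built with loadXML instead of the live page; the function name findReviewAuthors and the prefix h are illustrative.

```php
<?php
const XHTML_NS = 'http://www.w3.org/1999/xhtml';

// Hypothetical helper: collects the author of each review block.
function findReviewAuthors(DOMDocument $document): array
{
    $xpath = new DOMXPath($document);
    // Bind a prefix to the XHTML namespace so element names can be matched.
    $xpath->registerNamespace('h', XHTML_NS);

    $authors = [];
    $reviews = $xpath->query('//h:div[@id="reviews-container"]//h:div[@itemprop="review"]');
    foreach ($reviews as $review) {
        // The leading "." plus the context node keeps the search inside this review.
        $author = $xpath->query('.//*[@itemprop="author"]', $review)->item(0);
        if ($author !== null) {
            $authors[] = trim($author->textContent);
        }
    }
    return $authors;
}

// Stand-in for the parsed page: every element lives in the XHTML namespace.
$document = new DOMDocument();
$document->loadXML(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><div id="reviews-container">'
    . '<div itemprop="review"><span itemprop="author">Alice</span></div>'
    . '<div itemprop="review"><span itemprop="author">Bob</span></div>'
    . '</div></body></html>'
);
```

With a context node, item(0) picks the value belonging to the current review, so indexing by $key into a document-wide result is no longer needed.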

php xpath domdocument domxpath

Gar*_*ine

2015-01-26

Score: 0, answers: 1, views: 1189

Convert XML to an object in PHP, then convert that object back to XML?

Suppose I have the following XML:

$xml = '<?xml version="1.0"?>
<step number="9">
  <s_name>test</s_name>
  <b_sel>12345</b_sel>
  <b_ind>7</b_ind>
</step>';

I want to convert it to an object, but when I run the following step I get a stdClass object like the one below [I assign it to the $stepInformation variable]:

$xml = json_decode(json_encode((array) simplexml_load_string($xml)), 1);

$stepInformation = stdClass Object
(
    [@attributes] => Array
        (
            [number] => 9
        )

    [s_name] => test
    [b_sel] => 12345
    [b_ind] => 7
)

So when I parse this stdClass object in a PHP function

function convertStepInformationToArray($stepInformation)
{
    $dom = new DOMDocument();
    $stepInfo = "{$stepInformation->s_name}{$stepInformation->b_sel}{$stepInformation->b_ind}";
    $dom->loadXML("<document>" . $stepInfo . "</document>");
    $domx = new DOMXPath($dom);
    $entries = $domx->evaluate("//step");
    return $entries;
}

the output I get is

DOMNodeList …
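
Since the rest of the question is cut off, the following is only a sketch of one round trip that avoids the JSON detour: keep the SimpleXMLElement as the object, read or change it, and serialize it again with asXML(). The variable names are illustrative.

```php
<?php
$xml = '<?xml version="1.0"?>
<step number="9">
  <s_name>test</s_name>
  <b_sel>12345</b_sel>
  <b_ind>7</b_ind>
</step>';

// XML -> object: children become properties, attributes use array syntax.
$step = simplexml_load_string($xml);
$name = (string) $step->s_name;
$number = (int) $step['number'];

// Change a value, then object -> XML again.
$step->b_ind = '8';
$roundTripped = $step->asXML();

// If a DOMXPath query is needed, the same tree can be viewed as a DOM document.
$dom = dom_import_simplexml($step)->ownerDocument;
$entries = (new DOMXPath($dom))->query('//step');
```

Because the element name and the number attribute are never thrown away, a query for //step still finds the node after the round trip.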

php xml simplexml domdocument

San*_*dar

2015-04-24

Score: 0, answers: 1, views: 10k

Is it possible to use DOMDocument to create an element whose opening and closing tags are different?

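I know this sounds strange, but I have a special case where I need to create an XML (strictly speaking it won't be an XML document, because of the mismatched closing tag) in which the root opening tag differs from the root closing tag. Like this: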

<OPENING_TAG>
   <ONE></ONE>
   <TWO></TWO>
   <THREE></THREE>
</CLOSING_TAG>

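Again, I know this isn't right, but as I mentioned, this is a special case where the opening and closing tags must be different. How can I achieve this with DOMDocument?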
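
DOMDocument only models well-formed trees, so it cannot represent such a root directly. One workaround, offered as a sketch and not as a DOMDocument feature, is to build a well-formed tree, serialize it, and then swap the final closing tag in the resulting string. The function name buildWithMismatchedRoot is hypothetical.

```php
<?php
// Hypothetical helper: builds <$open>...</$open> with DOMDocument, then
// rewrites the last closing tag to </$close> by plain string manipulation.
function buildWithMismatchedRoot(string $open, string $close, array $children): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $root = $dom->appendChild($dom->createElement($open));
    foreach ($children as $name) {
        $root->appendChild($dom->createElement($name));
    }

    // Serialize only the root; LIBXML_NOEMPTYTAG writes <ONE></ONE> instead of <ONE/>.
    $xml = $dom->saveXML($root, LIBXML_NOEMPTYTAG);

    // Replace the last occurrence of the real closing tag.
    $closing = "</$open>";
    $pos = strrpos($xml, $closing);
    return substr($xml, 0, $pos) . "</$close>" . substr($xml, $pos + strlen($closing));
}
```

The result is a string, not a document; no XML parser will load it again.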

php xml domdocument

Ske*_*tor

lucky-day

Score: 0, answers: 1, views: 28

How to convert an HTML table to JSON in PHP

I'm trying to convert some HTML into an array and then into a JSON string.

I'm working from this reference: https://www.codeproject.com/Tips/1074174/Simple-Way-to-Convert-HTML-Table-Data-into-PHP-Arr

Here is the basic table/HTML that I want to convert to JSON.

<table class="table-list table table-responsive table-striped">
<thead>
    <tr>
        <th class="coll-1 name">name</th>
        <th class="coll-2">height</th>
        <th class="coll-3">weight</th>
        <th class="coll-date">date</th>
        <th class="coll-4"><span class="info">info</span></th>
        <th class="coll-5">country</th>
    </tr>
</thead>
<tbody>
<tr>
    <td class="coll-1 name">
        <a href="/username/Jhon Doe/" class="icon"><i class="flaticon-user"></i></a>
        <a href="/username/Jhon Doe/">Jhon Doe</a>
    </td>
    <td class="coll-2 height">45</td>
    <td class="coll-3 weight">50</td>
    <td class="coll-date">9am May. 16th</td>
    <td class="coll-4 size mob-info">abcd</td>
    <td class="coll-5 country"><a href="/country/CA/">CA</a></td>
</tr>
<tr>
    <td class="coll-1 name">
        <a href="/username/Kasim Shk/" class="icon"><i class="flaticon-user"></i></a>
        <a href="/username/Kasim Shk/">Kasim Shk</a> …
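
A minimal sketch of one way to do the conversion, assuming a table with a thead row of th cells and tbody rows of td cells like the one above: read the header texts, then pair them with the cell texts of each row. The function name htmlTableToJson is hypothetical, and only the visible text of each cell is kept (links and icons are dropped).

```php
<?php
// Hypothetical helper: turns the first-level header/body cells of a table into JSON.
function htmlTableToJson(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world markup often triggers warnings
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $headers = [];
    foreach ($xpath->query('//table//thead//th') as $th) {
        $headers[] = trim($th->textContent);
    }

    $rows = [];
    foreach ($xpath->query('//table//tbody//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            // Collapse the whitespace left behind by nested tags and indentation.
            $cells[] = trim(preg_replace('/\s+/', ' ', $td->textContent));
        }
        // Skip rows whose cell count does not line up with the header.
        if (count($cells) === count($headers)) {
            $rows[] = array_combine($headers, $cells);
        }
    }
    return json_encode($rows);
}
```

Keying each row by header name makes the JSON self-describing; a plain list of lists would work too if the column order is fixed.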

html php parsing json domdocument

Cha*_*jay

2018-05-26

Score: 0, answers: 1, views: 10k

How to get all elements by id with PHP DOM

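I tried to print them using a loop, but it doesn't print the next element.
Please give me a simple example that explains it best, using PHP DOM.
Secondly, can I get the Xth element with that id?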

Update

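Thanks. I had made a mistake by creating several elements with the same id, so I changed it to a class (sorry, I'm new to programming, and thanks for the encouragement). So could you please tell me how to extract all elements with the same class name, and then how to get the Xth element with that class name from the document?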
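
A short sketch of one common approach: DOMDocument has no class lookup of its own, but an XPath expression can match a class token, and DOMNodeList::item() returns the Xth match (zero-based). The function name findByClass and the sample markup are illustrative, and the class name is assumed to contain no quote characters.

```php
<?php
// Hypothetical helper: all elements whose class attribute contains $class as a whole token.
function findByClass(DOMDocument $dom, string $class): DOMNodeList
{
    $xpath = new DOMXPath($dom);
    // Padding with spaces makes "item" match class="item big" but not class="items".
    return $xpath->query(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]"
    );
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<div class="item">one</div><p class="item big">two</p>'
    . '<span class="items">three</span><div class="item">four</div>'
);

$items = findByClass($dom, 'item');

// Loop over every match.
$texts = [];
foreach ($items as $node) {
    $texts[] = $node->textContent;
}

// item() is zero-based, so index 2 is the third match; it returns null past the end.
$third = $items->item(2);
```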

php dom getelementbyid domdocument

kri*_*hna

2013-06-24

Score: -1, answers: 1, views: 20k

Split all HTML tags into an array

Suppose I have the code below:

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Title of the document</title>
</head>    
<body>
<div id="x">Hello</div>
<p>world</p>
<h1>my name</h1>
</body>
</html>

I need to extract all the HTML tags and put them into an array, like this:

'0' => '<!DOCTYPE html>',
'1' => '<html>',
'2' => '<head>',
'3' => '<meta charset="UTF-8">',
'4' => '<title>Title of the document</title>',
'5' => '</head>',
'6' => '<body>',
'7' => '<div id="x">Hello</div>',
'8' => '<p>world</p>',
'9' => '<h1>my name</h1>',
....

In my case I don't need everything inside the tags; capturing just the opening of each tag would already be good enough.

How can I do this?
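
Taking the "opening tags only" relaxation at face value, one possible sketch walks every element in document order and rebuilds its opening tag from the node name and attributes. The function name listOpeningTags is hypothetical; the doctype and closing tags are not elements, so they are not part of the result.

```php
<?php
// Hypothetical helper: rebuilds the opening tag of every element, in document order.
function listOpeningTags(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // silence warnings about unknown or HTML5 tags
    $dom->loadHTML($html);
    libxml_clear_errors();

    $tags = [];
    foreach ($dom->getElementsByTagName('*') as $element) {
        $tag = '<' . $element->nodeName;
        foreach ($element->attributes as $attribute) {
            $tag .= sprintf(
                ' %s="%s"',
                $attribute->nodeName,
                htmlspecialchars($attribute->nodeValue)
            );
        }
        $tags[] = $tag . '>';
    }
    return $tags;
}
```

The tags are reconstructed from the parsed tree, so quoting and spacing are normalized and may differ from the original source text.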

php domdocument

Lac*_*ilm

lucky-day

Score: -1, answers: 1, views: 2693

PHP regex to match all img tags except those with a certain src

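I'm new to PHP and have made some modifications to a file in a CMS written in PHP. I modified a function that grabs the first <img> tag in the page source so that it grabs a random one from that source instead.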

The regex used to match the source is:

$regex = '/<' . $tag . '\\b[^>]*>/i';

where $tag simply contains the string img.

However, I noticed that the source contains images whose src attribute contains "1px.gif", and I don't want to match those.

At the moment I keep re-picking a random element from the match array until it isn't a 1px.gif, but of course that's a poor solution.

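I can't write this regex myself, but I understand that the regex above searches for <img, followed by a word boundary, followed by any characters that are not >. I need to add "and does not contain '1px.gif'".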

I could go through the match array and remove every 1px.gif entry, but I'd prefer a regex.
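
The "and does not contain" condition can be expressed with a negative lookahead placed right after the tag name. The following is a sketch built on the regex from the question; the function name findImgTagsExcluding is hypothetical, and like the original pattern it assumes no attribute value contains a literal >.

```php
<?php
// Hypothetical helper: every <img ...> tag whose text does not contain $excludedSrc.
function findImgTagsExcluding(string $html, string $excludedSrc = '1px.gif'): array
{
    $tag = 'img';
    // (?![^>]*1px\.gif) fails the match if the excluded text appears before the closing ">".
    $regex = '/<' . $tag . '\b(?![^>]*' . preg_quote($excludedSrc, '/') . ')[^>]*>/i';

    preg_match_all($regex, $html, $matches);
    return $matches[0];
}
```

A random element can then be picked from the returned array directly, with no retry loop.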

php regex xpath html-parsing domdocument

Mar*_*oDS

2013-01-07

Score: -2, answers: 1, views: 691

PHP Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity

The problem is this: if I run the script from the console

C:\Users\Dima>php C:\wamp\www\shop\index.php "test" 2

then I get an error:

PHP Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 802 in C:\wamp\www\rozetka\app.php on line 40

index.php:

// Both arguments are used below, so both must be present.
if (isset($argv[1]) && isset($argv[2])) {
    return new App((string)$argv[1], (int)$argv[2]);
}

If I just run index.php in the browser with

$app = new App('test', 2);

then the application works fine with no errors. Please help me launch the application from the console. Sorry for my English.
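
This warning is usually raised when the HTML being parsed contains a bare & that does not start a valid entity. One possible reason it shows up only on the console, which the question does not confirm, is that the CLI configuration displays warnings while the web configuration hides them. The sketch below collects parser errors instead of printing them; the function name loadHtmlQuietly is hypothetical.

```php
<?php
// Hypothetical helper: parses HTML and returns [document, collected libxml errors].
function loadHtmlQuietly(string $html): array
{
    // Route libxml errors into an internal buffer instead of PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $errors = libxml_get_errors();      // array of LibXMLError objects, possibly empty
    libxml_clear_errors();
    libxml_use_internal_errors($previous);   // restore the caller's setting

    return [$dom, $errors];
}

// A bare "&" in the href and in the text is the typical trigger for this warning.
[$dom, $errors] = loadHtmlQuietly('<p><a href="?page=1&sort=asc">Tom & Jerry</a></p>');
```

The document is still built despite the malformed entities, so the rest of the script can use it unchanged.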

php domdocument

Author

2013-01-13

Score: -2, answers: 1, views: 6870

Tag statistics

domdocument ×9

php ×9

xml ×2

xpath ×2

dom ×1

domxpath ×1

getelementbyid ×1

html ×1

html-parsing ×1

json ×1

parsing ×1

regex ×1

simplexml ×1

web-scraping ×1
