Goutte - Get a list with the date at the top and the titles below

Ann*_*lee 2 php goutte domcrawler

I am using "fabpot/goutte": "^4.0".

I am trying to get the dates and release titles from this site into an array.

Here is my runnable example:

<?php

require_once '../vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;
use Goutte\Client;

try {

    $resArr = array();
    $tempArr = array();

    $url = "https://www.steelcitycollectibles.com/product-release-calendar";

    // get page
    $client = new Client();
    $content = $client->request('GET', $url)->html();
    $crawler = new Crawler($content, null, null);

    $table = $crawler->filter('#schedule'); //->first()->closest('table');

    $index = 0;
    $resArr = array();
    $table->filter('div')
        ->each(function (Crawler $tr) use (&$index, &$resArr) {

            if ($tr->filter('.schedule-date')->count() > 0) {
                $releaseDate = $tr->filter('.schedule-date')->text();
            }

            if ($tr->filter('div > div.eight.columns > a')->count() > 0) {
                $releaseStr = $tr->filter('div > div.eight.columns > a')->text();
                array_push($resArr, [$releaseDate, $releaseStr]);
            }

        });

    var_dump($resArr);
} catch (Exception $e) {}

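However, I do not get the correct date for every item: some entries come back with a null date.

For the null values, I would like to fill in the correct date. In this case that is 12/20/21.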

mik*_*n32 5

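Assuming you want to apply the most recently seen date to each element of the array, you just need to set a default value and then update it inside the loop. The date has to be passed by reference as well, because the anonymous function's local state is reset on every call.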
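To see the effect of the reference in isolation, here is a minimal sketch that needs no crawler at all. The $rows array is hypothetical sample data (null stands for "no date on this row"), and 'none' is just a placeholder default. The closure in the question does not capture the date variable at all, which behaves much like the by-value case: whatever is assigned inside one call is gone in the next.

```php
<?php

// Hypothetical sample data: null means "this row carries no date of its own".
$rows = ['12/22/21', null, null, '12/24/21', null];

// By-value capture: every call starts again from the captured value 'none'.
$last = 'none';
$byValue = array_map(function ($row) use ($last) {
    if ($row !== null) {
        $last = $row; // only changes this call's private copy
    }
    return $last;
}, $rows);

// By-reference capture: the assignment is still visible in the next call.
$last = 'none';
$byReference = array_map(function ($row) use (&$last) {
    if ($row !== null) {
        $last = $row;
    }
    return $last;
}, $rows);

echo implode(', ', $byValue), PHP_EOL;     // 12/22/21, none, none, 12/24/21, none
echo implode(', ', $byReference), PHP_EOL; // 12/22/21, 12/22/21, 12/22/21, 12/24/21, 12/24/21
```

Applied to the code from the question, the same idea looks like this: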

<?php

require_once '../vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;
use Goutte\Client;

try {

    $resArr = [];

    $content = <<< HTML
<div id="schedule" class="schedule nine columns">
    <div class="schedule-date">12/22/21</div>
    <div class="schedule-list clear">
        <div class="eight columns">
            <a href="xxx" class="schedule-product-title ">2022 Gold Rush Autographed Full-Size Speed Flex Helmet Edition Series 1 2-Box Case</a>
        </div>
        <div class="schedule-notify three columns">
            <release-schedule-notify type="'release'"/>
        </div>
    </div>
    <div class="schedule-list clear">
        <div class="eight columns">
            <a href="xxx" class="schedule-product-title ">2022 Gold Rush Autographed Full-Size Speed Flex Helmet Edition Series 1 Box</a>
        </div>
        <div class="schedule-notify three columns">
            <release-schedule-notify type="'release'"/>
        </div>
    </div>
    <div class="schedule-date">12/24/21</div>
    <div class="schedule-list clear">
        <div class="eight columns">
            <a href="xxx">2021 Panini Flawless Baseball Hobby 2-Box Case</a>
        </div>
        <div class="schedule-notify three columns">
            <release-schedule-notify type="'release'"/>
        </div>
    </div>
    <div class="schedule-list clear">
        <div class="eight columns">
            <a href="xxx">2021 Panini Flawless Baseball Hobby Box</a>
        </div>
        <div class="schedule-notify three columns">
            <release-schedule-notify type="'release'"/>
        </div>
    </div>
</div>
HTML;

    $crawler = new Crawler($content, null, null);

    $table = $crawler->filter('#schedule');

    // use today's date as a default, in case first one is missing
    $releaseDate = (new DateTime())->format("m/d/y");
    $table->filter('div')
        ->each(function (Crawler $tr) use (&$resArr, &$releaseDate) {
            if ($tr->filter('.schedule-date')->count() > 0) {
                // update the date if it exists, otherwise continue with the old one
                $releaseDate = $tr->filter('.schedule-date')->text();
            }
            if ($tr->filter('div > div.eight.columns > a')->count() > 0) {
                $releaseStr = $tr->filter('div > div.eight.columns > a')->text();
                $resArr[] = [$releaseDate, $releaseStr];
            }
        });
} catch (Exception $e) {}

echo json_encode($resArr, JSON_PRETTY_PRINT);

Output:

[
    [
        "12\/22\/21",
        "2022 Gold Rush Autographed Full-Size Speed Flex Helmet Edition Series 1 2-Box Case"
    ],
    [
        "12\/22\/21",
        "2022 Gold Rush Autographed Full-Size Speed Flex Helmet Edition Series 1 2-Box Case"
    ],
    [
        "12\/22\/21",
        "2022 Gold Rush Autographed Full-Size Speed Flex Helmet Edition Series 1 Box"
    ],
    [
        "12\/24\/21",
        "2021 Panini Flawless Baseball Hobby 2-Box Case"
    ],
    [
        "12\/24\/21",
        "2021 Panini Flawless Baseball Hobby Box"
    ]
]
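
Note that the first product appears twice in this output. The most likely reason is that filter('div') matches every div at any depth, including the #schedule element itself, and for that outer element the first date and the first link are both found. One way to avoid this, sketched here under the assumption that the real page is structured like the sample HTML, is to iterate only over the direct children of #schedule. The product names below are hypothetical placeholders, and the autoload path is simply the one used in the question.

```php
<?php

require_once '../vendor/autoload.php'; // path taken from the question; adjust as needed

use Symfony\Component\DomCrawler\Crawler;

// Trimmed-down sample markup, assumed to mirror the structure of the real page.
$content = <<<HTML
<div id="schedule">
    <div class="schedule-date">12/22/21</div>
    <div class="schedule-list clear">
        <div class="eight columns"><a href="xxx">Product A</a></div>
    </div>
    <div class="schedule-list clear">
        <div class="eight columns"><a href="xxx">Product B</a></div>
    </div>
    <div class="schedule-date">12/24/21</div>
    <div class="schedule-list clear">
        <div class="eight columns"><a href="xxx">Product C</a></div>
    </div>
</div>
HTML;

$crawler = new Crawler($content);

$resArr = [];
$releaseDate = null; // stays null until the first date row is seen

// '#schedule > div' selects only the top-level rows, not the container or nested divs.
$crawler->filter('#schedule > div')
    ->each(function (Crawler $row) use (&$resArr, &$releaseDate) {
        // A date row: remember its text for the product rows that follow.
        if ($row->filter('.schedule-date')->count() > 0) {
            $releaseDate = $row->filter('.schedule-date')->text();
            return;
        }
        // A product row: pair the title with the most recently seen date.
        $link = $row->filter('div.eight.columns > a');
        if ($link->count() > 0) {
            $resArr[] = [$releaseDate, $link->text()];
        }
    });

echo json_encode($resArr, JSON_PRETTY_PRINT);
```

With this sample markup the result should contain exactly three entries, one per product, each paired with the date row that precedes it.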

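As a side note, Goutte's documentation says that the request() method returns a Crawler object. You are unnecessarily extracting the HTML and creating the Crawler object by hand. Change your code to: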

// get page
$crawler = (new Client)->request('GET', $url);