按类和标签名称的网页抓取元素

kir*_*wad 5 html excel vba screen-scraping web-scraping

我正在尝试从下面提到的网站复制数据,我需要各种尺寸、价格、便利设施、特价、预订。我在代码下方构图,但我能够正确复制元素。第一件事只有三个元素在处理重复,我也没有得到 Amenities 和 Reserve 的结果。有人可以看看这个吗?

Sub text()


Dim ie As New InternetExplorer, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Unit Data")
With ie
    .Visible = True
    .Navigate2 "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955" 

    While .Busy Or .readyState < 4: DoEvents: Wend

    Sheets("Unit Data").Select


    Dim listings As Object, listing As Object, headers(), results()
    Dim r As Long, list As Object, item As Object
    headers = Array("size", "features", "Specials", "Price", "Reserve")
    Set list = .document.getElementsByClassName("units_table")
    '.unit_size medium, .features, .Specials, .price, .Reserve
    Dim rowCount As Long
    rowCount = .document.querySelectorAll(".tab_container li").Length
    ReDim results(1 To rowCount, 1 To UBound(headers) + 1)
    For Each listing In list
            For Each item In listing.getElementsByClassName("unitinfo even")
            r = r + 1

          results(r, 1) = listing.getElementsByClassName("size secondary-color-text")(0).innerText
          results(r, 2) = listing.getElementsByClassName("amenities")(0).innerText
           results(r, 3) = listing.getElementsByClassName("offer1")(0).innerText
        results(r, 4) = listing.getElementsByClassName("rate_text primary-color-text rate_text--clear")(0).innerText
          results(r, 5) = listing.getElementsByClassName("reserve")(0).innerText





        Next
    Next
    ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
    ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    .Quit
End With

  Worksheets("Unit Data").Range("A:G").Columns.AutoFit
End Sub
Run Code Online (Sandbox Code Playgroud)

QHa*_*arr 2

太长;博士;

\n

提前(向某些人)对答案的长度表示歉意,但我想我会利用这个教学时刻来详细说明正在发生的事情。

\n

我使用的总体方法与您的代码中相同:找到一个 css 选择器来隔离行(尽管位于不同的选项卡中,小、中、大实际上仍然全部出现在页面上):

\n
Set listings = html.querySelectorAll(".unitinfo")\n
Run Code Online (Sandbox Code Playgroud)\n

上面生成了行。和以前一样,我们将其转储到新的中,HTMLDocument以便我们可以利用querySelector/querySelectorAll方法。

\n
\n

行:

\n

让我们看一下我们正在检索的第一行 html。后续部分将以这一行作为案例研究来讨论如何检索信息:

\n

\r\n
\r\n
Set listings = html.querySelectorAll(".unitinfo")\n
Run Code Online (Sandbox Code Playgroud)\r\n
\r\n
\r\n

\n

我们要处理的每一行变量内部都会有类似的 html html2。如果您有疑问,请查看上面所示函数中的 javascript:

\n
$(\'.units_table tr.unitinfo\').each(function(index, el) \n
Run Code Online (Sandbox Code Playgroud)\n

它使用相同的选择器(但还指定父表类和元素类型 ( tr))。基本上,为表中的每一行调用该函数。

\n
\n

尺寸:

\n

由于某种原因,开始td标签被删除(我<table>认为我已经看到缺少父标签的情况),因此对于大小,我不是按类抓取,而是寻找结束标签的开头并提取字符串到那里。我通过将Instr给出的返回值(其中在字符串中找到 <)-1 传递给Left$(类型化)函数来实现此目的。

\n

在此输入图像描述

\n
results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)\n
Run Code Online (Sandbox Code Playgroud)\n

这样就返回了5x5

\n
\n

描述:

\n

描述列由我们上面看到的函数填充(记住,该函数应用于每一行)

\n

这个位 - $("#a5x5-1").tooltip- 告诉它目标在哪里,然后函数的 return 语句提供了 html,其中有一个div, 和 class description,其中包含我们想要的文本。由于我们没有使用浏览器,并且我使用的是 64 位 Windows,因此我无法评估此脚本,但我可以使用它来提取结束关联标记的split开头和开头之间的字符串(描述):"description\\">div

\n
results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\\"">")(1), "</div>")(0)\n
Run Code Online (Sandbox Code Playgroud)\n

这将返回:

\n

“把它想象成一个标准的壁橱。这个空间大约 25 平方英尺,非常适合放置大约十二个盒子、一张桌子、一把椅子和一辆自行车。”

\n
\n

费率类型和价格:

\n

这些很简单,并使用类名来定位:

\n
results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)\nresults(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)\n
Run Code Online (Sandbox Code Playgroud)\n

返回(分别)

\n

网页速率,\n\xc2\xa339.00

\n
\n

设施:

\n

这就是事情有点棘手的地方。

\n

让我们重新检查上面显示的与设施相关的 html 行:

\n

\r\n
\r\n
5x5</TD> <TD class=features>\n<DIV id=a5x5-1 class="icon a5x5">\n<DIV class=img><IMG src="about:/core/resources/images/units/5x5_icon.png"></DIV>\n<DIV class=display>\n<P>More Information</P></DIV></DIV>\n<SCRIPT type=text/javascript>\n                // Refine Search\n                //\n                $(function() {\n                    $("#a5x5-1").tooltip({\n                        track: false,\n                        delay: 0,\n                        showURL: false,\n                        left: 5,\n                        top: 5,\n                        bodyHandler: function () {\n                            return "        <div class=\\"tooltip\\">            <div class=\\"tooltop\\"></div>            <div class=\\"toolmid clearfix\\">                <div class=\\"toolcontent\\">                    <div style=\\"text-align:center;width:100%\\">                        <img alt=\\"5 x 5 storage unit\\" src=\\"/core/resources/images/units/5x5.png\\" />                    </div>                    <div class=\\"display\\">5 x 5</div>                    <div class=\\"description\\">Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle.</div>                </div>                <div class=\\"clearfix\\"></div>            </div>            <div class=\\"toolfoot\\"></div>            <div class=\\"clearfix\\"></div>        </div>        "}\n                    });\n                });\n        </SCRIPT>\n</TD><TD class=rates>\n<DIV class="discount_price secondary-color-text standard_price--left">\n<DIV class=price_text>Web Rate: </DIV>\n<DIV class="rate_text primary-color-text rate_text--clear">$39.00 </DIV></DIV>\n<SCRIPT>\n$( document ).ready(function() {\n    $(\'.units_table tr.unitinfo\').each(function(index, el) {\n        if ($(this).find(\'.standard_price\').length != \'\' && $(this).find(\'.discount_price\').length != \'\') {\n            $(this).parents(\'.units_table\').addClass(\'both\');\n            $(this).addClass(\'also-both\');\n            $(this).find(\'.rates\').addClass(\'rates_two_column\');\n        }\n    });\n});\n</SCRIPT>\n</TD><TD class=amenities>\n<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>\n<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>\n<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD><TD class=offers>\n<DIV class=offer1>Call for Specials </DIV>\n<DIV class=offer2></DIV></TD><TD class=reserve><A id=5x5:39:00000000 class="facility_call_to_reserve cta_call primary-color primary-hover" href="about:blank#" rel=nofollow>Call </A></TD>
Run Code Online (Sandbox Code Playgroud)\r\n
\r\n
\r\n

\n

我们可以看到父类td有一个 类amenities,它的子div元素具有复合类名;后者在每种情况下都用作便利设施类型的标识符,例如icon_climate

\n

当您将鼠标悬停在这些上时,页面上会显示工具提示信息:

\n

在此输入图像描述

\n

我们可以在实际页面的 html 中跟踪此工具提示的位置:

\n

在此输入图像描述

\n

当您将鼠标悬停在不同的便利设施上时,此内容会更新。

\n

长话短说(他在页面中间说道!),此内容是从服务器上的 php 文件更新的。我们可以请求文件并构造一个字典,将每个便利设施的类名映射到相关描述amenity_icon icon_climate(例如,当转换为适当的 css 选择器时,复合类需要将“”替换为“.” )。您可以在此处.amenity_icon.icon_climate浏览 php 文件。

\n

php 文件:

\n

让我们只看文件的开头,以便剖析什么是重复模式的基本单元:

\n

\r\n
\r\n
$(\'.units_table tr.unitinfo\').each(function(index, el) \n
Run Code Online (Sandbox Code Playgroud)\r\n
\r\n
\r\n

\n

负责更新工具提示的函数是LoadTooltips。CSS 类选择器用于定位每个图标:

\n
$(".units_table .amenity_icon.icon_climate").tooltip\n
Run Code Online (Sandbox Code Playgroud)\n

我们有指定返回文本的 bodyhandler:

\n

\r\n
\r\n
results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)\n
Run Code Online (Sandbox Code Playgroud)\r\n
\r\n
\r\n

\n

我们有重复组中出现的三位有用信息。元素的类名选择器、短描述和长描述,例如

\n
    \n
  1. .amenity_icon.icon_climate:我们使用它来将 php 文件描述映射到我们行中便利设施图标的类名。CSS 选择器
  2. \n
  3. Temperature Controlled; 工具提示函数的内部h4标记返回文本。简短的介绍
  4. \n
  5. Units are heated and/or cooled. See manager for details.; 工具提示函数的内部p标记返回文本。详细描述
  6. \n
\n

我编写了 2 个函数GetMatchesGetAmenitiesDescriptions,它们使用正则表达式为每个图标提取所有重复项,并返回一个字典,其中 css 选择器作为键,短字符description : long description作为值。

\n

当我收集每行中的所有图标时:

\n
Set icons = html2.querySelectorAll(".amenity_icon")\n        \n
Run Code Online (Sandbox Code Playgroud)\n

我使用字典根据图标的类名返回工具提示描述

\n
For icon = 0 To icons.Length - 1 \'use class name of amenity to look up description\n    amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))\nNext        \n
Run Code Online (Sandbox Code Playgroud)\n

然后,我将描述加入到vbNewLine以确保输出位于输出单元格内的不同行上。

\n

您可以在此处探索正则表达式。

\n

正则表达式使用|(Or) 语法,因此我在单个列表中返回所有匹配的模式。

\n
arr = GetMatches(re, s, "(\\.amenity_icon\\..*)""|<h4>(.*)<\\/h4>|<p>(.*)<\\/p>")\n
Run Code Online (Sandbox Code Playgroud)\n

因为我需要不同的子匹配(0,1 或 2 又名 css 类选择器、短 desc、长 desc),所以我使用Select Case i mod 3, 和计数器变量i来提取适当的子匹配。

\n

php 文件中映射的匹配示例:

\n

在此输入图像描述

\n
\n

特价:

\n

我们回到类选择器。Offer2未填充,因此您可以删除。

\n
results(r, 6) = html2.querySelector(".offer1").innerText\nresults(r, 7) = html2.querySelector(".offer2").innerText\n
Run Code Online (Sandbox Code Playgroud)\n

返回(分别):

\n

特价商品请致电,空字符串

\n
\n

结束语:

\n

因此,上面将引导您完成一行。只需冲洗并在所有行的循环中重复即可。为了提高效率,将数据添加到数组中resultsSheet1然后一次性写入。我可以看到一些小的改进,但速度很快。

\n
\n

编程语言:

\n
Option Explicit\nPublic Sub GetInfo()\n    Dim ws As Worksheet, html As HTMLDocument, s As String, amenitiesDescriptions As Object\n    Const URL As String = "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955"\n\n    Set ws = ThisWorkbook.Worksheets("Sheet1")\n    Set html = New HTMLDocument\n    Set amenitiesDescriptions = GetAmenitiesDescriptions\n    \n    With CreateObject("MSXML2.XMLHTTP")\n        .Open "GET", URL, False\n        .setRequestHeader "User-Agent", "Mozilla/5.0"\n        .send\n        s = .responseText\n\n        html.body.innerHTML = s\n\n        Dim headers(), results(), listings As Object, amenities As String\n\n        headers = Array("Size", "Description", "RateType", "Price", "Amenities", "Offer1", "Offer2")\n        Set listings = html.querySelectorAll(".unitinfo")\n\n        Dim rowCount As Long, numColumns As Long, r As Long, c As Long\n        Dim icons As Object, icon As Long, amenitiesInfo(), i As Long, item As Long\n\n        rowCount = listings.Length\n        numColumns = UBound(headers) + 1\n\n        ReDim results(1 To rowCount, 1 To numColumns)\n        Dim html2 As HTMLDocument\n        Set html2 = New HTMLDocument\n        For item = 0 To listings.Length - 1\n            r = r + 1\n            html2.body.innerHTML = listings.item(item).innerHTML\n            results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)\n            results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\\"">")(1), "</div>")(0)\n            results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)\n            results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)\n            \n            Set icons = html2.querySelectorAll(".amenity_icon")\n            ReDim amenitiesInfo(0 To icons.Length - 1)\n            \n            For icon = 0 To icons.Length - 1 \'use class name of amenity to look up description\n                amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))\n            Next\n\n            amenities = Join$(amenitiesInfo, vbNewLine) \'place each amenity description on a new line within cell when written out\n\n            results(r, 5) = amenities\n            results(r, 6) = html2.querySelector(".offer1").innerText\n            results(r, 7) = html2.querySelector(".offer2").innerText\n        Next\n\n        ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers\n        ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results\n    End With\nEnd Sub\n\nPublic Function GetAmenitiesDescriptions() As Object \'retrieve amenities descriptions from php file on server\n    Dim s As String, dict As Object, re As Object, i As Long, arr() \'keys based on classname, short desc, full desc\n    \' view regex here: https://regex101.com/r/bII5AL/1\n    Set dict = CreateObject("Scripting.Dictionary")\n    Set re = CreateObject("vbscript.regexp")\n    \n    With CreateObject("MSXML2.XMLHTTP")\n        .Open "GET", "https://www.safeandsecureselfstorage.com/core/resources/js/src/common.tooltip.php", False\n        .setRequestHeader "User-Agent", "Mozilla/5.0"\n        .send\n        s = .responseText\n        \n        arr = GetMatches(re, s, "(\\.amenity_icon\\..*)""|<h4>(.*)<\\/h4>|<p>(.*)<\\/p>")\n        For i = LBound(arr) To UBound(arr) Step 3  \'build up lookup dictionary for amenities descriptions\n            dict(arr(i)) = arr(i + 1) & ": " & arr(i + 2)\n        Next\n    End With\n    Set GetAmenitiesDescriptions = dict\nEnd Function\n\nPublic Function GetMatches(ByVal re As Object, inputString As String, ByVal sPattern As String) As Variant\n    Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long\n\n    With re\n        .Global = True\n        .MultiLine = True\n        .IgnoreCase = False\n        .Pattern = sPattern\n        If .test(inputString) Then\n            Set matches = .Execute(inputString)\n            ReDim arrMatches(0 To matches.Count - 1)\n            For Each iMatch In matches\n                Select Case i Mod 3\n                Case 0\n                    arrMatches(i) = iMatch.SubMatches.item(0)\n                Case 1\n                    arrMatches(i) = iMatch.SubMatches.item(1)\n                Case 2\n                    arrMatches(i) = iMatch.SubMatches.item(2)\n                End Select\n                i = i + 1\n            Next iMatch\n        Else\n            ReDim arrMatches(0)\n            arrMatches(0) = vbNullString\n        End If\n    End With\n    GetMatches = arrMatches\nEnd Function\n
Run Code Online (Sandbox Code Playgroud)\n
\n

输出:

\n

在此输入图像描述

\n
\n

参考资料(VBE > 工具 > 参考资料):

\n
    \n
  1. 微软HTML对象库
  2. \n
\n