kir*_*wad 5 html excel vba screen-scraping web-scraping
我正在尝试从下面提到的网站复制数据,我需要各种尺寸、价格、便利设施、特价、预订。我在代码下方构图,但我能够正确复制元素。第一件事只有三个元素在处理重复,我也没有得到 Amenities 和 Reserve 的结果。有人可以看看这个吗?
Sub text()
Dim ie As New InternetExplorer, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Unit Data")
With ie
.Visible = True
.Navigate2 "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955"
While .Busy Or .readyState < 4: DoEvents: Wend
Sheets("Unit Data").Select
Dim listings As Object, listing As Object, headers(), results()
Dim r As Long, list As Object, item As Object
headers = Array("size", "features", "Specials", "Price", "Reserve")
Set list = .document.getElementsByClassName("units_table")
'.unit_size medium, .features, .Specials, .price, .Reserve
Dim rowCount As Long
rowCount = .document.querySelectorAll(".tab_container li").Length
ReDim results(1 To rowCount, 1 To UBound(headers) + 1)
For Each listing In list
For Each item In listing.getElementsByClassName("unitinfo even")
r = r + 1
results(r, 1) = listing.getElementsByClassName("size secondary-color-text")(0).innerText
results(r, 2) = listing.getElementsByClassName("amenities")(0).innerText
results(r, 3) = listing.getElementsByClassName("offer1")(0).innerText
results(r, 4) = listing.getElementsByClassName("rate_text primary-color-text rate_text--clear")(0).innerText
results(r, 5) = listing.getElementsByClassName("reserve")(0).innerText
Next
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
.Quit
End With
Worksheets("Unit Data").Range("A:G").Columns.AutoFit
End Sub
Run Code Online (Sandbox Code Playgroud)
太长;博士;
\n提前(向某些人)对答案的长度表示歉意,但我想我会利用这个教学时刻来详细说明正在发生的事情。
\n我使用的总体方法与您的代码中相同:找到一个 css 选择器来隔离行(尽管位于不同的选项卡中,小、中、大实际上仍然全部出现在页面上):
\nSet listings = html.querySelectorAll(".unitinfo")\nRun Code Online (Sandbox Code Playgroud)\n上面生成了行。和以前一样,我们将其转储到新的中,HTMLDocument以便我们可以利用querySelector/querySelectorAll方法。
行:
\n让我们看一下我们正在检索的第一行 html。后续部分将以这一行作为案例研究来讨论如何检索信息:
\nSet listings = html.querySelectorAll(".unitinfo")\nRun Code Online (Sandbox Code Playgroud)\r\n我们要处理的每一行变量内部都会有类似的 html html2。如果您有疑问,请查看上面所示函数中的 javascript:
$(\'.units_table tr.unitinfo\').each(function(index, el) \nRun Code Online (Sandbox Code Playgroud)\n它使用相同的选择器(但还指定父表类和元素类型 ( tr))。基本上,为表中的每一行调用该函数。
尺寸:
\n由于某种原因,开始td标签被删除(我<table>认为我已经看到缺少父标签的情况),因此对于大小,我不是按类抓取,而是寻找结束标签的开头并提取字符串到那里。我通过将Instr给出的返回值(其中在字符串中找到 <)-1 传递给Left$(类型化)函数来实现此目的。
results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)\nRun Code Online (Sandbox Code Playgroud)\n这样就返回了5x5。
描述:
\n描述列由我们上面看到的函数填充(记住,该函数应用于每一行)
\n这个位 - $("#a5x5-1").tooltip- 告诉它目标在哪里,然后函数的 return 语句提供了 html,其中有一个div, 和 class description,其中包含我们想要的文本。由于我们没有使用浏览器,并且我使用的是 64 位 Windows,因此我无法评估此脚本,但我可以使用它来提取结束关联标记的split开头和开头之间的字符串(描述):"description\\">div
results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\\"">")(1), "</div>")(0)\nRun Code Online (Sandbox Code Playgroud)\n这将返回:
\n“把它想象成一个标准的壁橱。这个空间大约 25 平方英尺,非常适合放置大约十二个盒子、一张桌子、一把椅子和一辆自行车。”
\n费率类型和价格:
\n这些很简单,并使用类名来定位:
\nresults(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)\nresults(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)\nRun Code Online (Sandbox Code Playgroud)\n返回(分别)
\n网页速率,\n\xc2\xa339.00
\n设施:
\n这就是事情有点棘手的地方。
\n让我们重新检查上面显示的与设施相关的 html 行:
\n5x5</TD> <TD class=features>\n<DIV id=a5x5-1 class="icon a5x5">\n<DIV class=img><IMG src="about:/core/resources/images/units/5x5_icon.png"></DIV>\n<DIV class=display>\n<P>More Information</P></DIV></DIV>\n<SCRIPT type=text/javascript>\n // Refine Search\n //\n $(function() {\n $("#a5x5-1").tooltip({\n track: false,\n delay: 0,\n showURL: false,\n left: 5,\n top: 5,\n bodyHandler: function () {\n return " <div class=\\"tooltip\\"> <div class=\\"tooltop\\"></div> <div class=\\"toolmid clearfix\\"> <div class=\\"toolcontent\\"> <div style=\\"text-align:center;width:100%\\"> <img alt=\\"5 x 5 storage unit\\" src=\\"/core/resources/images/units/5x5.png\\" /> </div> <div class=\\"display\\">5 x 5</div> <div class=\\"description\\">Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle.</div> </div> <div class=\\"clearfix\\"></div> </div> <div class=\\"toolfoot\\"></div> <div class=\\"clearfix\\"></div> </div> "}\n });\n });\n </SCRIPT>\n</TD><TD class=rates>\n<DIV class="discount_price secondary-color-text standard_price--left">\n<DIV class=price_text>Web Rate: </DIV>\n<DIV class="rate_text primary-color-text rate_text--clear">$39.00 </DIV></DIV>\n<SCRIPT>\n$( document ).ready(function() {\n $(\'.units_table tr.unitinfo\').each(function(index, el) {\n if ($(this).find(\'.standard_price\').length != \'\' && $(this).find(\'.discount_price\').length != \'\') {\n $(this).parents(\'.units_table\').addClass(\'both\');\n $(this).addClass(\'also-both\');\n $(this).find(\'.rates\').addClass(\'rates_two_column\');\n }\n });\n});\n</SCRIPT>\n</TD><TD class=amenities>\n<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>\n<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>\n<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD><TD class=offers>\n<DIV class=offer1>Call for Specials </DIV>\n<DIV class=offer2></DIV></TD><TD class=reserve><A id=5x5:39:00000000 class="facility_call_to_reserve cta_call primary-color primary-hover" href="about:blank#" rel=nofollow>Call </A></TD>Run Code Online (Sandbox Code Playgroud)\r\n我们可以看到父类td有一个 类amenities,它的子div元素具有复合类名;后者在每种情况下都用作便利设施类型的标识符,例如icon_climate。
当您将鼠标悬停在这些上时,页面上会显示工具提示信息:
\n\n我们可以在实际页面的 html 中跟踪此工具提示的位置:
\n\n当您将鼠标悬停在不同的便利设施上时,此内容会更新。
\n长话短说(他在页面中间说道!),此内容是从服务器上的 php 文件更新的。我们可以请求文件并构造一个字典,将每个便利设施的类名映射到相关描述amenity_icon icon_climate(例如,当转换为适当的 css 选择器时,复合类需要将“”替换为“.” )。您可以在此处.amenity_icon.icon_climate浏览 php 文件。
php 文件:
\n让我们只看文件的开头,以便剖析什么是重复模式的基本单元:
\n$(\'.units_table tr.unitinfo\').each(function(index, el) \nRun Code Online (Sandbox Code Playgroud)\r\n负责更新工具提示的函数是LoadTooltips。CSS 类选择器用于定位每个图标:
$(".units_table .amenity_icon.icon_climate").tooltip\nRun Code Online (Sandbox Code Playgroud)\n我们有指定返回文本的 bodyhandler:
\nresults(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)\nRun Code Online (Sandbox Code Playgroud)\r\n我们有重复组中出现的三位有用信息。元素的类名选择器、短描述和长描述,例如
\n.amenity_icon.icon_climate:我们使用它来将 php 文件描述映射到我们行中便利设施图标的类名。CSS 选择器Temperature Controlled; 工具提示函数的内部h4标记返回文本。简短的介绍Units are heated and/or cooled. See manager for details.; 工具提示函数的内部p标记返回文本。详细描述我编写了 2 个函数GetMatches和GetAmenitiesDescriptions,它们使用正则表达式为每个图标提取所有重复项,并返回一个字典,其中 css 选择器作为键,短字符description : long description作为值。
当我收集每行中的所有图标时:
\nSet icons = html2.querySelectorAll(".amenity_icon")\n \nRun Code Online (Sandbox Code Playgroud)\n我使用字典根据图标的类名返回工具提示描述
\nFor icon = 0 To icons.Length - 1 \'use class name of amenity to look up description\n amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))\nNext \nRun Code Online (Sandbox Code Playgroud)\n然后,我将描述加入到vbNewLine以确保输出位于输出单元格内的不同行上。
您可以在此处探索正则表达式。
\n正则表达式使用|(Or) 语法,因此我在单个列表中返回所有匹配的模式。
arr = GetMatches(re, s, "(\\.amenity_icon\\..*)""|<h4>(.*)<\\/h4>|<p>(.*)<\\/p>")\nRun Code Online (Sandbox Code Playgroud)\n因为我需要不同的子匹配(0,1 或 2 又名 css 类选择器、短 desc、长 desc),所以我使用Select Case i mod 3, 和计数器变量i来提取适当的子匹配。
php 文件中映射的匹配示例:
\n\n特价:
\n我们回到类选择器。Offer2未填充,因此您可以删除。
results(r, 6) = html2.querySelector(".offer1").innerText\nresults(r, 7) = html2.querySelector(".offer2").innerText\nRun Code Online (Sandbox Code Playgroud)\n返回(分别):
\n特价商品请致电,空字符串
\n结束语:
\n因此,上面将引导您完成一行。只需冲洗并在所有行的循环中重复即可。为了提高效率,将数据添加到数组中results;Sheet1然后一次性写入。我可以看到一些小的改进,但速度很快。
编程语言:
\nOption Explicit\nPublic Sub GetInfo()\n Dim ws As Worksheet, html As HTMLDocument, s As String, amenitiesDescriptions As Object\n Const URL As String = "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955"\n\n Set ws = ThisWorkbook.Worksheets("Sheet1")\n Set html = New HTMLDocument\n Set amenitiesDescriptions = GetAmenitiesDescriptions\n \n With CreateObject("MSXML2.XMLHTTP")\n .Open "GET", URL, False\n .setRequestHeader "User-Agent", "Mozilla/5.0"\n .send\n s = .responseText\n\n html.body.innerHTML = s\n\n Dim headers(), results(), listings As Object, amenities As String\n\n headers = Array("Size", "Description", "RateType", "Price", "Amenities", "Offer1", "Offer2")\n Set listings = html.querySelectorAll(".unitinfo")\n\n Dim rowCount As Long, numColumns As Long, r As Long, c As Long\n Dim icons As Object, icon As Long, amenitiesInfo(), i As Long, item As Long\n\n rowCount = listings.Length\n numColumns = UBound(headers) + 1\n\n ReDim results(1 To rowCount, 1 To numColumns)\n Dim html2 As HTMLDocument\n Set html2 = New HTMLDocument\n For item = 0 To listings.Length - 1\n r = r + 1\n html2.body.innerHTML = listings.item(item).innerHTML\n results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)\n results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\\"">")(1), "</div>")(0)\n results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)\n results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)\n \n Set icons = html2.querySelectorAll(".amenity_icon")\n ReDim amenitiesInfo(0 To icons.Length - 1)\n \n For icon = 0 To icons.Length - 1 \'use class name of amenity to look up description\n amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))\n Next\n\n amenities = Join$(amenitiesInfo, vbNewLine) \'place each amenity description on a new line within cell when written out\n\n results(r, 5) = amenities\n results(r, 6) = html2.querySelector(".offer1").innerText\n results(r, 7) = html2.querySelector(".offer2").innerText\n Next\n\n ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers\n ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results\n End With\nEnd Sub\n\nPublic Function GetAmenitiesDescriptions() As Object \'retrieve amenities descriptions from php file on server\n Dim s As String, dict As Object, re As Object, i As Long, arr() \'keys based on classname, short desc, full desc\n \' view regex here: https://regex101.com/r/bII5AL/1\n Set dict = CreateObject("Scripting.Dictionary")\n Set re = CreateObject("vbscript.regexp")\n \n With CreateObject("MSXML2.XMLHTTP")\n .Open "GET", "https://www.safeandsecureselfstorage.com/core/resources/js/src/common.tooltip.php", False\n .setRequestHeader "User-Agent", "Mozilla/5.0"\n .send\n s = .responseText\n \n arr = GetMatches(re, s, "(\\.amenity_icon\\..*)""|<h4>(.*)<\\/h4>|<p>(.*)<\\/p>")\n For i = LBound(arr) To UBound(arr) Step 3 \'build up lookup dictionary for amenities descriptions\n dict(arr(i)) = arr(i + 1) & ": " & arr(i + 2)\n Next\n End With\n Set GetAmenitiesDescriptions = dict\nEnd Function\n\nPublic Function GetMatches(ByVal re As Object, inputString As String, ByVal sPattern As String) As Variant\n Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long\n\n With re\n .Global = True\n .MultiLine = True\n .IgnoreCase = False\n .Pattern = sPattern\n If .test(inputString) Then\n Set matches = .Execute(inputString)\n ReDim arrMatches(0 To matches.Count - 1)\n For Each iMatch In matches\n Select Case i Mod 3\n Case 0\n arrMatches(i) = iMatch.SubMatches.item(0)\n Case 1\n arrMatches(i) = iMatch.SubMatches.item(1)\n Case 2\n arrMatches(i) = iMatch.SubMatches.item(2)\n End Select\n i = i + 1\n Next iMatch\n Else\n ReDim arrMatches(0)\n arrMatches(0) = vbNullString\n End If\n End With\n GetMatches = arrMatches\nEnd Function\nRun Code Online (Sandbox Code Playgroud)\n输出:
\n\n参考资料(VBE > 工具 > 参考资料):
\n