QHa*_*arr 5 html excel vba dom web-scraping
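Situation:
I am trying to inspect a variable, a, which shows up in the Locals window as an object of type DispStaticNodeList; every time I try to do so, Excel crashes.
Here is the variable a in the Locals window, apparently of type DispStaticNodeList: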
To reproduce the Excel crash:
Looping over it with a For Each also causes a crash.*
Research highlights:
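1. Searching for Excel + Crash + DispStaticNodeList returned zero results, at least with the Google search terms I used. Pretty sure my Google-Fu is weak.
2. If I believe this article, I am dealing with an object supplied by MSHTML.
3. According to this:
   "If the name is DispStaticNodeList, we can be pretty sure it's an array..(or at least has array semantics)."
4. Based on point 3, I wrote the TestPass code below, which does loop over it successfully, but I don't fully understand why. I have set an object and then looped over its Len!
5. "NodeList objects are collections of nodes, such as those returned by properties such as Node.childNodes and the document.querySelectorAll() method."
So it looks like the object may be a NodeList, which seems about right given the description in the Immediate window. As a list, I can loop over its length, but I am not sure why For Each doesn't work or why Excel crashes. A colleague suggested it may crash because of the hierarchical nature of the data. I further note that there are classes called IDOMNodeIterator and NodeIterator, but I am not sure whether I can use those in line with the NodeList methods described here.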
Question:
What is a, and why does it cause Excel to crash when I try to inspect it or loop over it with a For Each?
Code that loops successfully:
Option Explicit

Public Sub TestPass()
    Dim html As HTMLDocument
    Set html = GetTestHTML
    Dim a As Object, b As Object
    Set a = html.querySelectorAll("div.intro p")
    Dim i As Long
    For i = 0 To Len(a) - 1
        On Error Resume Next
        Debug.Print a(i).innerText '<== HTMLParaElement
        On Error GoTo 0
    Next i
End Sub

Public Function GetTestHTML(Optional ByVal url As String = "https://www.w3schools.com/cssref/trysel.asp") As HTMLDocument
    Dim http As New XMLHTTP60
    Dim html As New HTMLDocument
    With http 'Set http = CreateObject("MSXML2.XMLHttp60")
        .Open "GET", url, False
        .send
        html.body.innerHTML = .responseText
        Set GetTestHTML = html
    End With
End Function
*TestFail code that causes the crash:
Public Sub TestFail()
    Dim html As HTMLDocument
    Set html = GetTestHTML
    Dim a As Object, b As Object
    Set a = html.querySelectorAll("div.intro p")
    For Each b In a
    Next b
End Sub
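Notes:
I sent a test workbook to a colleague, who was also able to reproduce this behaviour with the given example.
Project references:
HTML sample (a link is also provided)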
Edit: I was also able to loop in the following way:
Public Sub Test()
    Dim html As MSHTML.HTMLDocument, i As Long
    Set html = GetTestHTML
    For i = 0 To html.querySelectorAll("div.intro p").Length - 1
        Debug.Print html.querySelectorAll("div.intro p")(i).innerText
    Next i
End Sub
If the name is DispStaticNodeList, we can be pretty sure it's an array..(or at least has array semantics).
Arrays can normally be iterated with a For Each loop, although it's more efficient to iterate them with a For loop. Looks like what you're getting isn't exactly an array: while it appears to support indexing, it apparently doesn't support enumeration, which would explain why it blows up when you attempt to enumerate it with a For Each loop.
Looks like the locals toolwindow might be using For Each semantics to list the items in the collection.
I'm not familiar with that particular library so this is a bit of (educated) guesswork, but it's pretty easy to make a custom COM collection type that can't be iterated with a For Each loop in VBA - normally the error is caught on the VBA side though... Seems there might be a bug in the library's enumerator implementation (assuming there's an enumerator for it) causing it to throw an exception that ends up unhandled and somehow takes everything down with it... thing is, you can't fix & recompile that library... so the only thing you can do is to avoid iterating that type with a For Each loop, and avoid expanding it in the locals toolwindow (and so, ...save your work often!).
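As a rough sketch of that workaround, the list can be walked by index through its Length and item members, so no enumerator is ever requested. This assumes the MSHTML reference from the question is set and that the question's GetTestHTML function lives in the same project; the procedure name PrintIntroParagraphs is only illustrative.

```vba
Option Explicit

' Sketch only: index-based walk over the node list, never For Each.
' Assumes the question's GetTestHTML function is available in this project.
Public Sub PrintIntroParagraphs()
    Dim html As MSHTML.HTMLDocument
    Dim nodes As Object
    Dim i As Long

    Set html = GetTestHTML
    Set nodes = html.querySelectorAll("div.intro p")

    ' Length and item are called late-bound; the list is indexed, not enumerated.
    For i = 0 To nodes.Length - 1
        Debug.Print nodes.item(i).innerText
    Next i
End Sub
```

Holding the result of querySelectorAll in a variable also avoids re-running the selector on every iteration, as the loop in the question's edit does.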
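This article does a good job of explaining how COM enumeration works, from a C#/.NET perspective. Of course that library isn't managed code (.NET), but the COM concepts at play are the same.
TL;DR: It's not because you can For...Next that you can For Each; the COM type involved must explicitly support enumeration. If the VBA code compiles with a For Each loop, then it does, so it must be a bug in the type's enumerator.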
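To illustrate what "explicitly support enumeration" means on the VBA side, here is a minimal sketch of a custom collection class. The class name ParagraphList and its members are hypothetical and have nothing to do with MSHTML. Indexing works through Item and Count alone; For Each only works because of the NewEnum member carrying the VB_UserMemId = -4 attribute. Attribute lines cannot be typed in the VBE editor, so the text is meant to be saved as ParagraphList.cls and imported.

```vba
VERSION 1.0 CLASS
BEGIN
  MultiUse = -1  'True
END
Attribute VB_Name = "ParagraphList"
Attribute VB_GlobalNameSpace = False
Attribute VB_Creatable = False
Attribute VB_PredeclaredId = False
Attribute VB_Exposed = False
Option Explicit

' Hypothetical wrapper around a VBA Collection, for illustration only.
Private mItems As Collection

Private Sub Class_Initialize()
    Set mItems = New Collection
End Sub

Public Sub Add(ByVal entry As String)
    mItems.Add entry
End Sub

Public Property Get Count() As Long
    Count = mItems.Count
End Property

' Index access (1-based, like Collection): enough for a For...Next loop.
Public Property Get Item(ByVal index As Long) As String
    Item = mItems(index)
End Property

' Enumerator: For Each relies on this member being marked with DispId -4.
Public Property Get NewEnum() As IUnknown
Attribute NewEnum.VB_UserMemId = -4
    Set NewEnum = mItems.[_NewEnum]
End Property
```

A usage sketch in a standard module, exercising both loop styles:

```vba
Public Sub DemoParagraphList()
    Dim list As ParagraphList, entry As Variant, i As Long
    Set list = New ParagraphList
    list.Add "first"
    list.Add "second"

    For i = 1 To list.Count      ' needs only Count and Item
        Debug.Print list.Item(i)
    Next i

    For Each entry In list       ' needs the NewEnum member
        Debug.Print entry
    Next entry
End Sub
```

If the NewEnum member is removed, the For...Next loop should keep working, while the For Each loop is expected to fail with an ordinary VBA error rather than a crash.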