bre*_*tdj 35 html regex xml parsing vba
目标
我希望从Cricinfo网站上获取 20/20板球记分卡数据,理想情况下将其转换为CSV格式,以便在Excel中进行数据分析
例如,目前的澳大利亚Big Bash 2011/12记分卡可以从
背景
我精通使用VBA(自动化IE
或使用XMLHTTP
然后使用正则表达式)从网站上抓取数据,即
从HTML TD和Tr中提取值
在同一个问题中,发表了一条评论,建议使用html解析 - 我之前没有遇到过 - 所以我看了一些问题,比如除了XHTML自包含标签之外的RegEx匹配开放标签
询问
虽然我可以编写一个正则表达式来解析下面的板球数据,但我想知道如何使用html解析有效地检索这些结果.
请记住,我的偏好是可重复的CSV格式,包含:
Nirvana对我来说是一个可以使用VBA或VBscript部署的解决方案,所以我可以完全自动化我的分析,但我认为我将不得不使用一个单独的工具来进行HTML解析.
示例站点链接和要提取的数据
Sid*_*out 50
我使用"VBA"有两种技术.我将逐一描述它们.
1)使用FireFox/Firebug Addon/Fiddler
2)使用Excel的内置工具从Web获取数据
由于这篇文章会被许多人阅读所以我甚至会覆盖这个明显的内容.请随意跳过你知道的任何部分
1)使用FireFox/Firebug Addon/Fiddler
FireFox:http://en.wikipedia.org/wiki/Firefox 免费下载(http://www.mozilla.org/en-US/firefox/new/)
Firebug Addon:http://en.wikipedia.org/wiki/Firebug_%28software%29 免费下载(https://addons.mozilla.org/en-US/firefox/addon/firebug/)
Fiddler:http://en.wikipedia.org/wiki/Fiddler_%28software%29 免费下载(http://www.fiddler2.com/fiddler2/)
安装Firefox后,安装Firebug Addon.Firebug Addon允许您检查网页中的不同元素.例如,如果您想知道按钮的名称,只需右键单击它并单击"使用Firebug检查元素",它将为您提供该按钮所需的所有详细信息.
另一个例子是在网站上找到一个表格的名称,该表格包含您需要报废的数据.
我只在使用XMLHTTP时才使用Fiddler.它可以帮助我查看单击按钮时传递的确切信息.由于BOTS数量的增加,现在,大多数网站都在阻止自动报废,捕获鼠标坐标并传递信息,而fiddler实际上可以帮助您调试正在传递的信息.我不会在这里详细介绍它,因为这些信息可以被恶意使用.
现在让我们举一个关于如何抓取问题中发布的URL的简单示例
http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html
首先,让我们找到具有该信息的表的名称.只需右键单击表格,然后单击"使用Firebug检查元素",它将为您提供以下快照.
所以现在我们知道我们的数据存储在一个名为"inningsBat1"的表中.如果我们可以将该表的内容提取到Excel文件中,那么我们肯定可以使用这些数据进行分析.下面是将在Sheet1中转储该表的示例代码
在我们继续之前,我建议关闭所有Excel并启动一个新实例.
启动VBA并插入Userform.放置一个命令按钮和一个webcrowser控件.您的Userform可能如下所示
将此代码粘贴到Userform代码区域中
Option Explicit
'~~> Set Reference to Microsoft HTML Object Library
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
Private Sub CommandButton1_Click()
Dim URL As String
Dim oSheet As Worksheet
Set oSheet = Sheets("Sheet1")
URL = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html"
PopulateDataSheets oSheet, URL
MsgBox "Data Scrapped. Please check " & oSheet.Name
End Sub
Public Sub PopulateDataSheets(wsk As Worksheet, URL As String)
Dim tbl As HTMLTable
Dim tr As HTMLTableRow
Dim insertRow As Long, Row As Long, col As Long
On Error GoTo whoa
WebBrowser1.navigate URL
WaitForWBReady
Set tbl = WebBrowser1.Document.getElementById("inningsBat1")
With wsk
.Cells.Clear
insertRow = 0
For Row = 0 To tbl.Rows.Length - 1
Set tr = tbl.Rows(Row)
If Trim(tr.innerText) <> "" Then
If tr.Cells.Length > 2 Then
If tr.Cells(1).innerText <> "Total" Then
insertRow = insertRow + 1
For col = 0 To tr.Cells.Length - 1
.Cells(insertRow, col + 1) = tr.Cells(col).innerText
Next
End If
End If
End If
Next
End With
whoa:
Unload Me
End Sub
Private Sub Wait(ByVal nSec As Long)
nSec = nSec + Timer
While Timer < nSec
DoEvents
Sleep 100
Wend
End Sub
Private Sub WaitForWBReady()
Wait 1
While WebBrowser1.ReadyState <> 4
Wait 3
Wend
End Sub
Run Code Online (Sandbox Code Playgroud)
现在运行您的Userform并单击Command按钮.您会注意到数据被转储到Sheet1中.查看快照
同样,你也可以刮掉其他信息.
2)使用Excel的内置工具从Web获取数据
我相信你正在使用Excel 2007所以我将把它作为一个例子来刮掉上面提到的链接.
导航到Sheet2.现在导航到Data Tab并单击最右侧的"From Web"按钮.查看快照.
在"新建Web查询窗口"中输入URL,然后单击"开始"
上传页面后,通过单击快照中显示的小箭头选择要导入的相关表.完成后,点击"导入"
然后,Excel将询问您要将数据导入的位置.选择相关单元格,然后单击"确定".你完成了!数据将导入您指定的单元格.
如果你希望你可以录制一个宏并自动化这个:)
这是我录制的宏.
Sub Macro1()
With ActiveSheet.QueryTables.Add(Connection:= _
"URL;http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html" _
, Destination:=Range("$A$1"))
.Name = "524915"
.FieldNames = True
.RowNumbers = False
.FillAdjacentFormulas = False
.PreserveFormatting = True
.RefreshOnFileOpen = False
.BackgroundQuery = True
.RefreshStyle = xlInsertDeleteCells
.SavePassword = False
.SaveData = True
.AdjustColumnWidth = True
.RefreshPeriod = 0
.WebSelectionType = xlSpecifiedTables
.WebFormatting = xlWebFormattingNone
.WebTables = """inningsBat1"""
.WebPreFormattedTextToColumns = True
.WebConsecutiveDelimitersAsOne = True
.WebSingleBlockTextImport = False
.WebDisableDateRecognition = False
.WebDisableRedirections = False
.Refresh BackgroundQuery:=False
End With
End Sub
Run Code Online (Sandbox Code Playgroud)
希望这可以帮助.如果您仍有疑问,请告诉我.
希德
对于其他对此感兴趣的人,我最终根据Siddhart Rout的早期答案使用了下面的代码
XMLHttp
比自动化快得多 IE
X
变量中) Public Sub PopulateDataSheets_XML()
Dim URL As String
Dim ws As Worksheet
Dim lngRow As Long
Dim lngRecords As Long
Dim lngWrite As Long
Dim lngSpare As Long
Dim lngInnings As Long
Dim lngRow1 As Long
Dim X(1 To 15, 1 To 4) As String
Dim objFSO As Object
Dim objTF As Object
Dim xmlHttp As Object
Dim htmldoc As HTMLDocument
Dim htmlbody As htmlbody
Dim tbl As HTMLTable
Dim tr As HTMLTableRow
Dim strInnings As String
s = Timer()
Set xmlHttp = CreateObject("MSXML2.ServerXMLHTTP")
Set objFSO = CreateObject("scripting.filesystemobject")
X(1, 1) = "http://www.espncricinfo.com/indian-premier-league-2011/engine/match/"
X(1, 2) = 501198
X(1, 3) = 501271
X(1, 4) = "indian-premier-league-2011"
X(2, 1) = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/"
X(2, 2) = 524915
X(2, 3) = 524945
X(2, 4) = "big-bash-league-2011"
X(3, 1) = "http://www.espncricinfo.com/ausdomestic-2010/engine/match/"
X(3, 2) = 461028
X(3, 3) = 461047
X(3, 4) = "big-bash-league-2010"
Set htmldoc = New HTMLDocument
Set htmlbody = htmldoc.body
For lngRow = 1 To UBound(X, 1)
If Len(X(lngRow, 1)) = 0 Then Exit For
Set objTF = objFSO.createtextfile("c:\temp\" & X(lngRow, 4) & ".csv")
For lngRecords = X(lngRow, 2) To X(lngRow, 3)
URL = X(lngRow, 1) & lngRecords & ".html"
xmlHttp.Open "GET", URL
xmlHttp.send
Do While xmlHttp.Status <> 200
DoEvents
Loop
htmlbody.innerHTML = xmlHttp.responseText
objTF.writeline X(lngRow, 1) & lngRecords & ".html"
For lngInnings = 1 To 2
strInnings = "Innings " & lngInnings
objTF.writeline strInnings
Set tbl = Nothing
On Error Resume Next
Set tbl = htmlbody.Document.getElementById("inningsBat" & lngInnings)
On Error GoTo 0
If Not tbl Is Nothing Then
lngWrite = 0
For lngRow1 = 0 To tbl.Rows.Length - 1
Set tr = tbl.Rows(lngRow1)
If Trim(tr.innerText) <> vbNewLine Then
If tr.Cells.Length > 2 Then
If tr.Cells(1).innerText <> "Extras" Then
If Len(tr.Cells(1).innerText) > 0 Then
objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
lngWrite = lngWrite + 1
End If
Else
objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)
lngWrite = lngWrite + 1
Exit For
End If
End If
End If
Next
For lngSpare = 12 To lngWrite Step -1
objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
Next
Else
For lngSpare = 1 To 13
objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)
Next
End If
Next
Next
Next
'Call ConsolidateSheets
End Sub
Run Code Online (Sandbox Code Playgroud)
归档时间: |
|
查看次数: |
24591 次 |
最近记录: |