Extracting text inside an HTML tag from a file

use*_*420 1 excel text-extraction analysis extract web-scraping

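I have a file that I want to extract dates from. It is an HTML source file, so it is full of code and phrases I don't need. I need to extract every instance of the date contained in a specific HTML tag: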

abbr title="((this is the text I need))" data-utime="

What is the easiest way to accomplish this?

Dic*_*ika 6

If you are using Excel VBA, set a reference (Tools - References) to the MSHTML library (it is listed as Microsoft HTML Object Library in the References dialog).

Sub ScrapeDateAbbr()

    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As MSHTML.HTMLGenericElement
    Dim sFile As String, lFile As Long
    Dim sHtml As String

    'read in the file
    lFile = FreeFile
    sFile = "C:/Users/dick/Documents/My Dropbox/Excel/Testabbr.html"
    Open sFile For Input As lFile
    sHtml = Input$(LOF(lFile), lFile)
    Close lFile

    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = sHtml

    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem

End Sub

I'm assuming the source file you mention is local. If you need to download it first, you'll need an additional reference, to MSXML, and this code:

Sub ScrapeDateAbbrDownload()

    Dim xHttp As MSXML2.XMLHTTP
    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As MSHTML.HTMLGenericElement

    Set xHttp = New MSXML2.XMLHTTP
    xHttp.Open "GET", "file:///C:/Users/dick/Documents/My%20Dropbox/Excel/Testabbr.html"
    xHttp.send

    Do
        DoEvents
    Loop Until xHttp.readyState = 4

    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = xHttp.responseText

    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem

End Sub
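
Both routines above print each title to the Immediate window. Since the question is tagged excel, here is a rough sketch of how the same loop could fill a worksheet column instead. `AbbrTitles` and `WriteAbbrTitlesToSheet` are hypothetical names, not part of the answer above. The sketch assumes the same MSHTML reference and the same placeholder file path, and it declares the loop variable `As Object` (late bound), which is my choice rather than the original's.

```vba
' Hypothetical helper (not in the original answer): collects the title of
' every <abbr> element that carries a data-utime attribute.
Function AbbrTitles(ByVal sHtml As String) As Collection

    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As Object
    Dim colTitles As Collection

    Set colTitles = New Collection
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = sHtml

    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            '& "" turns a missing (Null) title into an empty string
            colTitles.Add hElem.getAttribute("title") & ""
        End If
    Next hElem

    Set AbbrTitles = colTitles

End Function

Sub WriteAbbrTitlesToSheet()

    Dim sFile As String, lFile As Long
    Dim sHtml As String
    Dim colTitles As Collection
    Dim i As Long

    'read in the file (same placeholder path as above)
    lFile = FreeFile
    sFile = "C:/Users/dick/Documents/My Dropbox/Excel/Testabbr.html"
    Open sFile For Input As lFile
    sHtml = Input$(LOF(lFile), lFile)
    Close lFile

    Set colTitles = AbbrTitles(sHtml)

    'one title per row in column A of the active sheet
    For i = 1 To colTitles.Count
        ActiveSheet.Cells(i, 1).Value = colTitles(i)
    Next i

End Sub
```

Excel may convert strings it recognises as dates when they land in a cell. If you need the text exactly as it appears in the title attribute, format column A as Text before running the routine.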