Retrieve text from HTML using PowerShell

Lit*_*ish 4 html regex powershell

In this HTML code:

<div id="ajaxWarningRegion" class="infoFont"></div>
  <span id="ajaxStatusRegion"></span>
  <form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" >
    <pre>
      Creating a new ZIP of IP Phone files from HTTP/PhoneBackup 
      and HTTPS/PhoneBackup
    </pre>
    <pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>
    <pre>Reports Success</pre>
    <pre></pre>
    <a href =  /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>
      Download the new ZIP of IP Phone files
    </a>
  </div>

I want to retrieve the text IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip, or just the date and time between IP_PHONE_BACKUP- and .zip.

How can I do that?

Mic*_*ens 9

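What makes this question interesting is that HTML looks and smells just like XML, and XML is far easier to work with programmatically because of its well-behaved, orderly structure. In an ideal world HTML would be a subset of XML, but real-world HTML is emphatically not XML. If you feed the example in the question to any XML parser, it will balk at a variety of infractions. That said, the desired result can be achieved with a single line of PowerShell. This one returns the full text of the href: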

Select-NodeContent $doc.DocumentNode "//a/@href"

This one extracts the desired substring:

Select-NodeContent $doc.DocumentNode "//a/@href" "IP_PHONE_BACKUP-(.*)\.zip"
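To see what the regex argument contributes on its own, here is a minimal sketch that applies the same pattern to the href value as a plain string, with no HTML parsing involved. The variable names are illustrative, and the conversion to a `[datetime]` is an optional extra that assumes the timestamp always has the shape shown in the question.

```powershell
# The href value from the question's HTML, as a plain string.
$href = '/tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip'

# -match fills the automatic $Matches table; index 1 is the first capture group.
if ($href -match 'IP_PHONE_BACKUP-(.*)\.zip') {
    $stamp = $Matches[1]
} else {
    $stamp = ''   # fall back to an empty string when nothing matches
}
$stamp

# Optional: turn the captured text into a DateTime (assumes the format never varies).
$when = [datetime]::ParseExact($stamp, 'yyyy-MMM-dd_HH:mm:ss',
    [System.Globalization.CultureInfo]::InvariantCulture)
```

The invariant culture is used so that the month abbreviation is read as English regardless of the machine's regional settings.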

The catch, though, is the overhead and setup needed before you can run that one line of code. You need to:

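  • Install HtmlAgilityPack, which makes parsing HTML look just like parsing XML.
  • Install the PowerShell Community Extensions (PSCX) if you want to parse live web pages.
  • Understand XPath, so that you can construct a navigable path to the target node.
  • Understand regular expressions, so that you can extract a substring from the target node.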
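As a rough illustration of the XPath item, the sketch below runs the same path, `//a/@href`, against a hand-cleaned, well-formed fragment using PowerShell's built-in `[xml]` accelerator. The fragment is a tidied-up stand-in for the link in the question (quoted attribute, closed tags), so it shows how the path navigates rather than how to parse the original markup, which would not load as XML.

```powershell
# A well-formed stand-in for the question's link (assumption: attribute quoted, tags closed).
$fragment = [xml]@'
<div>
  <a href="/tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip">Download the new ZIP of IP Phone files</a>
</div>
'@

# "//a" finds an <a> element anywhere in the document; "/@href" selects its href attribute.
$attr = $fragment.SelectSingleNode('//a/@href')
$attr.Value
```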

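Once those requirements are met, you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, as shown below. The last part of the code shows how to assign a value to the $doc variable used in the one-liners above. I show how to load the HTML either from a file or from the web, depending on your needs.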

Set-StrictMode -Version Latest
$HtmlAgilityPackPath = [System.IO.Path]::Combine((Get-Item $PROFILE).DirectoryName, "bin\HtmlAgilityPack.dll")
Add-Type -Path $HtmlAgilityPackPath

function Select-NodeContent(
    [HtmlAgilityPack.HtmlNode]$node,
    [string] $xpath,
    [string] $regex,
    [Object] $default = "")
{
    if ($xpath -match "(.*)/@(\w+)$") {
        # If standard XPath to retrieve an attribute is given,
        # map to supported operations to retrieve the attribute's text.
        ($xpath, $attribute) = $matches[1], $matches[2]
        $resultNode = $node.SelectSingleNode($xpath)
        # Native if/else expression; avoids depending on the PSCX "?:" (Invoke-Ternary) alias.
        $text = if ($resultNode) { $resultNode.Attributes[$attribute].Value } else { $default }
    }
    else { # retrieve an element's text
        $resultNode = $node.SelectSingleNode($xpath)
        # Native if/else expression; avoids depending on the PSCX "?:" (Invoke-Ternary) alias.
        $text = if ($resultNode) { $resultNode.InnerText } else { $default }
    }
    # If a regex is given, use it to extract a substring from the text
    if ($regex) {
        if ($text -match $regex) { $text = $matches[1] }
        else { $text = $default }
    }
    return $text
}

$doc = New-Object HtmlAgilityPack.HtmlDocument
$result = $doc.Load("tmp\temp.html") # Use this to load a file
#$result = $doc.LoadHtml((Get-HttpResource $url)) # Use this PSCX cmdlet to load a live web page
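If installing HtmlAgilityPack is not practical, a cruder alternative is to skip HTML parsing and run the regex over the raw text. This is a different technique from the function above: it ignores document structure entirely, so it only suits pages as simple and predictable as this one. The sample text is a trimmed copy of the question's HTML, and the variable names are illustrative.

```powershell
# Trimmed copy of the question's HTML, held as plain text.
$html = @'
<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>
<pre>Reports Success</pre>
<a href =  /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>
  Download the new ZIP of IP Phone files
</a>
'@

# Collect every timestamp found between the prefix and ".zip", then de-duplicate.
# The lazy (.*?) stops at the first ".zip" after each prefix.
$stamps = [regex]::Matches($html, 'IP_PHONE_BACKUP-(.*?)\.zip') |
    ForEach-Object { $_.Groups[1].Value } |
    Select-Object -Unique
$stamps
```

For a file on disk, the text could be read with `Get-Content -Raw` (PowerShell 3.0 or later) in place of the here-string.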