Lit*_*ish 4 html regex powershell
In this HTML code:
<div id="ajaxWarningRegion" class="infoFont"></div>
<span id="ajaxStatusRegion"></span>
<form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" >
<pre>
Creating a new ZIP of IP Phone files from HTTP/PhoneBackup
and HTTPS/PhoneBackup
</pre>
<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>
<pre>Reports Success</pre>
<pre></pre>
<a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>
Download the new ZIP of IP Phone files
</a>
</div>
I want to retrieve the text IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip, or just the date and time between IP_PHONE_BACKUP- and .zip.
How can I do that?
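What makes this question so interesting is that HTML looks and smells just like XML, and XML is far easier to work with programmatically because of its well-behaved, orderly structure. In an ideal world HTML would be a subset of XML, but real-world HTML is emphatically not XML. If you feed the example in the question to any XML parser, it will balk at a variety of infractions.
That being said, the desired result can be achieved with a single line of PowerShell. This one returns the whole text of the href: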
Select-NodeContent $doc.DocumentNode "//a/@href"
This one extracts the desired substring:
Select-NodeContent $doc.DocumentNode "//a/@href" "IP_PHONE_BACKUP-(.*)\.zip"
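The catch, however, is the overhead/setup required to be able to run that one line of code. You need the HtmlAgilityPack library, PSCX (PowerShell Community Extensions) if you want to load a live web page, and a working knowledge of XPath and regular expressions to build the two string arguments.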
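Once those requirements are satisfied, you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, shown below. The very end of the code shows how to assign a value to the $doc variable used in the one-liners above. I show how to load HTML either from a file or from the web, depending on your needs.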
Set-StrictMode -Version Latest

# Load HtmlAgilityPack from a "bin" folder next to the PowerShell profile.
$HtmlAgilityPackPath = [System.IO.Path]::Combine((Get-Item $PROFILE).DirectoryName, "bin\HtmlAgilityPack.dll")
Add-Type -Path $HtmlAgilityPackPath

function Select-NodeContent(
    [HtmlAgilityPack.HtmlNode]$node,
    [string] $xpath,
    [string] $regex,
    [Object] $default = "")
{
    $text = $default
    if ($xpath -match "(.*)/@(\w+)$") {
        # If standard XPath to retrieve an attribute is given,
        # map to supported operations to retrieve the attribute's text.
        ($xpath, $attribute) = $matches[1], $matches[2]
        $resultNode = $node.SelectSingleNode($xpath)
        if ($null -ne $resultNode) {
            # Guard against a missing attribute so strict mode does not throw.
            $attributeNode = $resultNode.Attributes[$attribute]
            if ($null -ne $attributeNode) { $text = $attributeNode.Value }
        }
    }
    else { # retrieve an element's text
        $resultNode = $node.SelectSingleNode($xpath)
        if ($null -ne $resultNode) { $text = $resultNode.InnerText }
    }
    # If a regex is given, use it to extract a substring from the text
    if ($regex) {
        if ($text -match $regex) { $text = $matches[1] }
        else { $text = $default }
    }
    return $text
}

$doc = New-Object HtmlAgilityPack.HtmlDocument
# Use this to load a file. Resolve-Path makes the relative path absolute,
# because .NET resolves relative paths against the process working directory.
$doc.Load((Resolve-Path "tmp\temp.html").Path)
#$doc.LoadHtml((Get-HttpResource $url)) # Use this PSCX cmdlet to load a live web page
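The regex step that Select-NodeContent applies can also be tried on its own, which may help when checking the pattern before wiring it to the parser. What follows is a minimal sketch in plain PowerShell with no HtmlAgilityPack involved, not part of the answer's function. The `$href` value is copied from the question's sample; `$timestamp`, `$format`, `$culture` and `$when` are names chosen for illustration, and the `ParseExact` format string is an assumption about the timestamp layout, inferred from that single sample.

```powershell
# Sample href value copied from the question's HTML.
$href = '/tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip'

$timestamp = ''   # fallback, in the spirit of the function's $default
$when = $null

# -match fills the automatic $matches table; group 1 is the text
# between "IP_PHONE_BACKUP-" and ".zip".
if ($href -match 'IP_PHONE_BACKUP-(.*)\.zip') {
    $timestamp = $matches[1]

    # Assumed layout: yyyy-MMM-dd_HH:mm:ss with English month abbreviations,
    # hence the invariant culture.
    $format  = 'yyyy-MMM-dd_HH:mm:ss'
    $culture = [System.Globalization.CultureInfo]::InvariantCulture
    $when    = [datetime]::ParseExact($timestamp, $format, $culture)
}
```

With the sample value, `$timestamp` holds the text between the prefix and the extension, and `$when` is a `DateTime` that can be compared or reformatted. The greedy `(.*)` is adequate here because the string contains a single `.zip`; a string with several occurrences would call for a tighter pattern.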