jmr*_*cha 2 html powershell xpath html-parsing html-agility-pack
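This is a follow-up to a question I asked last week, posted here. I've gotten past the original problem, but now I'm running into a slightly different one.
I can now get the attributes of the item I'm interested in with the GetAttributeValue method when the HTML tag is not nested (here, data-pid), but I can't grab values that sit inside nested tags; in my snippet, that is the date. I'm using XPath and the HtmlAgilityPack to parse the HTML, but in the example below the same date is returned over and over.
Here is what the $item object looks like: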
Attributes : {class, data-pid}
ChildNodes : {#text, a, #text, span...}
Closed : True
ClosingAttributes : {}
FirstChild : HtmlAgilityPack.HtmlTextNode
HasAttributes : True
HasChildNodes : True
HasClosingAttributes : False
Id :
InnerHtml : <a href="/mod/4175126893.html" class="i"><span class="price">$20</span></a> <span class="star"></span> <span class="pl"> <span class="date">Nov
30</span> <a href="/mod/4175126893.html">Unlock Any GSM Cell Phone Today!</a> </span> <span class="l2"> <span class="price">$20</span> <span
class="pnr"> <small> (Des Moines)</small> <span class="px"> <span class="p"> </span></span> </span> <a class="gc" href="/mod/"
data-cat="mod">cell phones - by dealer</a> </span>
InnerText : $20 Nov 30 Unlock Any GSM Cell Phone Today! $20 (Des Moines) cell phones - by dealer
LastChild : HtmlAgilityPack.HtmlTextNode
Line : 305
LinePosition : 5408
Name : p
NextSibling : HtmlAgilityPack.HtmlTextNode
NodeType : Element
OriginalName : p
OuterHtml : <p class="row" data-pid="4175126893"> <a href="/mod/4175126893.html" class="i"><span class="price">$20</span></a> <span class="star"></span>
<span class="pl"> <span class="date">Nov 30</span> <a href="/mod/4175126893.html">Unlock Any GSM Cell Phone Today!</a> </span> <span class="l2">
<span class="price">$20</span> <span class="pnr"> <small> (Des Moines)</small> <span class="px"> <span class="p"> </span></span> </span> <a
class="gc" href="/mod/" data-cat="mod">cell phones - by dealer</a> </span> </p>
OwnerDocument : HtmlAgilityPack.HtmlDocument
ParentNode : HtmlAgilityPack.HtmlNode
PreviousSibling : HtmlAgilityPack.HtmlTextNode
StreamPosition : 18733
XPath : /html[1]/body[1]/article[1]/section[1]/div[1]/div[2]/p[11]
Attributes : {class, data-pid}
ChildNodes : {#text, a, #text, span...}
Closed : True
ClosingAttributes : {}
I want to extract data from the OuterHtml value.
OuterHtml : <p class="row" data-latitude="41.5937565437255" data-longitude="-93.6437636649079" data-pid="4184719674"> <a href="/mod/4184719674.html" class="i"></a>
<span class="star"></span> <span class="pl"> <span class="date">Nov 27</span> <a href="/mod/4184719674.html">iPhone and other Cell Phone Unlocks</a>
</span> <span class="l2"> <span class="pnr"> <small> (Des Moines)</small> <span class="px"> <span class="p"> <a href="#" class="maptag"
data-pid="4184719674">map</a></span></span> </span> <a class="gc" href="/mod/" data-cat="mod">cell phones - by dealer</a> </span> </p>
I can grab data-pid without a problem. Here is what the current code looks like:
ForEach ($item in $results) {
    # This is working
    $ID = $item.GetAttributeValue("data-pid", "")
    # This is looping over the same item
    $Date = $item.SelectSingleNode("//span[@class='date']").InnerText
}
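What I'd like to do is use my XPath statement to get values from the different tags contained in the OuterHtml, but I can't figure out how to do it. Is this the best way to approach the problem, or should I use a regular expression to get the values I want?
Let me know if there are other details I should post.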
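On the repeated date: in XPath, an expression that starts with `//` is absolute and searches from the document root no matter which node it is called on, while `.//` searches only the descendants of the current node. The sketch below illustrates the difference with the built-in System.Xml parser rather than the HtmlAgilityPack, on the assumption that the HtmlAgilityPack's SelectSingleNode follows the same XPath rule. The sample markup is a hypothetical, trimmed-down version of the rows shown above.

```powershell
# Hypothetical sample markup, trimmed down from the rows in the question.
[xml]$doc = @'
<div>
  <p class="row" data-pid="4175126893"><span class="pl"><span class="date">Nov 30</span></span></p>
  <p class="row" data-pid="4184719674"><span class="pl"><span class="date">Nov 27</span></span></p>
</div>
'@

$rows = $doc.SelectNodes("//p[@class='row']")

$results = foreach ($item in $rows) {
    # '//' is absolute: the search starts at the document root,
    # so every row yields the first date in the document.
    $absolute = $item.SelectSingleNode("//span[@class='date']").InnerText

    # './/' is relative: the search covers only descendants of the current row.
    $relative = $item.SelectSingleNode(".//span[@class='date']").InnerText

    [pscustomobject]@{
        ID       = $item.GetAttribute('data-pid')
        Absolute = $absolute
        Relative = $relative
    }
}

$results | Format-Table -AutoSize
```

If the same holds for the HtmlAgilityPack, changing the XPath in the loop from `//span[@class='date']` to `.//span[@class='date']` should return each row's own date without resorting to regular expressions.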
I haven't used the HTML Agility Pack, but AFAICS the built-in tools should suffice:
$url  = 'http://www.example.com/path/to/some.html'
$html = (Invoke-WebRequest $url).ParsedHtml

# Select every <p class="row"> element
$html.getElementsByTagName('p') | ? { $_.className -eq 'row' } | % {
    $ID   = $_.getAttributeNode('data-pid').value
    # Search only the <span> elements nested inside the current row
    $Date = $_.getElementsByTagName('span') | ? { $_.className -eq 'date' } |
            % { $_.innerText }
    # do stuff with $ID and $Date
    "{0}: {1}" -f $ID, $Date
}
Note that Invoke-WebRequest requires PowerShell v3. If you're restricted to PowerShell v2, use the Internet Explorer COM object instead:
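$ie = New-Object -COM InternetExplorer.Application
$ie.Navigate($url)
# Wait until the page has finished loading (ReadyState 4 = complete).
# A bare "sleep 100" would pause for 100 seconds per iteration, not 100 ms.
while ($ie.ReadyState -ne 4) { Start-Sleep -Milliseconds 100 }
$html = $ie.Document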
If your HTML is in a local file, replace the Invoke-WebRequest line with the following:
$htmlfile = 'C:\path\to\some.html'
$html = New-Object -COM HTMLFile
$html.write((Get-Content $htmlfile | Out-String))