r - xpathApply on an XMLNodeSet (with the XML package)

Question

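I am trying to use the xpathApply function from the XML package in R to extract certain data from an HTML file. However, after calling xpathApply on some parent nodes of the HTML document, the resulting object has class XMLNodeSet, and I cannot call xpathApply on that object again because I get this error: "Error in UseMethod("xpathApply") : no applicable method for 'xpathApply' applied to an object of class "XMLNodeSet""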

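Here is an R script that reproduces my problem (the example is just a simple table, and I know I could use the readHTMLTable function, but I need lower-level functions because my real HTML is more complex than this simple table):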

library(XML)
y <- htmlParse(htmlfile)
x <- xpathApply(y, "//table/tr")
z <- xpathApply(x, "/td")

This is "htmlfile":

<table>
<tr>
<td> Test1.1 </td> <td> Test1.2 </td>
</tr>
<tr>
<td> Test1.3 </td> <td> Test1.4 </td>
</tr>
</table>

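After using xpathApply, is there any way to keep working on the resulting nodes? Or is there another good alternative for getting at the data inside the nodes?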

Answer 1

ags*_*udy 2

Once you have the list of nodes, you can apply a function to it to extract data from each node, for example xmlValue or xmlGetAttr:

x <- xpathApply(y, "//table/tr")
sapply(x, xmlValue)          ## x is a list of nodes
 " Test1.1  Test1.2 " " Test1.3  Test1.4 "

This is equivalent to:

xpathSApply(y,"//table/tr",xmlValue)
" Test1.1  Test1.2 " " Test1.3  Test1.4 "

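To address the original question directly: as far as I know, the error arises because xpathApply has no method for the node set as a whole, but it can be called on each individual node in it. Below is a minimal sketch, assuming the table from the question is inlined as a string. The names html, rows and cells are made up for illustration, and a relative XPath is used so that the search starts from each tr:

```r
library(XML)

# The example table from the question, inlined so the sketch is self-contained
html <- "<table>
<tr><td> Test1.1 </td> <td> Test1.2 </td></tr>
<tr><td> Test1.3 </td> <td> Test1.4 </td></tr>
</table>"

y <- htmlParse(html, asText = TRUE)

# Node set of all tr elements (class XMLNodeSet)
rows <- getNodeSet(y, "//table/tr")

# Query each tr node separately with a path relative to that node
cells <- lapply(rows, function(tr) {
  trimws(xpathSApply(tr, "./td", xmlValue))
})
```

With this input, cells should be a list holding one character vector of td values per tr. Note the leading ./ in the path: an expression starting with // would search the whole document again rather than just the current row.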

EDIT

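I believe your problem can be solved with the right XPath expression. You should learn to treat an XML file like a database: XPath plays a role similar to an SQL query. It is fast, and many browsers can help you generate the right XPath.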

For example:

 xpathSApply(y,"//table/tr[2]/td[1]",xmlValue) #  second tr and first td
 [1] " Test1.3 "
 xpathSApply(y,"//table/tr[2]/td[3]",xmlValue) #  second tr and third td

EDIT

If the OP wants to reproduce the XML structure (getting the tr and td elements in the same order):

Here is one way, though I don't think it is the most efficient:

nn.trs <- length(xpathSApply(y,"//table/tr",I))
lapply(seq(nn.trs),function(i){
       xpathSApply(y,paste("//table/tr[",i,"]/td",sep=''),xmlValue)
})
[[1]]
[1] " Test1.1 " " Test1.2 "

[[2]]
[1] " Test1.3 " " Test1.4 "

If every tr contains the same number of td elements, you can replace lapply with sapply and get:

    [,1]        [,2]       
[1,] " Test1.1 " " Test1.3 "
[2,] " Test1.2 " " Test1.4 "
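
Because sapply binds each result as a column, the matrix above holds one tr per column. A small sketch of transposing it so that each matrix row corresponds to one tr, assuming the same example table inlined as a string (html, n_rows and m are illustrative names, not part of the original code):

```r
library(XML)

# The example table from the question, inlined so the sketch is self-contained
html <- "<table>
<tr><td> Test1.1 </td> <td> Test1.2 </td></tr>
<tr><td> Test1.3 </td> <td> Test1.4 </td></tr>
</table>"

y <- htmlParse(html, asText = TRUE)

# Number of tr elements in the table
n_rows <- length(getNodeSet(y, "//table/tr"))

# sapply yields one column per tr; t() flips it to one row per tr
m <- t(sapply(seq_len(n_rows), function(i) {
  trimws(xpathSApply(y, paste0("//table/tr[", i, "]/td"), xmlValue))
}))
```

This only works when every tr has the same number of td elements; otherwise sapply falls back to a list and t() will not give a matrix.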

But I think readHTMLTable is the better choice in that case.
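
For comparison, a hedged sketch of the readHTMLTable route on the same example, again assuming the table is inlined as a string (html, tables and tbl are illustrative names). header = FALSE is passed because the example table has no header row:

```r
library(XML)

# The example table from the question, inlined so the sketch is self-contained
html <- "<table>
<tr><td> Test1.1 </td> <td> Test1.2 </td></tr>
<tr><td> Test1.3 </td> <td> Test1.4 </td></tr>
</table>"

y <- htmlParse(html, asText = TRUE)

# readHTMLTable on a parsed document returns a list with one entry per table
tables <- readHTMLTable(y, header = FALSE, stringsAsFactors = FALSE)
tbl <- tables[[1]]
```

Here tbl should be a data frame with one row per tr and one column per td, which is usually enough for regular tables. The lower-level XPath calls remain useful when the HTML is too irregular for that.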
