How to parse XML attributes together with a parent attribute into a data frame in R

Gen*_*nom 2 xml xpath r

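I am trying to parse the nodes and attributes of an XML file. The file contains a set of nodes that carry attributes. The nested XML structure resembles a data frame, and I would like to parse it into one.

Here is a sample file: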

<?xml version="1.0" encoding="UTF-8"?>
<TrackMate version="3.8.0">
  <Model spatialunits="µm" timeunits="sec">
    <AllTracks>
      <Track name="Track_2" TRACK_ID="2" NUMBER_SPOTS="140" NUMBER_GAPS="0" >
        <Edge SPOT_SOURCE_ID="960769" SPOT_TARGET_ID="960778" LINK_COST="0.08756957830926632" />
        <Edge SPOT_SOURCE_ID="958304" SPOT_TARGET_ID="958308" LINK_COST="1.4003359672950089" />
        <Edge SPOT_SOURCE_ID="958316" SPOT_TARGET_ID="958322" LINK_COST="1.6985623204008202" />
      </Track>
      <Track name="Track_145" TRACK_ID="145" NUMBER_SPOTS="141" NUMBER_GAPS="0" >
        <Edge SPOT_SOURCE_ID="961623" SPOT_TARGET_ID="961628" LINK_COST="2.2678642015413755" />
        <Edge SPOT_SOURCE_ID="962122" SPOT_TARGET_ID="962127" LINK_COST="38.20777704254654" />
        <Edge SPOT_SOURCE_ID="961869" SPOT_TARGET_ID="961873" LINK_COST="0.2895609647324684" />
      </Track>
    </AllTracks>
  </Model>
</TrackMate>

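I want to create a data frame that holds all Edge attributes plus the TRACK_ID attribute of each Edge's parent Track. Creating a data frame with all the Edge attributes is easy: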

edges = data.frame(t(data.frame(xml_attrs(xml_find_all(xmlDoc, xpath = paste0('/TrackMate/Model/AllTracks//Edge'))))))
row.names(edges) = NULL

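But then the corresponding track ID is lost. I could solve this with a for loop, but that is usually not the "R way". Is there a simpler solution, for example an XPath query?

So the desired final output would be this data frame:
[image: output data frame]

Edit: This gets closer, but the Track nodes and Edge nodes end up mixed together in a single list.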

xml_find_all(xmlDoc, xpath = paste0('/TrackMate/Model/AllTracks//Edge | /TrackMate/Model/AllTracks/Track'))

Wim*_*pel 5

The "trick" is to get a list of all Edge nodes and work from there. From each Edge node you can select its Track node with XPath, using the ancestor axis.

Libraries used

# load libraries
library(xml2)
library(magrittr)

Sample data

doc <- read_xml('<?xml version="1.0" encoding="UTF-8"?>
<TrackMate version="3.8.0">
  <Model spatialunits="µm" timeunits="sec">
    <AllTracks>
      <Track name="Track_2" TRACK_ID="2" NUMBER_SPOTS="140" NUMBER_GAPS="0" >
        <Edge SPOT_SOURCE_ID="960769" SPOT_TARGET_ID="960778" LINK_COST="0.08756957830926632" />
        <Edge SPOT_SOURCE_ID="958304" SPOT_TARGET_ID="958308" LINK_COST="1.4003359672950089" />
        <Edge SPOT_SOURCE_ID="958316" SPOT_TARGET_ID="958322" LINK_COST="1.6985623204008202" />
      </Track>
      <Track name="Track_145" TRACK_ID="145" NUMBER_SPOTS="141" NUMBER_GAPS="0" >
        <Edge SPOT_SOURCE_ID="961623" SPOT_TARGET_ID="961628" LINK_COST="2.2678642015413755" />
        <Edge SPOT_SOURCE_ID="962122" SPOT_TARGET_ID="962127" LINK_COST="38.20777704254654" />
        <Edge SPOT_SOURCE_ID="961869" SPOT_TARGET_ID="961873" LINK_COST="0.2895609647324684" />
      </Track>
    </AllTracks>
  </Model>
</TrackMate>')

Code

# find all Edge nodes
edge.nodes <- xml_find_all(doc, ".//Edge")
# build the data.frame; xml_find_first() returns one Track ancestor per Edge node
data.frame(TRACK_ID = xml_find_first(edge.nodes, "ancestor::Track") %>% xml_attr("TRACK_ID"),
           SPOT_SOURCE_ID = edge.nodes %>% xml_attr("SPOT_SOURCE_ID"),
           SPOT_TARGET_ID = edge.nodes %>% xml_attr("SPOT_TARGET_ID"),
           LINK_COST = edge.nodes %>% xml_attr("LINK_COST"))

Output

#   TRACK_ID SPOT_SOURCE_ID SPOT_TARGET_ID           LINK_COST
# 1        2         960769         960778 0.08756957830926632
# 2        2         958304         958308  1.4003359672950089
# 3        2         958316         958322  1.6985623204008202
# 4      145         961623         961628  2.2678642015413755
# 5      145         962122         962127   38.20777704254654
# 6      145         961869         961873  0.2895609647324684
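
Since the question asks for all Edge attributes, here is a rough sketch of a variant that does not name each attribute by hand. It assumes every Edge carries the same attributes in the same order, so the per-node vectors returned by xml_attrs() can be stacked with rbind. The small stand-in document, the shortened LINK_COST values, and the names edge_nodes, edge_attrs, and edges are made up for illustration. The last step converts the character columns to numbers, which may or may not be wanted for ID columns.

```r
library(xml2)

# Stand-in document with the same shape as the sample file (illustrative values)
doc <- read_xml('<TrackMate>
  <Model>
    <AllTracks>
      <Track TRACK_ID="2">
        <Edge SPOT_SOURCE_ID="960769" SPOT_TARGET_ID="960778" LINK_COST="0.0875" />
        <Edge SPOT_SOURCE_ID="958304" SPOT_TARGET_ID="958308" LINK_COST="1.4003" />
      </Track>
      <Track TRACK_ID="145">
        <Edge SPOT_SOURCE_ID="961623" SPOT_TARGET_ID="961628" LINK_COST="2.2678" />
      </Track>
    </AllTracks>
  </Model>
</TrackMate>')

edge_nodes <- xml_find_all(doc, ".//Edge")

# xml_attrs() on a node set gives one named character vector per Edge;
# rbind stacks them into a character matrix (assumes identical attribute sets)
edge_attrs <- do.call(rbind, xml_attrs(edge_nodes))

# xml_find_first() keeps one result per Edge, so TRACK_ID lines up row by row
edges <- data.frame(
  TRACK_ID = xml_attr(xml_find_first(edge_nodes, "ancestor::Track"), "TRACK_ID"),
  edge_attrs,
  stringsAsFactors = FALSE
)

# Attributes are read as character; convert numeric-looking columns in place
edges[] <- lapply(edges, type.convert, as.is = TRUE)

print(edges)
```

If some Edge nodes were missing an attribute, the rbind step would misalign or recycle values, and naming the attributes explicitly with xml_attr(), as in the answer above, would be the safer route.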