I need to extract some data from an XML file like the one below (simplified for brevity):
<Doc name="Doc1">
  <Lists Count="1">
    <List Name="List1">
      <Points Count="3">
        <Point Id="1">
          <Tags Count ="1">"a"</Tags>
          <Point Position="1" />
        </Point>
        <Point Id="2">
          <Point Position="2" />
        </Point>
        <Point Id="3">
          <Tags Count="1">"c"</Tags>
          <Point Position="3" />
        </Point>
      </Points>
    </List>
  </Lists>
</Doc>
The output should be a data frame that matches the tag and position to each point Id:
  Point  Tag Position
1     1    a        1
2     2 <NA>        2
3     3    c        3
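I am new to XML and I am using the xml2 package. So far I can extract each variable on its own, but because some points may have no Tags data, the extracted vectors have different lengths and I cannot find a way to match the three values to each other.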
> library(xml2)
> xml_data<-read_xml(...)
> xml_data %>% xml_find_all("//Points/Point") %>% xml_attr("Id")
[1] "1" "2" "3"
> xml_data %>% xml_find_all("//Point/Point") %>% xml_attr("Position")
[1] "1" "2" "3"
> xml_data %>% xml_find_all("//Tags") %>% xml_text()
[1] "\"a\"" "\"c\""
purrr and xml2 work well together:
library(xml2)
library(purrr)

txt <- '<Doc name="Doc1">
  <Lists Count="1">
    <List Name="List1">
      <Points Count="3">
        <Point Id="1">
          <Tags Count ="1">"a"</Tags>
          <Point Position="1" />
        </Point>
        <Point Id="2">
          <Point Position="2" />
        </Point>
        <Point Id="3">
          <Tags Count="1">"c"</Tags>
          <Point Position="3" />
        </Point>
      </Points>
    </List>
  </Lists>
</Doc>'

doc <- read_xml(txt)
xml_find_all(doc, ".//Points/Point") %>%
  map_df(function(x) {
    list(
      Point=xml_attr(x, "Id"),
      Tag=xml_find_first(x, ".//Tags") %>% xml_text() %>% gsub('^"|"$', "", .),
      Position=xml_find_first(x, ".//Point") %>% xml_attr("Position")
    )
  })
## # A tibble: 3 × 3
##   Point Tag   Position
##   <chr> <chr> <chr>
## 1 1     a     1
## 2 2     <NA>  2
## 3 3     c     3
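As a possible alternative, the same per-point matching can be sketched with xml2 alone. This relies on the documented behaviour that `xml_find_first()` applied to a nodeset returns one result per input node, with a missing node where nothing matches, so the Tag vector keeps its alignment with the Ids. The variable names `points`, `tags`, and `result` are illustrative only, and the child-only XPaths `./Tags` and `./Point` assume the simplified structure from the question.

```r
library(xml2)

# Same simplified document as in the question.
txt <- '<Doc name="Doc1">
  <Lists Count="1">
    <List Name="List1">
      <Points Count="3">
        <Point Id="1">
          <Tags Count ="1">"a"</Tags>
          <Point Position="1" />
        </Point>
        <Point Id="2">
          <Point Position="2" />
        </Point>
        <Point Id="3">
          <Tags Count="1">"c"</Tags>
          <Point Position="3" />
        </Point>
      </Points>
    </List>
  </Lists>
</Doc>'

doc <- read_xml(txt)

# Select only the outer <Point> nodes; the nested <Point Position="..."/>
# elements are children of these and are not matched here.
points <- xml_find_all(doc, ".//Points/Point")

# One result per outer point; xml_text() yields NA where <Tags> is absent.
tags <- xml_text(xml_find_first(points, "./Tags"))

result <- data.frame(
  Point    = xml_attr(points, "Id"),
  Tag      = gsub('^"|"$', "", tags),  # strip the surrounding double quotes
  Position = xml_attr(xml_find_first(points, "./Point"), "Position"),
  stringsAsFactors = FALSE
)
print(result)
```

All three columns come back as character vectors; wrap them in `as.integer()` if numeric Ids or positions are needed.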