I am trying to extract information from an XML file from ClinicalTrials.gov. The file is organized as follows:
<clinical_study>
...
<brief_title>
...
<location>
<facility>
<name>
<address>
<city>
<state>
<zip>
<country>
</facility>
<status>
<contact>
<last_name>
<phone>
<email>
</contact>
</location>
<location>
...
</location>
...
</clinical_study>
I can extract all of the location nodes from the XML file with the XML package from CRAN, using the following code:
library(XML)
clinicalTrialUrl <- "http://clinicaltrials.gov/ct2/show/NCT01480479?resultsxml=true"
xmlDoc <- xmlParse(clinicalTrialUrl, useInternalNodes = TRUE)
locations <- xmlToDataFrame(getNodeSet(xmlDoc,"//location"))
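This works well. However, if you look at the data frame, you will notice that xmlToDataFrame collapses everything under <facility> into a single concatenated string. One workaround is to write code that builds the data frame column by column.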
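A column-by-column approach of the kind mentioned above might look roughly like the sketch below. It is only an illustration under stated assumptions: the inline XML is a made-up sample that mirrors the layout shown earlier (the site names and the contact are invented), `first_value` is a hypothetical helper, and the set of columns is an arbitrary choice. Only `xmlParse`, `getNodeSet`, `xpathSApply` and `xmlValue` come from the XML package.

```r
library(XML)

# Hypothetical sample mirroring the <location> layout described above.
# The second location deliberately has no <contact>.
sampleXml <- '<clinical_study>
  <location>
    <facility>
      <name>Site A</name>
      <address><city>Birmingham</city><state>Alabama</state></address>
    </facility>
    <status>Recruiting</status>
    <contact><last_name>Jane Doe</last_name><phone>555-0100</phone></contact>
  </location>
  <location>
    <facility>
      <name>Site B</name>
      <address><city>Berkeley</city><state>California</state></address>
    </facility>
    <status>Withdrawn</status>
  </location>
</clinical_study>'

sampleDoc <- xmlParse(sampleXml, asText = TRUE)
locationNodes <- getNodeSet(sampleDoc, "//location")

# Hypothetical helper: text of the first node matching a path that is
# relative to `node`, or NA when the location has no such element.
first_value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Build the data frame one column at a time, one XPath per column.
sampleLocations <- data.frame(
  name      = vapply(locationNodes, first_value, character(1), path = "./facility/name"),
  city      = vapply(locationNodes, first_value, character(1), path = ".//city"),
  state     = vapply(locationNodes, first_value, character(1), path = ".//state"),
  status    = vapply(locationNodes, first_value, character(1), path = "./status"),
  last_name = vapply(locationNodes, first_value, character(1), path = "./contact/last_name"),
  stringsAsFactors = FALSE
)
print(sampleLocations)
```

The trade-off is that every column has to be listed by hand, but each location always yields exactly one row, and missing elements show up as NA instead of shifting values into the wrong column.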
You can flatten the XML first.
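# Recursively flatten a node into a named list of leaf values.
# Leaf (text) nodes are named after their parent element, e.g. "city".
flatten_xml <- function(x) {
  if (length(xmlChildren(x)) == 0) structure(list(xmlValue(x)), .Names = xmlName(xmlParent(x)))
  else Reduce(append, lapply(xmlChildren(x), flatten_xml))
}
# One single-row data frame per <location>; repeated names such as last_name get a .1 suffix.
dfs <- lapply(getNodeSet(xmlDoc, "//location"), function(x) data.frame(flatten_xml(x)))
# Collect every column name that appears in any location.
allnames <- unique(c(lapply(dfs, colnames), recursive = TRUE))
# Add the missing columns as NA so that all data frames can be row-bound.
df <- do.call(rbind, lapply(dfs, function(df) { df[, setdiff(allnames, colnames(df))] <- NA; df }))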
head(df)
# city state zip country status last_name phone email last_name.1
# 1 Birmingham Alabama 35294 United States Recruiting Louis B Nabors, MD 205-934-1813 bnabors@uab.edu Louis B Nabors, MD
# 2 Mobile Alabama 36604 United States Recruiting Melanie Alford, RN 251-445-9649 malford@usouthal.edu Pamela Francisco, CCRP
# 3 Phoenix Arizona 85013 United States Recruiting Lynn Ashby, MD 602-406-6262 LASHBY@CHW.EDU Lynn Ashby, MD
# 4 Tucson Arizona 85724 United States Recruiting Jamie Holt 520-626-6800 jholt1@email.arizona.edu Baldassarre Stea, MD, PhD
# 5 Little Rock Arkansas 72205 United States Recruiting Wilma Brooks, RN 501-686-8530 ALEubanks@uams.edu Amanda Eubanks, APN
# 6 Berkeley California 94704 United States Withdrawn <NA> <NA> <NA> <NA>
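The last row of that output is all NA in the contact columns because that location has no contact. The padding step that makes this possible can be looked at in isolation with base R alone. The sketch below uses made-up single-row data frames, and the names `parts`, `allCols` and `pad_columns` are hypothetical. It pads with a loop and then reorders the columns, which is a slightly different spelling of the same idea as the `setdiff` assignment above.

```r
# Hypothetical per-location data frames whose columns differ.
parts <- list(
  data.frame(city = "Birmingham", status = "Recruiting", phone = "555-0100",
             stringsAsFactors = FALSE),
  data.frame(city = "Berkeley", status = "Withdrawn",
             stringsAsFactors = FALSE)
)

# Union of all column names, in order of first appearance.
allCols <- unique(unlist(lapply(parts, colnames)))

# Add every missing column as NA, then put the columns in a common order.
pad_columns <- function(d) {
  for (col in setdiff(allCols, colnames(d))) d[[col]] <- NA
  d[, allCols, drop = FALSE]
}

# rbind now succeeds because every piece has the same set of columns.
combined <- do.call(rbind, lapply(parts, pad_columns))
print(combined)
```

Without the padding, `rbind` on data frames stops with an error as soon as two pieces have different column sets, which is why the missing columns are filled in first.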