I have a dataset, Data, as follows:
dput(Data)
structure(list(FN = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "20131202-0985 ", class = "factor"), Values = structure(c(1L,
8L, 7L, 6L, 5L, 9L, 2L, 4L, 3L), .Label = c("|639778|21|NANYANG CIRCLE|103.686721631628|1.34640300329567",
"|8121|B01|SOMERSET STN", "|96942883", "|SN30|SMRT\n", "CENTRAL",
"FOUR SEASONS HOTEL", "HOTEL", "IKEA", "nanyang avenue"), class = "factor"),
IND = structure(c(4L, 1L, 1L, 1L, 1L, 6L, 3L, 2L, 5L), .Label = c("BN",
"BR", "BS", "LOC", "PN", "RN"), class = "factor")), .Names = c("FN",
"Values", "IND"), class = "data.frame", row.names = c(NA, -9L
))
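I want to convert the above dataset into a data frame (out_data) in the following format. Data currently has 3 columns, and these need to be converted into the 16 columns below. I need to reshape my input into exactly the data frame shown in the screenshot. I cannot change the following structure: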
colnames(out_data) <- c("FN","H_BLK","S_N/R_N","B_N","FL_N","U_N","PC","XC","YC","BS","BRF","LCT_DEC","BRN","BO","PN","S_TY_CD")

The multi-value Values column in the input is always in one of the following formats:
|639778|21|NANYANG CIRCLE|103.686721631628|1.34640300329567 --> |PC|H_BLK|S_N/R_N|XC|YC
|8121|B01|SOMERSET STN --> |BS|BRF|LCT_DEC
|SN30|SMRT --> |BRN|BO
If
IND = LOC then `|PC|H_BLK|S_N/R_N|XC|YC` should be updated, with S_TY_CD=LOC
IND = BN then the B_N column should be updated, with S_TY_CD=BN
IND = RN then the S_N/R_N column should be updated, with S_TY_CD=RN
IND = BS then `|BS|BRF|LCT_DEC` should be updated, with S_TY_CD=BS
IND = BR then `|BRN|BO` should be updated, with S_TY_CD=BR
IND = PN then the PN column should be updated, with S_TY_CD=PN
Is there an efficient way to do this?
Here is one way to do the transformation. First, I define some helper functions for the various sub-problems.
# define the output columns, in their required order
outcols <- c("FN", "H_BLK", "S_N/R_N", "B_N", "FL_N", "U_N", "PC",
             "XC", "YC", "BS", "BRF", "LCT_DEC", "BRN", "BO", "PN", "S_TY_CD")

# identify the parts of each compound value: name the split values
# according to the IND code of the row
namevals <- function(ind, vals) {
  nm <- if (ind == "LOC") {
    c("PC", "H_BLK", "S_N/R_N", "XC", "YC")
  } else if (ind == "BN") {
    c("B_N")
  } else if (ind == "RN") {
    c("S_N/R_N")
  } else if (ind == "BS") {
    c("BS", "BRF", "LCT_DEC")
  } else if (ind == "BR") {
    c("BRN", "BO")
  } else if (ind == "PN") {
    c("PN")
  }
  # the number of values must match the number of expected fields
  stopifnot(length(nm) == length(vals))
  stopifnot(all(nm %in% outcols))
  names(vals) <- nm
  vals
}

# add missing values for a row: NA in every column not supplied
fillrow <- function(nvals) {
  r <- rep(NA, length(outcols))
  r[match(names(nvals), outcols)] <- nvals
  r
}
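To see what the two helpers produce for a single row, here is a minimal self-contained sketch. It is a lookup-table variant rather than the code above: the names `fields`, `namevals2` and `fillrow2` are hypothetical and introduced only for illustration, and the trailing space in the FN label is dropped for readability.

```r
# Hypothetical lookup-table variant of the helpers (names are illustrative)
outcols <- c("FN", "H_BLK", "S_N/R_N", "B_N", "FL_N", "U_N", "PC",
             "XC", "YC", "BS", "BRF", "LCT_DEC", "BRN", "BO", "PN", "S_TY_CD")

# Which output columns each IND code fills, in the order the values appear
fields <- list(
  LOC = c("PC", "H_BLK", "S_N/R_N", "XC", "YC"),
  BN  = "B_N",
  RN  = "S_N/R_N",
  BS  = c("BS", "BRF", "LCT_DEC"),
  BR  = c("BRN", "BO"),
  PN  = "PN"
)

# Name the split values for one row; stop on an unknown code or a wrong count
namevals2 <- function(ind, vals) {
  nm <- fields[[ind]]
  stopifnot(!is.null(nm), length(nm) == length(vals))
  setNames(vals, nm)
}

# Spread the named values over all output columns, NA everywhere else
fillrow2 <- function(nvals) {
  r <- setNames(rep(NA_character_, length(outcols)), outcols)
  r[names(nvals)] <- nvals
  r
}

# One BS row from the sample data
row <- fillrow2(c(FN = "20131202-0985",
                  namevals2("BS", c("8121", "B01", "SOMERSET STN")),
                  S_TY_CD = "BS"))
row[!is.na(row)]
```

Keeping the mapping in a list means a new IND code only needs a new list entry, at the cost of departing from the if/else chain used in the answer.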
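Now I apply these to each row of the data with mapply, which returns one character vector per row. Here we make sure to split the Values column on the pipe character and to remove the leading pipe.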
# combine rows into a character matrix (one column per input row)
dt <- mapply(function(fn, vals, ind) {
    x <- c(FN = fn, namevals(ind, vals), "S_TY_CD" = ind)
    fillrow(x)
  },
  as.character(Data$FN),
  # strip the leading pipe, then split on literal pipes
  strsplit(gsub("^\\|", "", as.character(Data$Values)), "|", fixed = TRUE),
  as.character(Data$IND)
)
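One detail in the sample data: the value `|SN30|SMRT\n` ends with a newline and the FN label `20131202-0985 ` ends with a space, and both survive the split step unchanged. If that is not wanted, a minimal sketch of trimming each piece after splitting follows. It assumes that stripping surrounding whitespace is acceptable for your data and that `trimws` is available (R 3.2.0 or later).

```r
# Two raw values copied from the dput() output in the question
vals <- c("|SN30|SMRT\n", "|8121|B01|SOMERSET STN")

# Drop the leading pipe, split on literal pipes, then trim each piece
parts <- lapply(strsplit(sub("^\\|", "", vals), "|", fixed = TRUE), trimws)
parts[[1]]

# The same trimming applied to the FN label
fn <- trimws("20131202-0985 ")
```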
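Finally, we tidy up the data so that it can be written to a file with write.table. Note that all missing values are true R NA values. In write.table, you can set na = "" if you want them printed as blank values instead of the default "NA".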
# turn the matrix into a data.frame with proper names
dd <- data.frame(unname(t(dt)), stringsAsFactors = FALSE)
names(dd) <- outcols
dd
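The write.table step mentioned above could look roughly like this. It is a sketch on a toy data frame: `dd_demo`, its values, the temporary file path and the comma separator are all assumptions made for illustration, not part of the original data or requirements.

```r
# Toy two-row data frame standing in for dd (values are illustrative only)
dd_demo <- data.frame(FN  = c("a", "b"),
                      B_N = c("IKEA", NA),
                      PN  = c(NA, "96942883"),
                      stringsAsFactors = FALSE)

out_file <- tempfile(fileext = ".txt")  # hypothetical output path

# na = "" writes missing values as empty fields instead of the string NA
write.table(dd_demo, out_file, sep = ",", na = "",
            quote = FALSE, row.names = FALSE)

# Inspect what was written
written <- readLines(out_file)
written
```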