cov*_*r51 5 string grep r dplyr
Take this sample data:
df <- data.frame(a_1=c("Apple","Grapes","Melon","Peach"),a_2=c("Nuts","Kiwi","Lime","Honey"),a_3=c("Plum","Apple",NA,NA),a_4=c("Cucumber",NA,NA,NA))
a_1 a_2 a_3 a_4
1 Apple Nuts Plum Cucumber
2 Grapes Kiwi Apple <NA>
3 Melon Lime <NA> <NA>
4 Peach Honey <NA> <NA>
Basically, I want to run grep on the last column of each row that is not NA. So the x in grep("pattern", x) should be:
Cucumber
Apple
Lime
Honey
I have an integer vector that tells me which a_N is the last one in each row:
numcol <- rowSums(!is.na(df[,grep("(^a_)\\d", colnames(df))]))
So far I have tried something like the following, in combination with ave(), apply() and dplyr:
grepl("pattern",df[,sprintf("a_%i",numcol)])
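However, I can't get it to work. Keep in mind that my dataset is very large, so I would prefer a vectorized solution or maybe dplyr. Any help is much appreciated.
/e: Thanks, that is a very nice solution. I was overcomplicating things. (The regex is there because of my more specific data.)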
A5C*_*2T1 11
There is no need for a regular expression here. Just use apply + tail + na.omit:
> apply(mydf, 1, function(x) tail(na.omit(x), 1))
[1] "Cucumber" "Apple" "Lime" "Honey"
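To tie this back to the original goal of running grep on those values, here is a minimal sketch. It assumes the question's sample data is stored in mydf with character (not factor) columns, and "Apple" is only a placeholder pattern chosen for illustration, not something taken from the question.

```r
# Sample data from the question, rebuilt with character columns (assumption)
mydf <- data.frame(
  a_1 = c("Apple", "Grapes", "Melon", "Peach"),
  a_2 = c("Nuts", "Kiwi", "Lime", "Honey"),
  a_3 = c("Plum", "Apple", NA, NA),
  a_4 = c("Cucumber", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Last non-NA value in each row, as in the answer above
last_vals <- apply(mydf, 1, function(x) tail(na.omit(x), 1))

# "Apple" is a hypothetical pattern; grepl() is vectorized over last_vals
hits <- grepl("Apple", last_vals)
hits
# [1] FALSE  TRUE FALSE FALSE
```

Note that if a row were entirely NA, tail() would return a zero-length vector and apply() would return a list instead of a character vector, so such rows may need to be handled separately.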
I don't know how this compares in terms of speed, but you could also use a combination of "data.table" and "reshape2", like this:
library(data.table)
library(reshape2)
na.omit(melt(as.data.table(mydf, keep.rownames = TRUE),
id.vars = "rn"))[, value[.N], by = rn]
# rn V1
# 1: 1 Cucumber
# 2: 2 Apple
# 3: 3 Lime
# 4: 4 Honey
Or, even better:
melt(as.data.table(mydf, keep.rownames = TRUE),
id.vars = "rn", na.rm = TRUE)[, value[.N], by = rn]
# rn V1
# 1: 1 Cucumber
# 2: 2 Apple
# 3: 3 Lime
# 4: 4 Honey
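This will be much faster. On an 800k-row dataset, apply took about 50 seconds, while the data.table approach took about 2.5 seconds.
Another alternative that is likely to be very fast: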
DF[cbind(seq_len(nrow(DF)), max.col(!is.na(DF), "last"))]
#[1] "Cucumber" "Apple" "Lime" "Honey"
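For readers unfamiliar with this idiom, the one-liner can be unpacked into separate steps. This is only a sketch on the question's sample data, rebuilt here with character columns as an assumption; the intermediate names not_na, last_col, idx and last_vals are made up for illustration.

```r
# Sample data from the question, rebuilt with character columns (assumption)
DF <- data.frame(
  a_1 = c("Apple", "Grapes", "Melon", "Peach"),
  a_2 = c("Nuts", "Kiwi", "Lime", "Honey"),
  a_3 = c("Plum", "Apple", NA, NA),
  a_4 = c("Cucumber", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE where a cell holds a value
not_na <- !is.na(DF)

# For each row, the column position of the right-most TRUE
# (ties.method = "last" picks the last of the tied maxima)
last_col <- max.col(not_na, ties.method = "last")

# Two-column matrix of (row, column) pairs, one pair per row
idx <- cbind(seq_len(nrow(DF)), last_col)

# Matrix indexing pulls exactly one cell per row
last_vals <- DF[idx]

# When each row is filled from the left without gaps, the question's
# non-NA count points at the same column positions
numcol <- rowSums(!is.na(DF))
stopifnot(all(numcol == last_col))
```

This also suggests why the attempt in the question did not work: df[, sprintf("a_%i", numcol)] selects whole columns, one per entry of numcol, rather than one cell per row. A row consisting only of NAs should, as far as I can tell, resolve to the last column here and therefore yield NA.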
Where "DF" is:
DF = structure(list(a_1 = structure(1:4, .Label = c("Apple", "Grapes",
"Melon", "Peach"), class = "factor"), a_2 = structure(c(4L, 2L,
3L, 1L), .Label = c("Honey", "Kiwi", "Lime", "Nuts"), class = "factor"),
a_3 = structure(c(2L, 1L, NA, NA), .Label = c("Apple", "Plum"
), class = "factor"), a_4 = structure(c(1L, NA, NA, NA), .Label = "Cucumber", class = "factor")), .Names = c("a_1",
"a_2", "a_3", "a_4"), row.names = c(NA, -4L), class = "data.frame")
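To check how the base R approaches compare on your own machine, a rough timing harness along these lines could be used. Everything here is an assumption made for illustration: the synthetic data, the row count n, and the names big, keep, r_apply and r_index. No timings are claimed; the sketch only verifies that both approaches return the same values.

```r
set.seed(1)
n <- 50000  # hypothetical size; increase to approximate a large dataset
fruits <- c("Apple", "Kiwi", "Lime", "Plum")

# Synthetic data with the same shape as the question's sample
big <- data.frame(
  a_1 = sample(fruits, n, replace = TRUE),
  a_2 = sample(fruits, n, replace = TRUE),
  a_3 = sample(fruits, n, replace = TRUE),
  a_4 = sample(fruits, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Keep a random number of leading columns per row and blank out the rest,
# so every row has at least a_1 and the NAs sit at the end
keep <- sample(1:4, n, replace = TRUE)
for (j in 2:4) big[[j]][keep < j] <- NA

# Row-wise apply() versus vectorized matrix indexing
t_apply <- system.time(
  r_apply <- apply(big, 1, function(x) tail(na.omit(x), 1))
)
t_index <- system.time(
  r_index <- big[cbind(seq_len(n), max.col(!is.na(big), "last"))]
)

# Both approaches should agree before the timings are compared
stopifnot(identical(unname(r_apply), r_index))
print(rbind(apply = t_apply, index = t_index)[, "elapsed"])
```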