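I have a large data frame with unknown column names and the numeric values 1, 2, 3 or 4. I want to replace every 4 with the name of its column, and replace every 1, 2 and 3 with an empty string.
Of course, I could do this with a loop like the following: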
df <- data.frame(id = 1:8, unknownvarname1 = c(1:4, 1:4), unknownvarname2 = c(4:1, 4:1))
for (i in 2:length(df)) {
  df[, i] <- as.character(df[, i])
  df[, i] <- mgsub::mgsub(df[, i], c(1, 2, 3, 4), c("", "", "", names(df)[i]))
}
The result would be:
  id unknownvarname1 unknownvarname2
1  1                 unknownvarname2
2  2
3  3
4  4 unknownvarname1
5  5                 unknownvarname2
6  6
7  7
8  8 unknownvarname1
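For a data frame of this size, this is no problem at all. But when I run this loop on a large data frame with up to 30k rows and up to 40 unknown variables, it takes quite a while to finish.
Does anyone know a faster way? I tried functions such as mutate() from the dplyr package, but could not get them to work.
Thanks in advance!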
One approach using base R:
# Replace all values in 1:3 with a blank (the id column is left alone)
df[-1][sapply(df[-1], `%in%`, 1:3)] <- ""
# Get the row/column indices where the value is 4
mat <- which(df == 4, arr.ind = TRUE)
# Exclude matches in the first column; drop = FALSE keeps a matrix even for a single match
mat <- mat[mat[, 2] != 1, , drop = FALSE]
# Replace the remaining entries with their corresponding column names
df[mat] <- names(df)[mat[, 2]]
df
#  id unknownvarname1 unknownvarname2
#1  1                 unknownvarname2
#2  2
#3  3
#4  4 unknownvarname1
#5  5                 unknownvarname2
#6  6
#7  7
#8  8 unknownvarname1
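Just to offer another option, using switch (although, since this function is not vectorized, it needs an sapply nested inside an lapply, which makes it neither "pretty" nor efficient...):
Basically, switch with a numeric first argument works as switch(myNumberToTest, caseIfOne, caseIfTwo, ...).
So what you need is: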
df[, 2:3] <- lapply(2:3, function(x) sapply(df[, x], switch, "", "", "", names(df)[x]))
df
#  id unknownvarname1 unknownvarname2
#1  1                 unknownvarname2
#2  2
#3  3
#4  4 unknownvarname1
#5  5                 unknownvarname2
#6  6
#7  7
#8  8 unknownvarname1
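To make the positional behaviour of switch concrete, here is a minimal sketch. The vector `values` and the placeholder string "colname" are made up for illustration and stand in for one column and its name.

```r
# With a numeric first argument, switch() returns the alternative at that position.
first <- switch(1, "one", "two", "three")   # picks the 1st alternative
third <- switch(3, "one", "two", "three")   # picks the 3rd alternative

# switch() only accepts a single value, so a vector needs an explicit loop.
# vapply() is used here instead of sapply() so the result type is fixed.
values <- c(1, 4, 2, 4)                     # hypothetical column contents
labels <- vapply(values,
                 function(v) switch(v, "", "", "", "colname"),
                 character(1))
labels
```

Note that a number with no matching alternative (for example 5 here) makes switch return NULL, so this only works as long as the column really contains nothing but 1 to 4.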
Another base R option, using ifelse inside lapply (this still loops over the columns, but the work within each column is vectorized):
df <- data.frame(id=1:8,unknownvarname1=c(1:4,1:4),unknownvarname2=c(4:1,4:1))
df[,2:3] <- lapply(2:3, function(x) { ifelse(df[,x] < 4, "", colnames(df)[x]) })
which gives
  id unknownvarname1 unknownvarname2
1  1                 unknownvarname2
2  2
3  3
4  4 unknownvarname1
5  5                 unknownvarname2
6  6
7  7
8  8 unknownvarname1
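Since the question says the column names and their number are unknown, the hard-coded 2:3 can be replaced by whatever columns are present. The following is a sketch of that idea, assuming only that the identifier column is called id as in the example; `cols` is a helper name introduced here.

```r
df <- data.frame(id = 1:8,
                 unknownvarname1 = c(1:4, 1:4),
                 unknownvarname2 = c(4:1, 4:1))

# Every column except the id column, whatever it happens to be called.
cols <- setdiff(names(df), "id")

# One vectorized ifelse() per column: a 4 becomes the column name, anything else "".
df[cols] <- lapply(cols, function(nm) ifelse(df[[nm]] == 4, nm, ""))
df
```

Looping over names rather than positions also means the replacement text comes straight from the loop variable.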
Another base R possibility, using sweep:
idx <- df[, -1] == 4
sw <- sweep(idx, 2, 1:2, FUN = '*') + 1
df[, -1] <- c("", colnames(df[, -1]))[sw]
This gives:
> df
  id unknownvarname1 unknownvarname2
1  1                 unknownvarname2
2  2
3  3
4  4 unknownvarname1
5  5                 unknownvarname2
6  6
7  7
8  8 unknownvarname1
This can be simplified to:
sw <- sweep(df[, -1] == 4, 2, 1:2, FUN = '*') + 1
df[, -1] <- c("", colnames(df[, -1]))[sw]
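The sweep approach turns each cell into a position in a small lookup vector. The sketch below spells out the intermediate steps and, as an assumption beyond the original answer, replaces the hard-coded 1:2 with the actual number of value columns; `k` and `lookup` are names introduced here.

```r
df <- data.frame(id = 1:8,
                 unknownvarname1 = c(1:4, 1:4),
                 unknownvarname2 = c(4:1, 4:1))

idx <- df[, -1] == 4                     # logical matrix: TRUE where the value is 4
k <- seq_len(ncol(idx))                  # 1, 2, ..., one number per value column
sw <- sweep(idx, 2, k, FUN = "*") + 1    # 1 where not 4, column position + 1 where 4
lookup <- c("", colnames(df)[-1])        # position 1 is the blank, then the column names

# Indexing lookup with sw yields one string per cell, in column order.
df[, -1] <- lookup[sw]
df
```

This relies on there being at least two value columns, because df[, -1] drops to a plain vector when only one column is left.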
A not very efficient tidyverse option. It is inefficient because the columns have to be selected manually again afterwards:
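library(dplyr)
library(purrr)
to_use <- names(df)[-1]
# First pass: mark every 4 as NA and blank out everything else
df %>%
  mutate_at(vars(contains("unknown")), list(~ ifelse(. == 4, NA, ""))) -> new_df
# Second pass: fill each NA with the name of its column
new_df[-1] <- map2(new_df[-1], to_use, function(x, y) replace(x, is.na(x), y))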
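A less manual approach, though it has a drawback of its own: the id column is overwritten as well and has to be rebuilt from the row names: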
df %>%
  map2(., names(.), function(x, y) ifelse(x == 4, y, "")) %>%
  as.data.frame() %>%
  mutate(id = row.names(.)) # might be a way around with `.id`
  id unknownvarname1 unknownvarname2
1  1                 unknownvarname2
2  2
3  3
4  4 unknownvarname1
5  5                 unknownvarname2
6  6
7  7
8  8 unknownvarname1
Result of the first approach:
new_df
  id unknownvarname1 unknownvarname2
1  1                 unknownvarname2
2  2
3  3
4  4 unknownvarname1
5  5                 unknownvarname2
6  6
7  7
8  8 unknownvarname1
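For the mutate() attempt mentioned in the question, a single-pass variant may be possible with across() and cur_column(). This is a sketch rather than part of the answer above, and it assumes dplyr 1.0.0 or later, where both functions are available.

```r
library(dplyr)

df <- data.frame(id = 1:8,
                 unknownvarname1 = c(1:4, 1:4),
                 unknownvarname2 = c(4:1, 4:1))

# across(-id, ...) applies the formula to every column except id;
# cur_column() returns the name of the column currently being processed.
new_df <- df %>%
  mutate(across(-id, ~ ifelse(.x == 4, cur_column(), "")))
new_df
```

Because the id column is excluded by name, it keeps its original integer values and nothing has to be rebuilt afterwards.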
Another option, using col to line up the names with the values:
sel <- df[-1]==4
df[-1] <- ""
df[-1][sel] <- names(df[-1])[col(df[-1])[sel]]
#  id unknownvarname1 unknownvarname2
#1  1                 unknownvarname2
#2  2
#3  3
#4  4 unknownvarname1
#5  5                 unknownvarname2
#6  6
#7  7
#8  8 unknownvarname1
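Since the question is about speed at roughly 30k rows and 40 columns, one rough way to compare approaches is to time them on simulated data of that size. The sketch below does this for two of the base R variants from this thread. The data frame `big`, its generated column names (X1 to X40) and the wrapper functions `by_ifelse` and `by_col` are all made up for illustration. Timings depend on the machine, so none are quoted here.

```r
set.seed(1)
n_rows <- 30000
n_cols <- 40

# Simulated data: an id column plus 40 columns of random values in 1:4.
big <- data.frame(id = seq_len(n_rows),
                  matrix(sample(1:4, n_rows * n_cols, replace = TRUE),
                         n_rows, n_cols))

# Column-wise ifelse(), looping over column names.
by_ifelse <- function(d) {
  cols <- names(d)[-1]
  d[cols] <- lapply(cols, function(nm) ifelse(d[[nm]] == 4, nm, ""))
  d
}

# Logical mask plus col() to look up the matching column name.
by_col <- function(d) {
  sel <- d[-1] == 4
  d[-1] <- ""
  d[-1][sel] <- names(d[-1])[col(d[-1])[sel]]
  d
}

system.time(res_ifelse <- by_ifelse(big))
system.time(res_col <- by_col(big))

# Both variants should agree before their timings are compared.
all.equal(res_ifelse, res_col)
```

For more stable numbers, each call can be repeated several times or run through a dedicated benchmarking package.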