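Suppose we have a table 'data' that contains strings in several columns. We want to find the indices of all rows that contain a certain value or, better yet, one of several values. However, the column is not known in advance.
What I am doing at the moment is: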
apply(df, 2, function(x) which(x == "M017"))
where df =
1 04.10.2009 01:24:51 M017 <NA> <NA> NA
2 04.10.2009 01:24:53 M018 <NA> <NA> NA
3 04.10.2009 01:24:54 M051 <NA> <NA> NA
4 04.10.2009 01:25:06 <NA> M016 <NA> NA
5 04.10.2009 01:25:07 <NA> M015 <NA> NA
6 04.10.2009 01:26:07 <NA> M017 <NA> NA
7 04.10.2009 01:26:27 <NA> M017 <NA> NA
8 04.10.2009 01:27:23 <NA> M017 <NA> NA
9 04.10.2009 01:27:30 <NA> M017 <NA> NA
10 04.10.2009 01:27:32 M017 <NA> <NA> NA
11 04.10.2009 01:27:34 M051 <NA> <NA> NA
This also works if we try to look for more than one value:
apply(df, 2, function(x) which(x %in% c("M017", "M018")))
The result is:
$`1`
integer(0)
$`2`
[1] 1 2 20
$`3`
[1] 16 17 18 19
$`4`
integer(0)
$`5`
integer(0)
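However, processing the resulting list of per-column index vectors is rather tedious.
Is there a more efficient way to find the rows that contain a value (or one of several values) in any column?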
kon*_*vas 25
How about
apply(df, 1, function(r) any(r %in% c("M017", "M018")))
The i-th element of the result will be TRUE if the i-th row contains one of the values, and FALSE otherwise. Or, if you only want the row numbers, wrap the statement above in which(...).
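A minimal sketch of the which(...) variant, assuming a small illustrative data frame modeled on the one in the question (the names `targets`, `hit` and `rows` are made up for this example):

```r
# Small illustrative data frame (hypothetical, modeled on the question's table)
df <- data.frame(
  datetime = c("04.10.2009 01:24:51", "04.10.2009 01:24:53",
               "04.10.2009 01:24:54", "04.10.2009 01:25:06"),
  col1 = c("M017", "M018", "M051", NA),
  col2 = c(NA, NA, NA, "M016"),
  stringsAsFactors = FALSE
)
targets <- c("M017", "M018")

# Logical vector: TRUE where the row contains any of the target values
hit <- apply(df, 1, function(r) any(r %in% targets))

# Row numbers of the matching rows (names dropped for readability)
rows <- unname(which(hit))
rows
```

Because %in% never returns NA, rows that contain only missing values simply come out as FALSE.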
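If you want to find the rows that have any of the values in a vector, one option is to loop over the vector (lapply(v1, ..)) and create a logical index (TRUE/FALSE) for each value with (==). Use Reduce with OR (|) to combine the list into a single logical matrix by checking corresponding elements. Then sum by row (rowSums) and double-negate (!!) to get the rows with any match.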
indx1 <- !!rowSums(Reduce(`|`, lapply(v1, `==`, df)), na.rm=TRUE)
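The one-liner above can be unpacked step by step. This is only a sketch on a tiny made-up data frame; `per_value`, `any_value` and `indx` are hypothetical names used for illustration:

```r
# Tiny illustrative data frame with missing values (hypothetical)
df <- data.frame(col1 = c("M017", "M051", NA),
                 col2 = c(NA, "M018", "M016"),
                 stringsAsFactors = FALSE)
v1 <- c("M017", "M018")

# Step 1: one logical matrix per search value (NA where the cell is NA)
per_value <- lapply(v1, `==`, df)

# Step 2: element-wise OR across the list gives a single logical matrix
any_value <- Reduce(`|`, per_value)

# Step 3: count matches per row, ignoring NA, and coerce to logical with !!
indx <- !!rowSums(any_value, na.rm = TRUE)
indx
```

The na.rm=TRUE matters here: comparing with == yields NA for missing cells, and without it any row containing an NA would not give a clean TRUE/FALSE.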
Or vectorize and use which with arr.ind=TRUE to get the row indices:
indx2 <- unique(which(Vectorize(function(x) x %in% v1)(df),
arr.ind=TRUE)[,1])
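To see what which(..., arr.ind=TRUE) contributes, here is a small sketch on a made-up character matrix (the names `m`, `hits`, `idx` and `rows` are assumptions for this example, not part of the answer's code):

```r
# Made-up 3x2 character matrix, filled column by column
m <- matrix(c("M017", "M051", NA,
              NA,     "M018", "M016"), ncol = 2)
v1 <- c("M017", "M018")

# %in% drops the dimensions, so rebuild the logical matrix
hits <- matrix(m %in% v1, nrow = nrow(m))

# arr.ind = TRUE returns one (row, col) pair per TRUE cell
idx <- which(hits, arr.ind = TRUE)

# Keep each matching row once, even if it matched in several columns
rows <- unique(idx[, "row"])
rows
```

The unique() call is what collapses a row that matches in more than one column into a single index.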
I did not include @kristang's solution because it gave me errors. Based on a 1000x500 matrix, @konvas's solution is the most efficient (so far). However, this may change as the number of rows increases.
val <- paste0('M0', 1:1000)
set.seed(24)
df1 <- as.data.frame(matrix(sample(c(val, NA), 1000*500,
replace=TRUE), ncol=500), stringsAsFactors=FALSE)
set.seed(356)
v1 <- sample(val, 200, replace=FALSE)
konvas <- function() {apply(df1, 1, function(r) any(r %in% v1))}
akrun1 <- function() {!!rowSums(Reduce(`|`, lapply(v1, `==`, df1)),
na.rm=TRUE)}
akrun2 <- function() {unique(which(Vectorize(function(x) x %in%
v1)(df1),arr.ind=TRUE)[,1])}
library(microbenchmark)
microbenchmark(konvas(), akrun1(), akrun2(), unit='relative', times=20L)
#Unit: relative
#     expr       min         lq       mean     median         uq      max neval cld
# konvas()   1.00000   1.000000   1.000000   1.000000   1.000000  1.00000    20  a
# akrun1() 160.08749 147.642721 125.085200 134.491722 151.454441 52.22737    20   b
# akrun2()   5.85611   5.641451   4.676836   5.330067   5.269937  2.22255    20  a
For ncol = 10, the results are slightly different:
expr min lq mean median uq max neval
konvas() 3.116722 3.081584 2.90660 2.983618 2.998343 2.394908 20
akrun1() 27.587827 26.554422 22.91664 23.628950 21.892466 18.305376 20
akrun2() 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000 20
v1 <- c('M017', 'M018')
df <- structure(list(datetime = c("04.10.2009 01:24:51",
"04.10.2009 01:24:53",
"04.10.2009 01:24:54", "04.10.2009 01:25:06", "04.10.2009 01:25:07",
"04.10.2009 01:26:07", "04.10.2009 01:26:27", "04.10.2009 01:27:23",
"04.10.2009 01:27:30", "04.10.2009 01:27:32", "04.10.2009 01:27:34"
), col1 = c("M017", "M018", "M051", "<NA>", "<NA>", "<NA>", "<NA>",
"<NA>", "<NA>", "M017", "M051"), col2 = c("<NA>", "<NA>", "<NA>",
"M016", "M015", "M017", "M017", "M017", "M017", "<NA>", "<NA>"
), col3 = c("<NA>", "<NA>", "<NA>", "<NA>", "<NA>", "<NA>", "<NA>",
"<NA>", "<NA>", "<NA>", "<NA>"), col4 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA)), .Names = c("datetime", "col1", "col2",
"col3", "col4"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10", "11"))
Here is a dplyr option:
library(dplyr)
# across all columns:
df %>% filter_all(any_vars(. %in% c('M017', 'M018')))
# or in only select columns:
df %>% filter_at(vars(col1, col2), any_vars(. %in% c('M017', 'M018')))