one*_*one 10 r vector sequence run-length-encoding
我有以下带有 0 和 1 的向量:
test1 <- c(rep(0,20),rep(1,5),rep(0,10),rep(1,15))
test1
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
^
test2 <- c(rep(0,8),rep(1,4),rep(0,5),rep(1,5),rep(0,6),rep(1,10),rep(0,2))
test2
[1] 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0
^
Run Code Online (Sandbox Code Playgroud)
我需要找到最后一个 1 序列中第一个 1 的索引(在上面的代码中用 ^ 表示)。我有一个解决方案(如下),但性能不佳,我该如何提高性能?
对于test1和test2,预期输出分别为 36 和 29。
这是一个次优解决方案:
temp1 <- cumsum(test1)
which(temp1==max(temp1[duplicated(temp1)&temp1!=max(temp1)]+1))[1]
[1] 36
temp2 <- cumsum(test2)
which(temp2==max(temp2[duplicated(temp2)&temp2!=max(temp2)]+1))[1]
[1] 29
Run Code Online (Sandbox Code Playgroud)
注意:实际向量的长度约为 10k。
M.V*_*ing 19
该data.table库有一个名为 的非导出函数data.table:::uniqlist(list(x))。使用三个冒号:::访问非导出函数。此函数确定 data.frame 的列何时更改值并返回更改的索引。
data.table:::uniqlist(list(test1))
# [1] 1 21 26 36
Run Code Online (Sandbox Code Playgroud)
@Arun 在这里讨论uniqlist: /sf/answers/1488749811/
然后我使用y[length(y)]查找向量中最后一项的方法,并使用基数ifelse()检查最后一个索引是否包含 1,否则倒数第二个索引必须包含 1。
fx <- function(x) {
y <- data.table:::uniqlist(list(x))
ifelse(x[y[length(y)]] == 1, y[length(y)], y[length(y) - 1])
}
Run Code Online (Sandbox Code Playgroud)
zx8*_*754 18
使用rle:
r <- rle(test1)
ix <- max(which(r$values == 1))
sum(r$lengths[ 1:(ix - 1) ]) + 1
# [1] 36
r <- rle(test2)
ix <- max(which(r$values == 1))
sum(r$lengths[ 1:(ix - 1) ]) + 1
# [1] 29
Run Code Online (Sandbox Code Playgroud)
Maë*_*aël 18
另一种方式是使用which+ diff。
idx <- which(test1 == 1)
idx[tail(which(c(0, diff(idx)) != 1), 1)]
#[1] 36
Run Code Online (Sandbox Code Playgroud)
G. *_*eck 16
运行 rle,然后使用 cumsum 计算每次运行的结束位置,减去长度并加 1 以获得起始位置,然后将其减少到仅 1 的运行,最后获取最后一个元素。这给出了最后一次 1 的开始位置,但如果你想要的话:
-lengths+1==1为==0tail为head如果没有 1,则返回零长度数字向量。
with(rle(test1), tail((cumsum(lengths) - lengths + 1)[values == 1], 1))
Run Code Online (Sandbox Code Playgroud)
one*_*one 14
为了完整起见,这里是大小为 30001 的向量的基准。如果需要,请随意更新。
\nx <- c(rep(0,14736),rep(1,413),rep(0,830),rep(1,961),rep(0,274),rep(1,12787))\n\n\nmicrobenchmark::microbenchmark(rle_zx8754(x),\n rle_Grothendieck(x),\n which_diff_Ma\xc3\xabl(x),\n uniqlist_Viking(x),\n while_Ritchie(x),\n #Position_Ritchie(x),\n #detect_index_Ritchie(x),\n diff_Thomas(x),\n #regex_Thomas(x),\n #regexpr_Thomas(x),\n times = 1000, check='equal')\n\n\n\nUnit: microseconds\n expr min lq mean median uq\n rle_zx8754(x) 339.5 350.45 783.9827 357.45 375.15\n rle_Grothendieck(x) 352.7 364.75 616.2324 372.60 391.75\n which_diff_Ma\xc3\xabl(x) 264.2 274.60 404.5521 279.50 292.00\n uniqlist_Viking(x) 16.7 22.30 32.1502 25.40 30.65\n while_Ritchie(x) 777.6 785.60 1021.0738 801.95 847.15\n diff_Thomas(x) 279.4 286.90 500.6373 291.20 306.35\n max neval cld\n 156630.3 1000 cd\n 11196.5 1000 bc \n 7263.2 1000 b \n 3524.9 1000 a \n 6739.7 1000 d\n 9435.5 1000 b \nRun Code Online (Sandbox Code Playgroud)\n功能:
\nx <- c(rep(0,14736),rep(1,413),rep(0,830),rep(1,961),rep(0,274),rep(1,12787))\n\n\nrle_zx8754 <- function(x){\n r <- rle(x)\n ix <- max(which(r$values == 1))\n sum(r$lengths[ 1:(ix - 1) ]) + 1\n}\n\nwhich_diff_Ma\xc3\xabl <- function(x){\n idx <- which(x == 1)\n idx[tail(which(diff(idx) != 1), 1) + 1]\n}\n\nrle_Grothendieck <- function(x){\n with(rle(x), tail((cumsum(lengths) - lengths + 1)[values == 1], 1))\n}\n\nuniqlist_Viking <- function(x){\n y <- data.table:::uniqlist(list(x))\n ifelse(x[y[length(y)]] == 1, y[length(y)], y[length(y) - 1])\n}\n\nwhile_Ritchie <- function(x){\n l <- length(x)\n while (x[l] - x[l - 1] != 1) {\n l <- l - 1\n }\n l\n}\nPosition_Ritchie <- function(x){\n Position(isTRUE, diff(x) == 1, right = TRUE) + 1\n}\n\ndetect_index_Ritchie <- function(x){\n purrr::detect_index(diff(x) == 1, isTRUE, .dir = "backward") + 1\n}\n\ndiff_Thomas <- function(x){\n max((2:length(x))[diff(x) == 1])\n}\n\nregex_Thomas <- function(x){\n nchar(sub("(.*01).*", "\\\\1", paste0(x, collapse = "")))\n}\n\nregexpr_Thomas <- function(x){\n attr(regexpr(".*(?<=0)1", paste0(x,collapse = ""), perl = TRUE), "match.length")\n}\n\nRun Code Online (Sandbox Code Playgroud)\n
H 1*_*H 1 12
一个简单的while循环将是一种(可能非常)快速的方法,其中所寻找的索引接近向量的末尾。
f <- function(x) {
l <- length(x)
while (x[l] - x[l - 1] != 1) {
l <- l - 1
}
l
}
f(test1)
[1] 36
f(test2)
[1] 29
Run Code Online (Sandbox Code Playgroud)
我们还可以使用Position()或purrr等效的detect_index():
Position(isTRUE, diff(test1) == 1, right = TRUE) + 1
[1] 36
purrr::detect_index(diff(test1) == 1, isTRUE, .dir = "backward") + 1
[1] 36
Run Code Online (Sandbox Code Playgroud)
我相信你有很多方法可以做到这一点,以下是一些可能的方法:
regex方法你可以尝试一下regex,比如sub+nchar
f1 <- function(v) nchar(sub("(.*01).*", "\\1", paste0(v, collapse = "")))
Run Code Online (Sandbox Code Playgroud)
或者regexpr
f2 <- function(v) attr(regexpr(".*(?<=0)1", paste0(v,collapse = ""), perl = TRUE), "match.length")
Run Code Online (Sandbox Code Playgroud)
diff方法或者,其他一些diff选择,例如
f3 <- function(v) tail(which(diff(v) == 1) + 1, 1)
Run Code Online (Sandbox Code Playgroud)
和
f4 <- function(v) max((2:length(v))[diff(v) == 1])
Run Code Online (Sandbox Code Playgroud)
rev另一种使用和 的方法match。
\nrev反转向量,以便match返回第一个命中的 可以用于查找最后一个序列。
f <- \\(x) {\n . <- rev(x)\n i <- match(1, .)\n if(is.na(i)) return(NA)\n j <- match(0, tail(., -i))\n if(is.na(j)) 1\n else length(.) - i - j + 2 }\n\nf(test1)\n#[1] 36\nf(test2)\n#[1] 29\nf(c(1,1))\n#[1] 1\nf(c(0,1))\n#[1] 2\nf(c(1,0))\n#[1] 1\nf(c(0,0))\n#[1] NA\nRun Code Online (Sandbox Code Playgroud)\n或者使用相同的方法编写一个函数Rcpp,但可以从末尾开始迭代。
Rcpp::cppFunction("int f2(NumericVector x) {\n auto i = x.end();\n while(i != x.begin() && *(--i) != 1.) ;\n while(i != x.begin() && *(--i) == 1.) ;\n if(*i != 1.) ++i;\n return i == x.end() || *i != 1. ? 0 : i - x.begin() + 1;\n}")\n\nf2(test1)\n#[1] 36\nf2(test2)\n#[1] 29\nf2(c(1,1))\n#[1] 1\nf2(c(0,1))\n#[1] 2\nf2(c(1,0))\n#[1] 1\nf2(c(0,0))\n#[1] 0\nRun Code Online (Sandbox Code Playgroud)\n或者使用rev,diff和match。
f3 <- \\(x) {\n i <- match(-1, diff(rev(x)))\n if(is.finite(i)) length(x) - i + 1\n else if(x[1] == 1) 1\n else NA\n} \n\nf3(test1)\n#[1] 36\nf3(test2)\n#[1] 29\nf3(c(1,1))\n#[1] 1\nf3(c(0,1))\n#[1] 2\nf3(c(1,0))\n#[1] 1\nf3(c(0,0))\n#[1] NA\nRun Code Online (Sandbox Code Playgroud)\n基准
\nuniqlist <- function(x) { #M.Viking\n y <- data.table:::uniqlist(list(x))\n ifelse(x[y[length(y)]] == 1, y[length(y)], y[length(y) - 1]) }\n\nwhich_diff <- function(x) { #Ma\xc3\xabl\n idx <- which(x == 1)\n idx[tail(which(c(0, diff(idx)) != 1), 1)] }\nRun Code Online (Sandbox Code Playgroud)\n# Dataset from question\nx <- rep(c(0,1,0,1,0,1), c(14736,413,830,961,274,12787))\nbench::mark(max_iterations = 1e5, f(x), f3(x), which_diff(x),\n uniqlist(x), f2(x) )\n# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc\n# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>\n#1 f(x) 199.07\xc2\xb5s 251.5\xc2\xb5s 3412. 1.21MB 76.3 1341 30\n#2 f3(x) 218.05\xc2\xb5s 319.61\xc2\xb5s 3144. 1.76MB 117. 1079 40\n#3 which_diff(x) 155.01\xc2\xb5s 177.53\xc2\xb5s 5518. 954.17KB 103. 2296 43\n#4 uniqlist(x) 17.04\xc2\xb5s 17.72\xc2\xb5s 55386. 1.36MB 4.04 27442 2\n#5 f2(x) 5.61\xc2\xb5s 6.13\xc2\xb5s 161213. 2.49KB 6.16 78462 3\n\n# Data with many changes between 0 and 1 and hit at end\nx <- rep(c(0,1), 1e6)\nbench::mark(max_iterations = 1e5, f(x), f3(x), which_diff(x),\n uniqlist(x), f2(x) )\n# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc\n# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>\n#1 f(x) 17.97ms 19.86ms 44.6 76.29MB 50.5 23 26\n#2 f3(x) 28.77ms 32.78ms 25.6 114.44MB 52.9 14 29\n#3 which_diff(x) 14.47ms 16.91ms 52.3 68.67MB 67.8 27 35\n#4 uniqlist(x) 2.66ms 3ms 294. 7.63MB 27.8 148 14\n#5 f2(x) 1.08\xc2\xb5s 1.28\xc2\xb5s 701103. 2.49KB 21.0 100000 3\n\n# Data where hit is at beginning\nx <- c(0,1,rep(0, 1e6))\nbench::mark(max_iterations = 1e5, f(x), f3(x), which_diff(x),\n uniqlist(x), f2(x) )\n# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc\n# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>\n#1 f(x) 4.34ms 6.6ms 131. 19.11MB 84.6 71 46\n#2 f3(x) 15.1ms 18.73ms 35.9 57.24MB 75.7 18 38\n#3 which_diff(x) 1.37ms 1.44ms 529. 7.63MB 93.9 265 47\n#4 uniqlist(x) 470.91\xc2\xb5s 491.54\xc2\xb5s 1994. 1.36MB 0 997 0\n#5 f2(x) 364.46\xc2\xb5s 375.08\xc2\xb5s 2649. 2.49KB 0 1325 0\n\n# Data where hit is at end\nx <- c(rep(0, 1e6),1,0)\nbench::mark(max_iterations = 1e5, f(x), f3(x), which_diff(x),\n uniqlist(x), f2(x) )\n# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc\n# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>\n#1 f(x) 10.53ms 11.33ms 69.8 38.18MB 91.8 35 46\n#2 f3(x) 14.19ms 17.18ms 37.6 57.24MB 69.3 19 35\n#3 which_diff(x) 1.38ms 1.49ms 512. 7.63MB 77.9 256 39\n#4 uniqlist(x) 479.76\xc2\xb5s 491.61\xc2\xb5s 1997. 1.36MB 0 999 0\n#5 f2(x) 1.08\xc2\xb5s 1.28\xc2\xb5s 683440. 2.49KB 27.3 100000 4\nRun Code Online (Sandbox Code Playgroud)\nRcpp 函数速度最快,分配的内存量最少。它的性能取决于在哪里可以找到匹配项。
\n