有变量x,,y。z我想过滤掉其中任何一个没有巨大差距的(差距小于5%)。下面的代码可以模拟,但是如果我想要更多的变量进行比较,代码就会很无聊。有什么聪明的方法吗?谢谢!
library(tidyverse)
diamonds %>%
select(x, y, z) %>%
filter(abs((x - y)/ max(x, y)) < 0.05,
abs((x - z)/ max(x, z)) < 0.05,
abs((y - z)/ max(y, z)) < 0.05)
Run Code Online (Sandbox Code Playgroud)
您可以创建条件表达式列表,并将其注入filter()with!!!运算符。
nms <- diamonds %>% select(x, y, z) %>% names()\n\n# Use stringr::str_glue_data / glue::glue_data\nconds <- lapply(str_glue_data(data.frame(t(combn(nms, 2))),\n "abs(({X1}-{X2})/max({X1},{X2}))<0.05"), str2lang)\n\ndiamonds %>%\n filter(!!!conds)\nRun Code Online (Sandbox Code Playgroud)\n# # A tibble: 2,023 \xc3\x97 10\n# carat cut color clarity depth table price x y z\n# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43\n# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31\n# 3 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63\n# 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75\n# 5 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48\n# 6 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47\n# 7 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53\n# 8 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49\n# 9 0.3 Good J SI1 64 55 339 4.25 4.28 2.73\n# 10 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46\n# \xe2\x84\xb9 2,013 more rows\n# \xe2\x84\xb9 Use `print(n = ...)` to see more rows\nRun Code Online (Sandbox Code Playgroud)\nconds通话列表在哪里
# # A tibble: 2,023 \xc3\x97 10\n# carat cut color clarity depth table price x y z\n# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43\n# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31\n# 3 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63\n# 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75\n# 5 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48\n# 6 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47\n# 7 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53\n# 8 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49\n# 9 0.3 Good J SI1 64 55 339 4.25 4.28 2.73\n# 10 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46\n# \xe2\x84\xb9 2,013 more rows\n# \xe2\x84\xb9 Use `print(n = ...)` to see more rows\nRun Code Online (Sandbox Code Playgroud)\nUnit: relative\n expr min lq mean median uq max neval\n Darren 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 50\n jay.sf 1.634272 1.792466 1.714295 1.56970 1.652166 2.313737 50\n Thomas1 14.692111 14.537339 12.906745 12.44955 12.554815 10.956666 50\n Thomas2 14.475949 14.427519 12.857807 12.40311 12.398392 11.138774 50\n RB 93.001529 92.045358 82.934210 78.32732 78.497791 77.335309 50\nRun Code Online (Sandbox Code Playgroud)\n使用以下代码,应在 10,000 行中保留 633 行。
\n# [[1]]\n# abs((x - y)/max(x, y)) < 0.05\n# \n# [[2]]\n# abs((x - z)/max(x, z)) < 0.05\n# \n# [[3]]\n# abs((y - z)/max(y, z)) < 0.05\nRun Code Online (Sandbox Code Playgroud)\n
这个受益于 C++,应该运行得很快。
\n> smrt_comp <- \\(data, cmpv, thr=.05) {\n+ matrixStats::colSums2(RcppAlgos::comboGeneral(cmpv, 2, FUN=\\(x) {\n+ m <- as.matrix(data[, x])\n+ abs(matrixStats::rowDiffs(m)/max(m)) < thr\n+ }, FUN.VALUE=array(, nrow(data)))) == length(cmpv)\n+ }\n> diamonds[smrt_comp(diamonds, c('x', 'y', 'z')), ]\n# A tibble: 2,023 \xc3\x97 10\n carat cut color clarity depth table price x y z\n <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43\n 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31\n 3 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63\n 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75\n 5 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48\n 6 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47\n 7 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53\n 8 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49\n 9 0.3 Good J SI1 64 55 339 4.25 4.28 2.73\n10 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46\n# \xe2\x84\xb9 2,013 more rows\n# \xe2\x84\xb9 Use `print(n = ...)` to see more rows\nRun Code Online (Sandbox Code Playgroud)\n数据:
\ndata(diamonds, package="ggplot2")\nRun Code Online (Sandbox Code Playgroud)\n
这是一个方法。
\ngap来计算相对距离;map_dfc) 绑定的 tibble 中;rowSums将找到总共有多少行TRUE,并将其与创建新列的列数进行比较keep;suppressPackageStartupMessages(\n library(tidyverse)\n)\ndata(diamonds, package = "ggplot2")\n\ngap <- function(data) {\n x <- data[[1]]\n y <- data[[2]]\n abs((x - y)/max(x, y))\n}\n\nmax_gap <- 0.05\n\ndiamonds %>% \n select(x, y, z) %>%\n combn(m = 2L, gap, simplify = FALSE) %>%\n map_dfc(\\(x) x < max_gap) %>%\n mutate(keep = rowSums(.) == ncol(.)) %>%\n bind_cols(diamonds) %>%\n filter(keep) %>%\n select(-(1:4))\n#> New names:\n#> \xe2\x80\xa2 `` -> `...1`\n#> \xe2\x80\xa2 `` -> `...2`\n#> \xe2\x80\xa2 `` -> `...3`\n#> # A tibble: 2,023 \xc3\x97 10\n#> carat cut color clarity depth table price x y z\n#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n#> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43\n#> 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31\n#> 3 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63\n#> 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75\n#> 5 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48\n#> 6 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47\n#> 7 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53\n#> 8 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49\n#> 9 0.3 Good J SI1 64 55 339 4.25 4.28 2.73\n#> 10 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46\n#> # \xe2\x84\xb9 2,013 more rows\nRun Code Online (Sandbox Code Playgroud)\n创建于 2023-12-05
\n为了使代码更通用,请使用列向量进行选择。这将避免对要在管道末端丢弃的列进行硬编码。
\ncols_to_process <- c("x", "y", "z")\n\ndiamonds %>% \n select(all_of(cols_to_process)) %>%\n combn(m = 2L, gap, simplify = FALSE) %>%\n map_dfc(\\(x) x < max_gap) %>%\n mutate(keep = rowSums(.) == ncol(.)) %>%\n bind_cols(diamonds) %>%\n filter(keep) %>%\n select(-seq_len(choose(length(cols_to_process), 2)), -keep)\nRun Code Online (Sandbox Code Playgroud)\n
你可以尝试combn+rowMeans像下面这样
diamonds %>%\n filter(\n rowMeans(\n combn(\n select(., x, y, z),\n 2,\n \\(v) abs(v[[1]] - v[[2]]) / max(unlist(v)) < 0.05\n )\n ) == 1\n )\nRun Code Online (Sandbox Code Playgroud)\nsubset或使用+ Reduce+的基本 R 选项combn
subset(\n diamonds,\n Reduce(\n `&`,\n combn(\n list(x, y, z),\n 2,\n \\(v) abs(v[[1]] - v[[2]]) / max(unlist(v)) < 0.05,\n simplify = FALSE\n )\n ) \nRun Code Online (Sandbox Code Playgroud)\n这使
\n# A tibble: 2,023 \xc3\x97 10\n carat cut color clarity depth table price x y z\n <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43\n 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31\n 3 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63\n 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75\n 5 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48\n 6 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47\n 7 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53\n 8 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49\n 9 0.3 Good J SI1 64 55 339 4.25 4.28 2.73\n10 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46\n# \xe2\x84\xb9 2,013 more rows\n# \xe2\x84\xb9 Use `print(n = ...)` to see more rows\nRun Code Online (Sandbox Code Playgroud)\nf1 <- function() {\n diamonds %>%\n filter(\n rowMeans(\n combn(\n select(., x, y, z),\n 2,\n \\(v) abs(v[[1]] - v[[2]]) / max(unlist(v)) < 0.05\n )\n ) == 1\n )\n}\n\n\nf2 <- function() {\n subset(\n diamonds,\n Reduce(\n `&`,\n combn(\n list(x, y, z),\n 2,\n \\(v) abs(v[[1]] - v[[2]]) / max(unlist(v)) < 0.05,\n simplify = FALSE\n )\n )\n )\n}\n\nmicrobenchmark(\n f1 = f1(),\n f2 = f2(),\n unit = "relative",\n check = "equivalent",\n times = 50L\n)\nRun Code Online (Sandbox Code Playgroud)\n节目
\nUnit: relative\n expr min lq mean median uq max neval\n f1 25.91684 23.46633 17.96225 22.78245 19.55927 7.241132 50\n f2 1.00000 1.00000 1.00000 1.00000 1.00000 1.000000 50\nRun Code Online (Sandbox Code Playgroud)\n