以智能方式将所有列一一比较

and*_*ang 8 r filter dplyr

有变量x,,yz我想过滤掉其中任何一个没有巨大差距的(差距小于5%)。下面的代码可以模拟,但是如果我想要更多的变量进行比较,代码就会很无聊。有什么聪明的方法吗?谢谢!

library(tidyverse)

diamonds %>%
  select(x, y, z) %>%
  filter(abs((x - y)/ max(x, y)) < 0.05,
         abs((x - z)/ max(x, z)) < 0.05,
         abs((y - z)/ max(y, z)) < 0.05)
Run Code Online (Sandbox Code Playgroud)

Dar*_*sai 8

您可以创建条件表达式列表,并将其注入filter()with!!!运算符。

\n
nms <- diamonds %>% select(x, y, z) %>% names()\n\n# Use stringr::str_glue_data / glue::glue_data\nconds <- lapply(str_glue_data(data.frame(t(combn(nms, 2))),\n                              "abs(({X1}-{X2})/max({X1},{X2}))<0.05"), str2lang)\n\ndiamonds %>%\n  filter(!!!conds)\n
Run Code Online (Sandbox Code Playgroud)\n
输出
\n
# # A tibble: 2,023 \xc3\x97 10\n#    carat cut       color clarity depth table price     x     y     z\n#    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n#  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43\n#  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31\n#  3  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63\n#  4  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75\n#  5  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48\n#  6  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47\n#  7  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53\n#  8  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49\n#  9  0.3  Good      J     SI1      64      55   339  4.25  4.28  2.73\n# 10  0.23 Ideal     J     VS1      62.8    56   340  3.93  3.9   2.46\n# \xe2\x84\xb9 2,013 more rows\n# \xe2\x84\xb9 Use `print(n = ...)` to see more rows\n
Run Code Online (Sandbox Code Playgroud)\n

conds通话列表在哪里

\n
# # A tibble: 2,023 \xc3\x97 10\n#    carat cut       color clarity depth table price     x     y     z\n#    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n#  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43\n#  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31\n#  3  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63\n#  4  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75\n#  5  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48\n#  6  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47\n#  7  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53\n#  8  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49\n#  9  0.3  Good      J     SI1      64      55   339  4.25  4.28  2.73\n# 10  0.23 Ideal     J     VS1      62.8    56   340  3.93  3.9   2.46\n# \xe2\x84\xb9 2,013 more rows\n# \xe2\x84\xb9 Use `print(n = ...)` to see more rows\n
Run Code Online (Sandbox Code Playgroud)\n
\n

在更大的数据集上进行基准测试

\n
Unit: relative\n    expr       min        lq      mean   median        uq       max neval\n  Darren  1.000000  1.000000  1.000000  1.00000  1.000000  1.000000    50\n  jay.sf  1.634272  1.792466  1.714295  1.56970  1.652166  2.313737    50\n Thomas1 14.692111 14.537339 12.906745 12.44955 12.554815 10.956666    50\n Thomas2 14.475949 14.427519 12.857807 12.40311 12.398392 11.138774    50\n      RB 93.001529 92.045358 82.934210 78.32732 78.497791 77.335309    50\n
Run Code Online (Sandbox Code Playgroud)\n

使用以下代码,应在 10,000 行中保留 633 行。

\n
# [[1]]\n# abs((x - y)/max(x, y)) < 0.05\n# \n# [[2]]\n# abs((x - z)/max(x, z)) < 0.05\n# \n# [[3]]\n# abs((y - z)/max(y, z)) < 0.05\n
Run Code Online (Sandbox Code Playgroud)\n


jay*_*.sf 8

这个受益于 C++,应该运行得很快。

\n
> smrt_comp <- \\(data, cmpv, thr=.05) {\n+   matrixStats::colSums2(RcppAlgos::comboGeneral(cmpv, 2, FUN=\\(x) {\n+     m <- as.matrix(data[, x])\n+     abs(matrixStats::rowDiffs(m)/max(m)) < thr\n+   }, FUN.VALUE=array(, nrow(data)))) == length(cmpv)\n+ }\n> diamonds[smrt_comp(diamonds, c('x', 'y', 'z')), ]\n# A tibble: 2,023 \xc3\x97 10\n   carat cut       color clarity depth table price     x     y     z\n   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43\n 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31\n 3  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63\n 4  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75\n 5  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48\n 6  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47\n 7  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53\n 8  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49\n 9  0.3  Good      J     SI1      64      55   339  4.25  4.28  2.73\n10  0.23 Ideal     J     VS1      62.8    56   340  3.93  3.9   2.46\n# \xe2\x84\xb9 2,013 more rows\n# \xe2\x84\xb9 Use `print(n = ...)` to see more rows\n
Run Code Online (Sandbox Code Playgroud)\n
\n

数据:

\n
data(diamonds, package="ggplot2")\n
Run Code Online (Sandbox Code Playgroud)\n

  • @DarrenTsai PS:对于基准测试,我发现 `RcppAlgos::comboGeneral` 有一个 `parallel=` 选项。它在您的基准测试中的表现可能很有趣。 (2认同)

Rui*_*das 7

这是一个方法。

\n
    \n
  • 编写一个函数gap来计算相对距离;
  • \n
  • 将函数应用于两两列的每个组合;
  • \n
  • 现在找出哪些间隙在允许的最大间隙内,并将逻辑结果放入按列 ( map_dfc) 绑定的 tibble 中;
  • \n
  • rowSums将找到总共有多少行TRUE,并将其与创建新列的列数进行比较keep
  • \n
  • 与原始数据集绑定,过滤保留和清理的值。
  • \n
\n
suppressPackageStartupMessages(\n  library(tidyverse)\n)\ndata(diamonds, package = "ggplot2")\n\ngap <- function(data) {\n  x <- data[[1]]\n  y <- data[[2]]\n  abs((x - y)/max(x, y))\n}\n\nmax_gap <- 0.05\n\ndiamonds %>% \n  select(x, y, z) %>%\n  combn(m = 2L, gap, simplify = FALSE) %>%\n  map_dfc(\\(x) x < max_gap) %>%\n  mutate(keep = rowSums(.) == ncol(.)) %>%\n  bind_cols(diamonds) %>%\n  filter(keep) %>%\n  select(-(1:4))\n#> New names:\n#> \xe2\x80\xa2 `` -> `...1`\n#> \xe2\x80\xa2 `` -> `...2`\n#> \xe2\x80\xa2 `` -> `...3`\n#> # A tibble: 2,023 \xc3\x97 10\n#>    carat cut       color clarity depth table price     x     y     z\n#>    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n#>  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43\n#>  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31\n#>  3  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63\n#>  4  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75\n#>  5  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48\n#>  6  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47\n#>  7  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53\n#>  8  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49\n#>  9  0.3  Good      J     SI1      64      55   339  4.25  4.28  2.73\n#> 10  0.23 Ideal     J     VS1      62.8    56   340  3.93  3.9   2.46\n#> # \xe2\x84\xb9 2,013 more rows\n
Run Code Online (Sandbox Code Playgroud)\n

创建于 2023-12-05

\n
\n

为了使代码更通用,请使用列向量进行选择。这将避免对要在管道末端丢弃的列进行硬编码。

\n
cols_to_process <- c("x", "y", "z")\n\ndiamonds %>% \n  select(all_of(cols_to_process)) %>%\n  combn(m = 2L, gap, simplify = FALSE) %>%\n  map_dfc(\\(x) x < max_gap) %>%\n  mutate(keep = rowSums(.) == ncol(.)) %>%\n  bind_cols(diamonds) %>%\n  filter(keep) %>%\n  select(-seq_len(choose(length(cols_to_process), 2)), -keep)\n
Run Code Online (Sandbox Code Playgroud)\n


Tho*_*ing 7

你可以尝试combn+rowMeans像下面这样

\n
diamonds %>%\n    filter(\n        rowMeans(\n            combn(\n                select(., x, y, z),\n                2,\n                \\(v) abs(v[[1]] - v[[2]]) / max(unlist(v)) < 0.05\n            )\n        ) == 1\n    )\n
Run Code Online (Sandbox Code Playgroud)\n

subset或使用+ Reduce+的基本 R 选项combn

\n
subset(\n    diamonds,\n    Reduce(\n        `&`,\n        combn(\n            list(x, y, z),\n            2,\n            \\(v) abs(v[[1]] - v[[2]]) / max(unlist(v)) < 0.05,\n            simplify = FALSE\n        )\n    ) \n
Run Code Online (Sandbox Code Playgroud)\n

这使

\n
# A tibble: 2,023 \xc3\x97 10\n   carat cut       color clarity depth table price     x     y     z\n   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43\n 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31\n 3  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63\n 4  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75\n 5  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48\n 6  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47\n 7  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53\n 8  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49\n 9  0.3  Good      J     SI1      64      55   339  4.25  4.28  2.73\n10  0.23 Ideal     J     VS1      62.8    56   340  3.93  3.9   2.46\n# \xe2\x84\xb9 2,013 more rows\n# \xe2\x84\xb9 Use `print(n = ...)` to see more rows\n
Run Code Online (Sandbox Code Playgroud)\n

基准

\n
f1 <- function() {\n    diamonds %>%\n        filter(\n            rowMeans(\n                combn(\n                    select(., x, y, z),\n                    2,\n                    \\(v) abs(v[[1]] - v[[2]]) / max(unlist(v)) < 0.05\n                )\n            ) == 1\n        )\n}\n\n\nf2 <- function() {\n    subset(\n        diamonds,\n        Reduce(\n            `&`,\n            combn(\n                list(x, y, z),\n                2,\n                \\(v) abs(v[[1]] - v[[2]]) / max(unlist(v)) < 0.05,\n                simplify = FALSE\n            )\n        )\n    )\n}\n\nmicrobenchmark(\n    f1 = f1(),\n    f2 = f2(),\n    unit = "relative",\n    check = "equivalent",\n    times = 50L\n)\n
Run Code Online (Sandbox Code Playgroud)\n

节目

\n
Unit: relative\n expr      min       lq     mean   median       uq      max neval\n   f1 25.91684 23.46633 17.96225 22.78245 19.55927 7.241132    50\n   f2  1.00000  1.00000  1.00000  1.00000  1.00000 1.000000    50\n
Run Code Online (Sandbox Code Playgroud)\n