Data.table：将功能应用于组，参考每个组中的设置值。将结果列传递给函数

Question

Data.table：将功能应用于组，参考每个组中的设置值。将结果列传递给函数

Mah*_*ero 2 grouping r ggplot2 data.table

我有长格式的数据，这些数据将按地理位置分组。我想计算每个感兴趣的变量与所有其他感兴趣的变量之间的差异。我无法在单个数据表语句中弄清楚如何有效地执行此操作，因此解决方法也随之引入了一些新错误（我修复了具有更多解决方法的错误，但在这里也将提供帮助！）。

然后，我想将结果列传递给ggplot函数，但是无法使用推荐的方法来工作，因此我使用了不推荐使用的方法。

library(data.table)
library(ggplot2)

set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16) )

> results[c(1:4,13:16)]
   geography variable   statistic
1:         1    alpha -0.62645381
2:         1    bravo  0.18364332
3:         1  charlie -0.83562861
4:         1    delta  1.59528080
5:         4    alpha -0.62124058
6:         4    bravo -2.21469989
7:         4  charlie  1.12493092
8:         4    delta -0.04493361

base_variable <- "alpha"

Run Code Online (Sandbox Code Playgroud)

从这一点出发，理想情况下，我希望编写一个简单的代码，按地理位置分组，然后以相同的格式返回此表，但每个组中每个变量的统计信息为（base_variable-变量）。

我不知道如何执行此操作，因此下面是我的解决方法，我们欢迎您提出有关更好方法的建议。

# Convert to a wide table so we can do the subtraction by rows
results_wide <- dcast(results, geography ~ variable, value.var = "statistic")

   geography      alpha      bravo    charlie       delta
1:         1 -0.6264538  0.1836433 -0.8356286  1.59528080
2:         2  0.3295078 -0.8204684  0.4874291  0.73832471
3:         3  0.5757814 -0.3053884  1.5117812  0.38984324
4:         4 -0.6212406 -2.2146999  1.1249309 -0.04493361

this_is_a_hack <- as.data.table(lapply(results_wide[,-1], function(x) results_wide[, ..base_variable] - x))

   alpha.alpha bravo.alpha charlie.alpha delta.alpha
1:           0  -0.8100971     0.2091748  -2.2217346
2:           0   1.1499762    -0.1579213  -0.4088169
3:           0   0.8811697    -0.9359998   0.1859381
4:           0   1.5934593    -1.7461715  -0.5763070

Run Code Online (Sandbox Code Playgroud)

现在名称混乱了，我们没有地理位置。为什么这样的名字？另外，需要重新添加地理位置。

this_is_a_hack[, geography := results_wide[, geography] ]

normalise_these_names <- colnames(this_is_a_hack)
#Regex approach. Hacky and situational. 
new_names <- sub("\\.(.*)", "", normalise_these_names[normalise_these_names != "geography"] )
normalise_these_names[normalise_these_names != "geography"] <- new_names
#Makes use of the fact that geographies will appear last in the data.table, not generalisable approach.
colnames(this_is_a_hack) <- normalise_these_names

Run Code Online (Sandbox Code Playgroud)

我不再需要基本变量，因为所有值均为零，因此我尝试删除它，但是我似乎无法以通常的方式做到这一点：

this_is_a_hack[, ..base_variable := NULL] 
Warning message:
In `[.data.table`(this_is_a_hack, , `:=`(..base_variable, NULL)) :
  Column '..base_variable' does not exist to remove

library(dplyr)
this_is_a_hack <- select(this_is_a_hack, -base_variable)

final_result <- melt(this_is_a_hack, id.vars = "geography")

> final_result[c(1:4,9:12)]
   geography variable      value
1:         1    bravo -0.8100971
2:         2    bravo  1.1499762
3:         3    bravo  0.8811697
4:         4    bravo  1.5934593
5:         1    delta -2.2217346
6:         2    delta -0.4088169
7:         3    delta  0.1859381
8:         4    delta -0.5763070

Run Code Online (Sandbox Code Playgroud)

现在就可以将数据可视化了。我正在尝试将这些变量传递到绘图函数中，但是与dataframes相比，引用data.table列似乎很困难。显然，您应该使用quosure来将data.table变量传递到函数中，但是这只是出错了，因此我改用了不推荐使用的'aes_string'函数-对此也有所帮助。

plott <- function(dataset, varx, vary, fillby) {
  # varx <- ensym(varx)
  # vary <- ensym(vary)
  # vary <- ensym(fillby)
  ggplot(dataset, 
         aes_string(x = varx, y = vary, color = fillby)) + 
    geom_point()
}

plott(dataset = final_result,
      varx = "geography",
      vary = "value",
      fillby = "variable")

# Error I get when I try the ensym(...) method in the function:
Don't know how to automatically pick scale for object of type name. Defaulting to continuous. (this message happens 3 times)
Error: Aesthetics must be valid data columns. Problematic aesthetic(s): x = varx, y = vary, colour = fillby. 
Did you mistype the name of a data column or forget to add stat()?

Run Code Online (Sandbox Code Playgroud)

Answer 1

akr*_*run 8

一种选择是通过基于“变量”创建逻辑条件来对“统计”进行子集化，其中“ base_variable”元素按“地理位置”分组

results[, .(variable, diff = statistic - statistic[variable == base_variable]), 
       by = geography][variable != base_variable]
# geography variable       diff
# 1:         1    bravo  0.8100971
# 2:         1  charlie -0.2091748
# 3:         1    delta  2.2217346
# 4:         2    bravo -1.1499762
# 5:         2  charlie  0.1579213
# 6:         2    delta  0.4088169
# 7:         3    bravo -0.8811697
# 8:         3  charlie  0.9359998
# 9:         3    delta -0.1859381
#10:         4    bravo -1.5934593
#11:         4  charlie  1.7461715
#12:         4    delta  0.5763070

Run Code Online (Sandbox Code Playgroud)

哇！您花了几分钟时间解决了我一个多小时的问题，在另一列中对统计数据进行设置非常合理（并且比我设想的某种滞后/平移解决方案要好）。谢谢=） (2认同)

Answer 2

Ice*_*can 7

这种事情也可以通过联接来完成。以我的经验，对于较小的表（如本例），“子集变量+分组”方法通常更快，而在具有数百万行的情况下，联接方法则更快。

results[variable != base_variable
        ][results[variable == base_variable], on = 'geography',
          diff := statistic - i.statistic][]

#     geography variable   statistic       diff
#  1:         1    bravo  0.18364332  0.8100971
#  2:         1  charlie -0.83562861 -0.2091748
#  3:         1    delta  1.59528080  2.2217346
#  4:         2    bravo -0.82046838 -1.1499762
#  5:         2  charlie  0.48742905  0.1579213
#  6:         2    delta  0.73832471  0.4088169
#  7:         3    bravo -0.30538839 -0.8811697
#  8:         3  charlie  1.51178117  0.9359998
#  9:         3    delta  0.38984324 -0.1859381
# 10:         4    bravo -2.21469989 -1.5934593
# 11:         4  charlie  1.12493092  1.7461715
# 12:         4    delta -0.04493361  0.5763070

Run Code Online (Sandbox Code Playgroud)

两个基准

library(microbenchmark)
microbenchmark(
use_group = 
  results[, .(variable, diff = statistic - statistic[variable == base_variable]), 
           by = geography][variable != base_variable],
use_join = 
results[variable != base_variable
        ][results[variable == base_variable], on = 'geography',
          diff := statistic - i.statistic][],
times = 10
)

# Unit: milliseconds
#       expr      min       lq     mean   median       uq      max neval cld
#  use_group 1.624204 1.801434 2.143670 2.212306 2.391793 2.654357    10  a 
#   use_join 6.297842 6.808610 7.626004 7.729634 8.337635 8.708916    10   b

results <- results[rep(1:.N, 1e4)][, geography := rleid(geography)]

microbenchmark(
use_group = 
  results[, .(variable, diff = statistic - statistic[variable == base_variable]), 
           by = geography][variable != base_variable],
use_join = 
results[variable != base_variable
        ][results[variable == base_variable], on = 'geography',
          diff := statistic - i.statistic][],
times = 10
)


# Unit: milliseconds
#       expr      min        lq      mean    median        uq      max neval cld
#  use_group 97.42187 106.80935 132.42537 120.64893 143.03045 208.1996    10   b
#   use_join 19.88511  21.86214  26.22012  25.82972  29.29885  36.0853    10  a

Run Code Online (Sandbox Code Playgroud)

归档时间：	6 年，6 月前
查看次数：	75 次
最近记录：	6 年，6 月前