Cumulative string concatenation by group

Lia*_* S. 7 string r aggregation

Suppose I have this input:

             ID     date_1      date_2     str
1            1    2010-07-04  2008-01-20   A
2            2    2015-07-01  2011-08-31   C
3            3    2015-03-06  2013-01-18   D
4            4    2013-01-10  2011-08-30   D
5            5    2014-06-04  2011-09-18   B
6            5    2014-06-04  2011-09-18   B
7            6    2012-11-22  2011-09-28   C
8            7    2014-06-17  2013-08-04   A
10           7    2014-06-17  2013-08-04   B
11           7    2014-06-17  2013-08-04   B

I want to cumulatively concatenate the values of the str column within each group defined by ID, as shown in the following output:

             ID     date_1      date_2     str
1            1    2010-07-04  2008-01-20   A
2            2    2015-07-01  2011-08-31   C
3            3    2015-03-06  2013-01-18   D
4            4    2013-01-10  2011-08-30   D
5            5    2014-06-04  2011-09-18   B
6            5    2014-06-04  2011-09-18   B,B
7            6    2012-11-22  2011-09-28   C
8            7    2014-06-17  2013-08-04   A
10           7    2014-06-17  2013-08-04   A,B
11           7    2014-06-17  2013-08-04   A,B,B

I tried using the ave() function with this code:

within(table, {
  Emp_list <- ave(str, ID, FUN = function(x) paste(x, collapse = ","))
})

But it gives the following output, which is not what I want:

         ID      date_1     date_2      str
1         1    2010-07-04 2008-01-20     A
2         2    2015-07-01 2011-08-31     C
3         3    2015-03-06 2013-01-18     D
4         4    2013-01-10 2011-08-30     D
5         5    2014-06-04 2011-09-18     B,B
6         5    2014-06-04 2011-09-18     B,B
7         6    2012-11-22 2011-09-28     C
8         7    2014-06-17 2013-08-04     A,B,B
10        7    2014-06-17 2013-08-04     A,B,B
11        7    2014-06-17 2013-08-04     A,B,B

Of course, I would like to avoid loops, since I am working with a large dataset.

Ric*_*ven 9

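How about ave() with Reduce()? With accumulate = TRUE, Reduce() keeps every intermediate result instead of only the final one. So if we run it with paste(), we get the cumulatively pasted strings.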

f <- function(x) {
    Reduce(function(...) paste(..., sep = ", "), x, accumulate = TRUE)
}

df$str <- with(df, ave(as.character(str), ID, FUN = f))

This gives the updated data frame df:

   ID     date_1     date_2     str
1   1 2010-07-04 2008-01-20       A
2   2 2015-07-01 2011-08-31       C
3   3 2015-03-06 2013-01-18       D
4   4 2013-01-10 2011-08-30       D
5   5 2014-06-04 2011-09-18       B
6   5 2014-06-04 2011-09-18    B, B
7   6 2012-11-22 2011-09-28       C
8   7 2014-06-17 2013-08-04       A
10  7 2014-06-17 2013-08-04    A, B
11  7 2014-06-17 2013-08-04 A, B, B

Note: function(...) paste(..., sep = ", ") could also be written as function(x, y) paste(x, y, sep = ", "). (Thanks to Pierre Lafortune.)
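To see what Reduce() with accumulate = TRUE returns on its own, here is a minimal sketch on a single group's values. The vector x is made up for illustration and mirrors the str values of the ID 7 rows. Both forms of the pasting function should give the same result:

```r
# str values for one group (the ID 7 rows in the example data)
x <- c("A", "B", "B")

# accumulate = TRUE keeps each intermediate paste, not just the last one
via_dots <- Reduce(function(...) paste(..., sep = ", "), x, accumulate = TRUE)
via_xy   <- Reduce(function(x, y) paste(x, y, sep = ", "), x, accumulate = TRUE)

print(via_dots)
# [1] "A"       "A, B"    "A, B, B"
```

The result has the same length as the input, which is what ave() needs in order to write it back into the rows of each group.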

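  • @RichardScriven - I know you have - I was just explaining that `function(...)` is necessary, rather than the possibly more obvious `function(x)` - you said you were figuring out how to explain it, so I thought I'd chime in. (2 upvotes)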

Dav*_*urg 8

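Here is a possible solution that combines data.table with tapply and seems to give what you need (you could use paste instead of toString if you prefer; toString just looks cleaner to me).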

library(data.table)
setDT(df)[, Str := tapply(str[sequence(1:.N)], rep(1:.N, 1:.N), toString), by = ID]
df
#     ID     date_1     date_2 str     Str
#  1:  1 2010-07-04 2008-01-20   A       A
#  2:  2 2015-07-01 2011-08-31   C       C
#  3:  3 2015-03-06 2013-01-18   D       D
#  4:  4 2013-01-10 2011-08-30   D       D
#  5:  5 2014-06-04 2011-09-18   B       B
#  6:  5 2014-06-04 2011-09-18   B    B, B
#  7:  6 2012-11-22 2011-09-28   C       C
#  8:  7 2014-06-17 2013-08-04   A       A
#  9:  7 2014-06-17 2013-08-04   B    A, B
# 10:  7 2014-06-17 2013-08-04   B A, B, B

You could improve it a bit by computing 1:.N only once:

setDT(df)[, Str := {Len <- 1:.N ; tapply(str[sequence(Len)], rep(Len, Len), toString)}, by = ID]
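The sequence() / rep() indexing in the data.table call above can be hard to read at first, so here is a small base R sketch of what it does for one group. The names n, x, idx, grp and res are made up for illustration; inside the data.table call, n corresponds to .N and x to that group's str values:

```r
n <- 3                      # group size (what .N would be for ID 7)
x <- c("A", "B", "B")       # that group's str values

idx <- sequence(1:n)        # 1 1 2 1 2 3: positions making up each growing prefix
grp <- rep(1:n, 1:n)        # 1 2 2 3 3 3: which prefix each position belongs to

# Collapse each prefix into one comma-separated string
res <- tapply(x[idx], grp, toString)
print(as.vector(res))
# [1] "A"       "A, B"    "A, B, B"
```

Note that idx has n * (n + 1) / 2 elements, so the work grows quadratically with group size. That is likely fine for many small groups but may matter if some groups are very large.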