Lia*_* S. 7 string r aggregation
Suppose I have this input:
ID date_1 date_2 str
1 1 2010-07-04 2008-01-20 A
2 2 2015-07-01 2011-08-31 C
3 3 2015-03-06 2013-01-18 D
4 4 2013-01-10 2011-08-30 D
5 5 2014-06-04 2011-09-18 B
6 5 2014-06-04 2011-09-18 B
7 6 2012-11-22 2011-09-28 C
8 7 2014-06-17 2013-08-04 A
10 7 2014-06-17 2013-08-04 B
11 7 2014-06-17 2013-08-04 B
I want to concatenate the values of the str column cumulatively within each group defined by ID, as shown in the following output:
ID date_1 date_2 str
1 1 2010-07-04 2008-01-20 A
2 2 2015-07-01 2011-08-31 C
3 3 2015-03-06 2013-01-18 D
4 4 2013-01-10 2011-08-30 D
5 5 2014-06-04 2011-09-18 B
6 5 2014-06-04 2011-09-18 B,B
7 6 2012-11-22 2011-09-28 C
8 7 2014-06-17 2013-08-04 A
10 7 2014-06-17 2013-08-04 A,B
11 7 2014-06-17 2013-08-04 A,B,B
I tried the ave() function with this code:
within(table, {
Emp_list <- ave(str, ID, FUN = function(x) paste(x, collapse = ","))
})
But it gives the following output, which is not what I want:
ID date_1 date_2 str
1 1 2010-07-04 2008-01-20 A
2 2 2015-07-01 2011-08-31 C
3 3 2015-03-06 2013-01-18 D
4 4 2013-01-10 2011-08-30 D
5 5 2014-06-04 2011-09-18 B,B
6 5 2014-06-04 2011-09-18 B,B
7 6 2012-11-22 2011-09-28 C
8 7 2014-06-17 2013-08-04 A,B,B
10 7 2014-06-17 2013-08-04 A,B,B
11 7 2014-06-17 2013-08-04 A,B,B
Of course, I would like to avoid loops, since I work with a large database.
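To make the input above reproducible, it can be rebuilt as a data frame. This is only a sketch: the name df, the Date class for the two date columns, and the character (rather than factor) type of str are assumptions, since the question shows only the printed table.

```r
# Rebuild the example input. Column types are assumptions; the question
# only shows the printed output, not how the data were created.
df <- data.frame(
  ID     = c(1, 2, 3, 4, 5, 5, 6, 7, 7, 7),
  date_1 = as.Date(c("2010-07-04", "2015-07-01", "2015-03-06", "2013-01-10",
                     "2014-06-04", "2014-06-04", "2012-11-22", "2014-06-17",
                     "2014-06-17", "2014-06-17")),
  date_2 = as.Date(c("2008-01-20", "2011-08-31", "2013-01-18", "2011-08-30",
                     "2011-09-18", "2011-09-18", "2011-09-28", "2013-08-04",
                     "2013-08-04", "2013-08-04")),
  str    = c("A", "C", "D", "D", "B", "B", "C", "A", "B", "B"),
  # Row names as printed in the question (there is no row 9).
  row.names = as.character(c(1:8, 10, 11)),
  stringsAsFactors = FALSE
)
```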
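How about ave() with Reduce()? The Reduce() function lets us accumulate intermediate results as they are computed. So if we run it with paste(), we can accumulate the pasted strings.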
f <- function(x) {
Reduce(function(...) paste(..., sep = ", "), x, accumulate = TRUE)
}
df$str <- with(df, ave(as.character(str), ID, FUN = f))
This gives the updated data frame df:
ID date_1 date_2 str
1 1 2010-07-04 2008-01-20 A
2 2 2015-07-01 2011-08-31 C
3 3 2015-03-06 2013-01-18 D
4 4 2013-01-10 2011-08-30 D
5 5 2014-06-04 2011-09-18 B
6 5 2014-06-04 2011-09-18 B, B
7 6 2012-11-22 2011-09-28 C
8 7 2014-06-17 2013-08-04 A
10 7 2014-06-17 2013-08-04 A, B
11 7 2014-06-17 2013-08-04 A, B, B
Note: function(...) paste(..., sep = ", ") could also be written as function(x, y) paste(x, y, sep = ", "). (Thanks to Pierre Lafortune.)
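To see what Reduce() does inside each group, it may help to run it on a single group's values outside of ave(). The sketch below uses the three str values of ID 7 from the question; the variable names x, acc and all_at_once are illustrative only.

```r
# Values of str for one group (ID 7 in the question).
x <- c("A", "B", "B")

# accumulate = TRUE keeps every intermediate result instead of only the last,
# so element i is the first i values pasted together.
acc <- Reduce(function(a, b) paste(a, b, sep = ", "), x, accumulate = TRUE)
# acc is c("A", "A, B", "A, B, B")

# For contrast, collapsing the whole group yields a single string,
# which ave() then recycles to every row of the group.
all_at_once <- paste(x, collapse = ", ")
# all_at_once is "A, B, B"
```

Because the accumulated result has the same length as its input, ave() can place it back row by row, which is what gives the cumulative output.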
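Here is a possible solution that combines data.table with tapply and seems to give what you need (you could use paste instead of toString if you prefer; it just looks cleaner to me this way).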
library(data.table)
setDT(df)[, Str := tapply(str[sequence(1:.N)], rep(1:.N, 1:.N), toString), by = ID]
df
# ID date_1 date_2 str Str
# 1: 1 2010-07-04 2008-01-20 A A
# 2: 2 2015-07-01 2011-08-31 C C
# 3: 3 2015-03-06 2013-01-18 D D
# 4: 4 2013-01-10 2011-08-30 D D
# 5: 5 2014-06-04 2011-09-18 B B
# 6: 5 2014-06-04 2011-09-18 B B, B
# 7: 6 2012-11-22 2011-09-28 C C
# 8: 7 2014-06-17 2013-08-04 A A
# 9: 7 2014-06-17 2013-08-04 B A, B
# 10: 7 2014-06-17 2013-08-04 B A, B, B
You can improve on it slightly by computing 1:.N only once:
setDT(df)[, Str := {Len <- 1:.N ; tapply(str[sequence(Len)], rep(Len, Len), toString)}, by = ID]
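The indexing in the tapply approach can be unpacked in base R for a single group. This is a sketch only: n stands in for data.table's .N, and the names x, idx, grp and res are illustrative.

```r
# Values of str for one group (ID 7 in the question).
x <- c("A", "B", "B")
n <- length(x)            # plays the role of .N within a group

# sequence(1:n) concatenates 1:1, 1:2, ..., 1:n  -> 1, 1 2, 1 2 3
idx <- sequence(1:n)

# rep(1:n, 1:n) labels each of those runs        -> 1, 2 2, 3 3 3
grp <- rep(1:n, 1:n)

# x[idx] repeats the prefixes of x; tapply() then collapses each prefix.
res <- as.vector(tapply(x[idx], grp, toString))
# res is c("A", "A, B", "A, B, B")
```

Note that idx and grp each have n * (n + 1) / 2 elements, so the intermediate vectors grow quadratically with group size; this is unlikely to matter for small groups but is worth keeping in mind for very large ones.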