Aggregate all unique values of each column of a data frame

bab*_*155 8 aggregate r

I have a large data frame (1616610 rows, 255 columns) and I need to paste together the unique values of each column, grouped by a key.

For example:

> data = data.frame(a=c(1,1,1,2,2,3),
              b=c("apples", "oranges", "apples", "apples", "apples", "grapefruit"),
              c=c(12, 22, 22, 45, 67, 28), 
              d=c("Monday", "Monday", "Monday", "Tuesday", "Wednesday", "Tuesday"))
> data
  a          b  c         d
1 1     apples 12    Monday
2 1    oranges 22    Monday
3 1     apples 22    Monday
4 2     apples 45   Tuesday
5 2     apples 67 Wednesday
6 3 grapefruit 28   Tuesday

What I need is to aggregate the unique values in each of the 255 columns and return a new data frame in which each cell holds those unique values separated by commas. Like this:

  a               b      c                  d
1 1 apples, oranges 12, 22             Monday
2 2          apples 45, 67 Tuesday, Wednesday
3 3      grapefruit     28            Tuesday

I have tried using aggregate, like this:

output <- aggregate(data, by=list(data$a), paste, collapse=", ")
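
But for a data frame of this size it is far too slow (it runs for hours), and I often have to kill the process altogether. More importantly, it pastes together all values, not just the unique ones. Does anyone have tips on:

1) how to improve the aggregation time for a large data set

2) how to then get only the unique values of each field

By the way, this is my first post on SO, so thank you for your patience.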

G. *_*eck 6

Moved from the comments:

library(data.table)

dt <- as.data.table(data)
dt[, lapply(.SD, function(x) toString(unique(x))), by = a]

giving:

   a               b      c                  d
1: 1 apples, oranges 12, 22             Monday
2: 2          apples 45, 67 Tuesday, Wednesday
3: 3      grapefruit     28            Tuesday
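
For comparison, the `aggregate` call from the question can be adjusted to do the same unique-then-collapse step in base R. This is only a sketch: `collapse_unique` is a hypothetical helper name, the key column `a` is dropped from the aggregated columns so it is not pasted into itself, and nothing here has been timed on a 1.6 million row data frame, so it may well remain slower than the data.table version.

```r
# Example data from the question.
data <- data.frame(
  a = c(1, 1, 1, 2, 2, 3),
  b = c("apples", "oranges", "apples", "apples", "apples", "grapefruit"),
  c = c(12, 22, 22, 45, 67, 28),
  d = c("Monday", "Monday", "Monday", "Tuesday", "Wednesday", "Tuesday")
)

# Hypothetical helper: collapse the distinct values of one column
# into a single comma-separated string.
collapse_unique <- function(x) toString(unique(x))

# Aggregate every column except the key `a`, grouping by the key.
output <- aggregate(data[-1], by = list(a = data$a), FUN = collapse_unique)
print(output)
```

Naming the grouping list element (`a = data$a`) keeps the key column called `a` in the result instead of the default `Group.1`.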


ste*_*veb 5

You can do the following with dplyr.

Edit 1

Updated the answer to eliminate the deprecation warning caused by using summarise_each (deprecated as of dplyr 0.7.0). This uses summarise & across instead of summarise_each; note that across requires dplyr 1.0.0 or later.

library(dplyr)

func_paste <- function(x) paste(unique(x), collapse = ', ')
data %>%
  group_by(a) %>%
  summarise(across(everything(), func_paste))

# Without "func_paste", using paste directly (from Alistaire's comment).
data %>%
  group_by(a) %>%
  summarise(across(everything(), ~ paste(unique(.), collapse = ', ')))

## # A tibble: 3 × 4
##       a b               c      d
##   <dbl> <chr>           <chr>  <chr>
## 1     1 apples, oranges 12, 22 Monday
## 2     2 apples          45, 67 Tuesday, Wednesday
## 3     3 grapefruit      28     Tuesday

Previous answer, which produces a deprecation warning (as of dplyr 0.7.0):

func_paste <- function(x) paste(unique(x), collapse = ', ')
data %>%
    group_by(a) %>%
    summarise_each(funs(func_paste))

##      a               b      c                  d
##  (dbl)           (chr)  (chr)              (chr)
##1     1 apples, oranges 12, 22             Monday
##2     2          apples 45, 67 Tuesday, Wednesday
##3     3      grapefruit     28            Tuesday

# Without "func_paste", using paste directly (from Alistaire's comment).
data %>%
  group_by(a) %>%
  summarise_each(funs(paste(unique(.), collapse = ', ')))