Concatenate values by group in descending order

use*_*906 3 r plyr data.table

I want to transform a dataset. My data looks like this:

author_id paper_id prob
   731    24943    1
   731    24943    1
   731   688974    1
   731   964345    .8
   731  1201905    .9
   731  1267992    1
   736    249      .2
   736   6889      1
   736   94345    .7
   736  1201905    .9
   736  126992    .8

My desired output is:

author_id    paper_id
  731        24943,24943,688974,1267992,1201905,964345
  736        6889,1201905,126992,94345,249

That is, the paper_id values for each author are listed in descending order of prob.

If I used a combination of SQL and R, I think the solution would be

statement <- "select * from A
              ORDER BY author_id, prob DESC"

and then use paste in R once the paper_id values are in order.

But I need a complete solution in R. Can this be done?

Thanks

Dav*_*urg 10

If temp is your dataset, you can do:

library(data.table)
setDT(temp)[order(-prob), list(paper_id = paste0(paper_id, collapse=", ")), by = author_id]
##    author_id                                       paper_id
## 1:       731 24943, 24943, 688974, 1267992, 1201905, 964345
## 2:       736              6889, 1201905, 126992, 94345, 249

Edit: August 11, 2014

As of data.table v1.9.4, you can use setorder instead of order. It is very efficient because it reorders the data by reference.

str(temp)
setorder(setDT(temp), -prob)[, list(paper_id = paste0(paper_id, collapse=", ")), by = author_id]
##    author_id                                       paper_id
## 1:       731 24943, 24943, 688974, 1267992, 1201905, 964345
## 2:       736              6889, 1201905, 126992, 94345, 249
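
Because setorder works by reference, it also changes the row order of temp itself. If that side effect is unwanted, one option is to sort a copy. The following is a minimal sketch under that assumption; the data is rebuilt here from the question so the block runs on its own, and the names sorted and res are only illustrative.

```r
library(data.table)

# The question's data, rebuilt so the sketch is self-contained
temp <- data.table(
  author_id = c(731L, 731L, 731L, 731L, 731L, 731L, 736L, 736L, 736L, 736L, 736L),
  paper_id  = c(24943L, 24943L, 688974L, 964345L, 1201905L, 1267992L,
                249L, 6889L, 94345L, 1201905L, 126992L),
  prob      = c(1, 1, 1, 0.8, 0.9, 1, 0.2, 1, 0.7, 0.9, 0.8)
)

# setorder() reorders its argument in place, so sort a copy to keep temp intact
sorted <- setorder(copy(temp), -prob)

# Collapse paper_id within each author, in the sorted order
res <- sorted[, list(paper_id = paste0(paper_id, collapse = ", ")), by = author_id]
print(res)

# temp still has its original row order
print(temp$paper_id[1:4])
```

The copy costs extra memory, so on large tables sorting in place (as in the answer above) may still be preferable when the original order does not matter.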

As a side note, the whole thing can also easily be done in base R (though this is not recommended for large datasets):

aggregate(paper_id ~ author_id, temp[order(-temp$prob), ], paste, collapse = ", ")
#   author_id                                       paper_id
# 1       731 24943, 24943, 688974, 1267992, 1201905, 964345
# 2       736              6889, 1201905, 126992, 94345, 249
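
To make the two steps in the base R approach explicit, here is a small sketch that sorts first and then collapses per author. It uses tapply instead of aggregate purely for illustration, so the result is a named character vector rather than a data frame; the names sorted and res are hypothetical.

```r
# The question's data, rebuilt so the sketch is self-contained
temp <- read.table(header = TRUE, text =
"author_id paper_id prob
731 24943 1
731 24943 1
731 688974 1
731 964345 .8
731 1201905 .9
731 1267992 1
736 249 .2
736 6889 1
736 94345 .7
736 1201905 .9
736 126992 .8")

# Step 1: sort rows by descending prob; rows with equal prob keep their original order
sorted <- temp[order(-temp$prob), ]

# Step 2: paste the paper_id values together within each author
res <- tapply(sorted$paper_id, sorted$author_id, paste, collapse = ", ")
print(res)
```

Rows with equal prob (for example the four papers of author 731 with prob 1) stay in the order in which they appear in the input.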

  • +1, or the slightly simpler `data.table(df)[order(-prob),paste0(paper_id,collapse =","),by = author_id]` (4 upvotes)

had*_*ley 6

To complete the set, here's a dplyr answer:

df  <- read.table(header = T, text =
"author_id paper_id prob
731 24943 1
731 24943 1
731 688974 1
731 964345 .8
731 1201905 .9
731 1267992 1
736 249 .2
736 6889 1
736 94345 .7
736 1201905 .9
736 126992 .8") # your dataset

library(dplyr)
df %>%
  group_by(author_id) %>%
  arrange(desc(prob)) %>%
  summarise(paper_id = paste(paper_id, collapse = ", "))

## Source: local data frame [2 x 2]
## 
##   author_id                                       paper_id
## 1       731 24943, 24943, 688974, 1267992, 1201905, 964345
## 2       736              6889, 1201905, 126992, 94345, 249
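
In recent dplyr versions, arrange() ignores grouping unless .by_group = TRUE is set, so the pipeline above sorts the whole data frame and relies on summarise() keeping that row order within each group. A variant that does not depend on this, offered as a sketch rather than the answer's own method, is to order inside summarise(). The data is rebuilt here so the block runs on its own, and res is an illustrative name.

```r
library(dplyr)

# The question's data, rebuilt so the sketch is self-contained
df <- data.frame(
  author_id = c(731L, 731L, 731L, 731L, 731L, 731L, 736L, 736L, 736L, 736L, 736L),
  paper_id  = c(24943L, 24943L, 688974L, 964345L, 1201905L, 1267992L,
                249L, 6889L, 94345L, 1201905L, 126992L),
  prob      = c(1, 1, 1, 0.8, 0.9, 1, 0.2, 1, 0.7, 0.9, 0.8)
)

res <- df %>%
  group_by(author_id) %>%
  # order() is evaluated per group, so each author's papers are sorted by descending prob
  summarise(paper_id = paste(paper_id[order(-prob)], collapse = ", "))

print(res)
```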