R: Merge text documents by index

d12*_*12n 3 r text-mining

I have a data frame that looks like this:

_________________id ________________text______
    1   | 7821             | "some text here"
    2   | 7821             |  "here as well"
    3   | 7821             |  "and here"
    4   | 567              |   "etcetera"
    5   | 567              |    "more text"
    6   | 231              |   "other text"

I would like to group the text by id, so that I can run a clustering algorithm on it:

________________id___________________text______
    1   | 7821             | "some text here here as well and here"
    2   | 567              |   "etcetera more text"
    3   | 231              |   "other text"

Is there a way to do this? I'm importing from a database table and I have a lot of data, so I can't do it by hand.

A5C*_*2T1 10

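You're actually looking for aggregate, not merge, and there should be plenty of examples on SO demonstrating the different aggregation options. Here is the most basic and direct approach, using the formula method to specify which columns to aggregate.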

Here's your data in a form that can be copied and pasted:

mydata <- structure(list(id = c(7821L, 7821L, 7821L, 567L, 567L, 231L), 
    text = structure(c(6L, 3L, 1L, 2L, 4L, 5L), .Label = c("and here", 
    "etcetera", "here as well", "more text", "other text", "some text here"
    ), class = "factor")), .Names = c("id", "text"), class = "data.frame", 
    row.names = c(NA, -6L))

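Here's the aggregated output. In the formula text ~ id, the left-hand side is the column to collapse and the right-hand side is the grouping column; note that the result comes back sorted by id.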

aggregate(text ~ id, mydata, paste, collapse = " ")
#     id                                 text
# 1  231                           other text
# 2  567                   etcetera more text
# 3 7821 some text here here as well and here
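If you'd rather not use the formula method, the same grouping can be sketched with the by = interface of aggregate. This is only an illustration under a couple of assumptions: the data is rebuilt here with plain character text instead of a factor, the variable name merged is made up for the example, and the last step (restoring the order in which each id first appears) is optional.

```r
# Example data rebuilt with character text rather than a factor (an assumption
# made for this sketch; paste() handles a factor column the same way).
mydata <- data.frame(
  id = c(7821L, 7821L, 7821L, 567L, 567L, 231L),
  text = c("some text here", "here as well", "and here",
           "etcetera", "more text", "other text"),
  stringsAsFactors = FALSE
)

# Non-formula interface: the first argument is the column to collapse,
# 'by' is a named list of grouping columns, and extra arguments
# (collapse = " ") are passed through to FUN.
merged <- aggregate(mydata$text, by = list(id = mydata$id),
                    FUN = paste, collapse = " ")

# The collapsed column is called "x" by default; rename it to "text".
names(merged)[names(merged) == "x"] <- "text"

# aggregate() returns groups sorted by id; optionally put them back in the
# order in which each id first appears in the input.
merged <- merged[match(unique(mydata$id), merged$id), ]
rownames(merged) <- NULL

print(merged)
```

The within-group order of the text is kept, so each merged document reads in the same order as the original rows.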

Of course, there's also data.table, which has a nice compact syntax (and awesome speed):

> library(data.table)
> DT <- data.table(mydata)
> DT[, paste(text, collapse = " "), by = "id"]
     id                                   V1
1: 7821 some text here here as well and here
2:  567                   etcetera more text
3:  231                           other text
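The collapsed column above gets the default name V1. A small variation, sketched here on the assumption that the data.table package is installed, names the result column directly by wrapping the expression in .() (data.table's alias for list()). The variable name res is made up for the example.

```r
library(data.table)

# Same example data, built directly as a data.table with character text
# (an assumption for this sketch; the factor version works the same way).
DT <- data.table(
  id = c(7821L, 7821L, 7821L, 567L, 567L, 231L),
  text = c("some text here", "here as well", "and here",
           "etcetera", "more text", "other text")
)

# .(text = ...) names the collapsed column "text" instead of the default V1.
# Groups come back in order of first appearance of each id.
res <- DT[, .(text = paste(text, collapse = " ")), by = id]

print(res)
```

Unlike aggregate, grouping with by here does not sort the result by id; use keyby = id instead if a sorted result is wanted.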