Pivoting a CSV file using R

his*_*eim 2 csv r data-manipulation traminer

I have a file that looks like this:

                 type          created_at repository_name
1         IssuesEvent 2012-03-11 06:48:31       bootstrap
2         IssuesEvent 2012-03-11 06:48:31       bootstrap
3         IssuesEvent 2012-03-11 06:48:31       bootstrap
4         IssuesEvent 2012-03-11 06:52:50       bootstrap
5         IssuesEvent 2012-03-11 06:52:50       bootstrap
6         IssuesEvent 2012-03-11 06:52:50       bootstrap
7   IssueCommentEvent 2012-03-11 07:03:57       bootstrap
8   IssueCommentEvent 2012-03-11 07:03:57       bootstrap
9   IssueCommentEvent 2012-03-11 07:03:57       bootstrap
10        IssuesEvent 2012-03-11 07:03:58       bootstrap
11        IssuesEvent 2012-03-11 07:03:58       bootstrap
12        IssuesEvent 2012-03-11 07:03:58       bootstrap
13         WatchEvent 2012-03-11 07:15:44       bootstrap
14         WatchEvent 2012-03-11 07:15:44       bootstrap
15         WatchEvent 2012-03-11 07:15:44       bootstrap
16         WatchEvent 2012-03-11 07:18:45        hogan.js
17         WatchEvent 2012-03-11 07:18:45        hogan.js
18         WatchEvent 2012-03-11 07:18:45        hogan.js

The dataset I am working with is available at https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/twitter_events_mini.csv.

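I want to create a table with one column for each distinct entry in the "repository_name" column (e.g. bootstrap, hogan.js). Each new column should hold the values from the "type" column that correspond to that entry (i.e. only the "type" values from rows whose "repository_name" is "bootstrap" should go into the new "bootstrap" column). Hence: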

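  • The timestamps are only used for ordering and do not need to be aligned across rows (in fact they can be dropped, since the data is already sorted by timestamp)
  • Even if "IssuesEvent" is repeated 10 times, I need to keep all of them, because I will be doing sequence analysis with the R package TraMineR
  • The columns may be of unequal length
  • There is no relationship between the columns of different repos ("repository_name")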

In other words, I want a table that looks like this:

     bootstrap            hogan.js
1    IssuesEvent          PushEvent
2    IssuesEvent          IssuesEvent
3    IssueCommentEvent    WatchEvent

How can I accomplish this in R?

Some of my failed attempts using the reshape package can be found at https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/reshaping_bigqueries.R.

小智 5

I just joined Stack Overflow; I hope my answer is of some use.

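By "table", I assume you mean a data frame. However, the columns are unlikely to have the same length, and the rows do not seem to carry much meaning anyway. Perhaps a list would be better?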

Here is a messy solution:

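# Unique repository names, in order of first appearance
names <- unique(olddataframe$repository_name)
# For each repository, collect the `type` values of its rows
results <- sapply(1:length(names), function(j){
    sapply(which(olddataframe$repository_name == names[j]), function(i){
        olddataframe$type[i]
    })
})
# Label each element with its repository name
names(results) <- names
results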


flo*_*del 5

Your sample data:

data <- structure(list(type = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("IssueCommentEvent", 
"IssuesEvent", "WatchEvent"), class = "factor"), created_at = structure(c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 
6L), .Label = c("2012-03-11 06:48:31", "2012-03-11 06:52:50", 
"2012-03-11 07:03:57", "2012-03-11 07:03:58", "2012-03-11 07:15:44", 
"2012-03-11 07:18:45"), class = "factor"), repository_name = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L), .Label = c("bootstrap", "hogan.js"), class = "factor")), .Names = c("type", 
"created_at", "repository_name"), class = "data.frame", row.names = c(NA, 
-18L))

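From your expected output, I gather that when a type appears multiple times for the same created_at value, you want it only once in the output; in other words, you want to remove duplicates: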

data <- unique(data)
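Note that unique() on a data frame drops only rows that are identical in every column, so a type that recurs at a later created_at is kept. A minimal sketch with made-up toy data (the `toy` data frame and its shortened timestamps are hypothetical, not the asker's file):

```r
# Toy data (hypothetical): rows 1 and 2 are exact duplicates,
# row 3 repeats the same type at a later timestamp.
toy <- data.frame(
  type            = c("IssuesEvent", "IssuesEvent", "IssuesEvent", "WatchEvent"),
  created_at      = c("06:48:31", "06:48:31", "06:52:50", "07:15:44"),
  repository_name = "bootstrap",
  stringsAsFactors = FALSE
)

# unique() compares whole rows, so only the exact duplicate (row 2) is dropped
toy_dedup <- unique(toy)
toy_dedup$type
```

Two events of the same type that genuinely share a timestamp would also be collapsed, so whether this step is appropriate depends on how the duplicates arose.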

Then, to extract all type entries per repository_name in the order in which they appear, you can simply use:

data.split <- split(data$type, data$repository_name)
data.split
# $bootstrap
# [1] IssuesEvent       IssuesEvent       IssueCommentEvent
# [4] IssuesEvent       WatchEvent       
# Levels: IssueCommentEvent IssuesEvent WatchEvent
# 
# $hogan.js
# [1] WatchEvent
# Levels: IssueCommentEvent IssuesEvent WatchEvent

This returns a list, which is the preferred R data structure for a collection of vectors of different lengths.
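As a rough illustration of working with such a list, the sketch below splits two made-up vectors (`type` and `repo` are hypothetical stand-ins for the data frame columns), then checks each element's length and pulls one element out by name:

```r
# Hypothetical stand-ins for data$type and data$repository_name
type <- c("IssuesEvent", "WatchEvent", "IssuesEvent", "WatchEvent")
repo <- c("bootstrap", "hogan.js", "bootstrap", "bootstrap")

# One list element per repository, each keeping the original row order
events <- split(type, repo)

# Number of events per repository
lengths(events)

# Pull out a single repository's sequence by name
events[["hogan.js"]]
```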

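Edit: Now that you have provided an example of your output data, it has become clearer that your expected output is indeed a data.frame. You can convert the list above into a data.frame padded with NAs using the following function: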

list.to.df <- function(arg.list) {
   max.len  <- max(sapply(arg.list, length))
   arg.list <- lapply(arg.list, `length<-`, max.len)
   as.data.frame(arg.list)
}

df.out <- list.to.df(data.split)
df.out
#           bootstrap   hogan.js
# 1       IssuesEvent WatchEvent
# 2       IssuesEvent       <NA>
# 3 IssueCommentEvent       <NA>
# 4       IssuesEvent       <NA>
# 5        WatchEvent       <NA>
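The padding in list.to.df relies on the replacement function `length<-`: assigning a vector a length greater than its current one fills the new positions with NA. A small sketch on a made-up vector (`x` is hypothetical):

```r
x <- c("a", "b")

# Assigning a longer length pads the vector with NA
length(x) <- 4
x

# The same operation in functional form, which is what lapply() is handed
# in list.to.df
padded <- `length<-`(c("a", "b"), 4)
padded
```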

You can then save it to a file using

write.csv(df.out, file = "out.csv", quote = FALSE, na = "", row.names = FALSE)

to get exactly the same output format as the one you posted on GitHub.
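If the file needs to be read back into R later, the empty cells written by na = "" can be turned back into NA on import. A minimal round-trip sketch, assuming a small made-up data frame (`df`) and a temporary file rather than the real out.csv:

```r
# Hypothetical padded data frame, similar in shape to df.out
df <- data.frame(
  bootstrap = c("IssuesEvent", "WatchEvent"),
  hogan.js  = c("WatchEvent", NA),
  stringsAsFactors = FALSE
)

# Write with the same options as above, but to a temporary file
f <- tempfile(fileext = ".csv")
write.csv(df, file = f, quote = FALSE, na = "", row.names = FALSE)

# na.strings = "" turns the empty cells back into NA on import
df_back <- read.csv(f, na.strings = "", stringsAsFactors = FALSE)
df_back
```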