Combining frequency tables into a single data frame

lit*_*ger 6 r plyr

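I have a list in which each item is a word frequency table derived by calling "table()" on a different sample text. The tables therefore differ in length. I now want to convert the list into a single data frame in which each column is a word and each row is a sample text. Here is a dummy example of my data: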

t1<-table(strsplit(tolower("this is a test in the event of a real word file you would see many more words here"), "\\W"))

t2<-table(strsplit(tolower("Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal"), "\\W"))

t3<-table(strsplit(tolower("Ask not what your country can do for you - ask what you can do for your country"), "\\W"))

myList <- list(t1, t2, t3)

So one ends up with this structure:

> class(myList[[3]])
[1] "table"

> myList[[3]]

        ask     can country      do     for     not    what     you    your 
  2       2       2       2       2       2       1       2       2       2

I now need to convert this list (myList) into a single data frame. I thought I could do this with plyr, as is done here (http://ryouready.wordpress.com/2009/01/23/r-combining-vectors-or-data-frames-of-unequal-length-into-one-data-frame/), e.g.

library(plyr)
l <- myList
do.call(rbind.fill, l)

But it seems my "table" objects don't play nicely. I tried converting them to data frames, and also to vectors, but none of that worked.

G. *_*eck 7

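1. zoo. The zoo package has a multi-way merge function that can do this compactly. The lapply converts each component of myList to a zoo object, and then we simply merge them all: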

# optionally add nice names to the list
names(myList) <- paste("t", seq_along(myList), sep = "")

library(zoo)
fz <- function(x) with(as.data.frame(x, stringsAsFactors=FALSE), zoo(Freq, Var1))
out <- do.call(merge, lapply(myList, fz))

The above returns a multivariate zoo series in which the "times" are "a", "ago", etc. If a data frame is wanted as the result, it is just a matter of as.data.frame(out).

2. Reduce. Here is a second solution. It uses Reduce, which is in the core of R.

merge1 <- function(x, y) merge(x, y, by = 1, all = TRUE)
out <- Reduce(merge1, lapply(myList, as.data.frame, stringsAsFactors = FALSE))

# optionally add nice names
colnames(out)[-1] <- paste("t", seq_along(myList), sep = "")
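The merged result has one row per word and one column per text, while the question asks for the opposite layout. A minimal base R sketch of that last step, assuming missing words should count as 0; the toy tables `tabs` and the names `long`, `counts` and `wide` are made up for illustration and are not part of the original answer:

```r
# Toy frequency tables standing in for myList (hypothetical data)
tabs <- list(
  t1 = table(c("a", "b", "b", "c")),
  t2 = table(c("b", "c", "c", "d")),
  t3 = table(c("a", "d", "d", "d"))
)

# Outer-join the tables on the word column, as in the Reduce solution
merge1 <- function(x, y) merge(x, y, by = 1, all = TRUE)
long <- Reduce(merge1, lapply(tabs, as.data.frame, stringsAsFactors = FALSE))
colnames(long) <- c("word", names(tabs))

# Words absent from a text come back as NA; treat them as zero counts
counts <- as.matrix(long[-1])
counts[is.na(counts)] <- 0L
rownames(counts) <- long$word

# Transpose so that each row is a text and each column is a word
wide <- as.data.frame(t(counts))
wide
```

With real text split on "\\W", an empty-string token can appear as a "word", so it may be worth dropping it from each table before merging.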

3. xtabs. This one adds names to the list, then extracts the frequencies, names and groups as one long vector each, and puts them back together using xtabs:

names(myList) <- paste("t", seq_along(myList))

xtabs(Freq ~ Names + Group, data.frame(
    Freq = unlist(lapply(myList, unname)),
    Names = unlist(lapply(myList, names)),
    Group = rep(names(myList), sapply(myList, length))
))
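xtabs returns a contingency table rather than a data frame, and it already fills in 0 for words missing from a text. As a rough sketch of getting from there to the layout the question asks for (one row per text, one column per word), the formula terms can be swapped and the result passed to as.data.frame.matrix; the toy tables `tabs` and the names `long`, `xt` and `wide` are assumptions made for this example:

```r
# Toy frequency tables standing in for myList (hypothetical data)
tabs <- list(
  t1 = table(c("a", "b", "b", "c")),
  t2 = table(c("b", "c", "c", "d")),
  t3 = table(c("a", "d", "d", "d"))
)

# One long data frame: a count, a word and a text label per row
long <- data.frame(
  Freq  = unlist(lapply(tabs, as.vector), use.names = FALSE),
  Names = unlist(lapply(tabs, names), use.names = FALSE),
  Group = rep(names(tabs), sapply(tabs, length))
)

# Group first, so texts become rows and words become columns
xt <- xtabs(Freq ~ Group + Names, long)

# as.data.frame() on a table would give long format again;
# as.data.frame.matrix() keeps the wide layout
wide <- as.data.frame.matrix(xt)
wide
```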

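Benchmark

Benchmarking some of the solutions with the rbenchmark package gives the following results, which indicate that the zoo solution is the fastest on the sample data and arguably also the simplest.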

> t1<-table(strsplit(tolower("this is a test in the event of a real word file you would see many more words here"), "\\W"))
> t2<-table(strsplit(tolower("Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal"), "\\W"))
> t3<-table(strsplit(tolower("Ask not what your country can do for you - ask what you can do for your country"), "\\W"))
> myList <- list(t1, t2, t3)
> 
> library(rbenchmark)
> library(zoo)
> names(myList) <- paste("t", seq_along(myList), sep = "")
> 
> benchmark(xtabs = {
+ names(myList) <- paste("t", seq_along(myList))
+ xtabs(Freq ~ Names + Group, data.frame(
+ Freq = unlist(lapply(myList, unname)),
+ Names = unlist(lapply(myList, names)),
+ Group = rep(names(myList), sapply(myList, length))
+ ))
+ },
+ zoo = {
+ fz <- function(x) with(as.data.frame(x, stringsAsFactors=FALSE), zoo(Freq, Var1))
+ do.call(merge, lapply(myList, fz))
+ },
+ Reduce = {
+ merge1 <- function(x, y) merge(x, y, by = 1, all = TRUE)
+ Reduce(merge1, lapply(myList, as.data.frame, stringsAsFactors = FALSE))
+ },
+ reshape = {
+ freqs.list <- mapply(data.frame,Words=seq_along(myList),myList,SIMPLIFY=FALSE,MoreArgs=list(stringsAsFactors=FALSE))
+ freqs.df <- do.call(rbind,freqs.list)
+ reshape(freqs.df,timevar="Words",idvar="Var1",direction="wide")
+ }, replications = 10, order = "relative", columns = c("test", "replications", "relative"))
     test replications relative
2     zoo           10 1.000000
4 reshape           10 1.090909
1   xtabs           10 1.272727
3  Reduce           10 1.272727

ADDED: second solution.

ADDED: third solution.

ADDED: benchmark.


Gre*_*min 5

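# Turn each table into a data frame (Var1, Freq) tagged with its list index in Words
freqs.list <- mapply(data.frame,Words=seq_along(myList),myList,SIMPLIFY=FALSE,MoreArgs=list(stringsAsFactors=FALSE))
# Stack the per-text data frames into one long data frame
freqs.df <- do.call(rbind,freqs.list)
# Reshape to wide: one row per word (Var1), one Freq.<index> column per text
res <- reshape(freqs.df,timevar="Words",idvar="Var1",direction="wide")
head(res)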