Efficient counting of string values in a large data frame

Question

I have a large data frame (~600K rows) with a column of string values (links):

doc_id,link
1,http://example.com
1,http://example.com
2,http://test1.net
2,http://test2.net
2,http://test5.net
3,http://test1.net
3,http://example.com
4,http://test5.net

I want to count how many times each string value occurs in the frame. The result should look like this:

link, count
http://example.com, 3
http://test1.net, 2
http://test2.net, 1
http://test5.net, 2

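Is there an efficient way to do this in R? Because of the frame's size, converting it to a matrix does not work. I am currently using the plyr package, but it is too slow.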

Answer 1

Tom*_*mmy 5

The table function counts occurrences, and it is very fast compared to ddply. So, something like this perhaps:

# some sample data
set.seed(42)
df <- data.frame(doc_id=1:10, link=sample(letters[1:3], 10, replace=TRUE))

cnt <- as.data.frame(table(df$link))
# Assign appropriate names (optional)
names(cnt) <- c("link", "count")
cnt

This gives the following output:

  link count
1    a     2
2    b     3
3    c     5
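
For reference, here is a minimal sketch of the same approach applied to the sample rows from the question rather than to random letters. The data frame is retyped by hand from the question's listing, and the stringsAsFactors arguments are an addition made here so that the link column stays character in any R version; neither is part of the original answer.

```r
# Sample data retyped from the question (8 rows, 4 distinct links)
df <- data.frame(
  doc_id = c(1, 1, 2, 2, 2, 3, 3, 4),
  link = c("http://example.com", "http://example.com",
           "http://test1.net", "http://test2.net", "http://test5.net",
           "http://test1.net", "http://example.com", "http://test5.net"),
  stringsAsFactors = FALSE
)

# table() tallies each distinct value of the column;
# as.data.frame() turns the tally into a two-column data frame
cnt <- as.data.frame(table(df$link), stringsAsFactors = FALSE)

# Rename the default columns (Var1, Freq) to match the desired result
names(cnt) <- c("link", "count")
cnt
```

The resulting counts are 3, 2, 1 and 2 for http://example.com, http://test1.net, http://test2.net and http://test5.net, which matches the result requested in the question.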
