Jos*_*hua 10 grouping r dataframe
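I have a dataset that looks like this toy example. The data describes the location a person moved to and the time since that move took place. For example, person 1 started out in a rural area, moved to a city 463 days ago (row 2), then moved from that city to a town 415 days ago (row 3), and so on.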
set.seed(123)
df <- as.data.frame(sample.int(1000, 10))
colnames(df) <- "time"
df$destination <- as.factor(sample(c("city", "town", "rural"), size = 10, replace = TRUE, prob = c(.50, .25, .25)))
df$user <- sample.int(3, 10, replace = TRUE)
df <- df[order(df[, "user"], -df[, "time"]), ]  # sort by user, oldest move first
df
Data:
time destination user
526 rural 1
463 city 1
415 town 1
299 city 1
179 rural 1
938 town 2
229 town 2
118 city 2
818 city 3
195 city 3
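I would like to summarize this data into the following format: count each type of relocation across all users and collect the counts in a matrix. How can I achieve this (preferably without writing loops)?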
from to count
city city 1
city town 1
city rural 1
town city 2
town town 1
town rural 0
rural city 1
rural town 0
rural rural 0
One possible way, based on the data.table package:
library(data.table)
cases <- unique(df$destination)
setDT(df)[, .(from=destination, to=shift(destination, -1)), by=user
][CJ(from=cases, to=cases), .(count=.N), by=.EACHI, on=c("from", "to")]
# from to count
# <char> <char> <int>
# 1: city city 1
# 2: city rural 1
# 3: city town 1
# 4: rural city 1
# 5: rural rural 0
# 6: rural town 0
# 7: town city 2
# 8: town rural 0
# 9: town town 1
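To make the two steps above easier to follow, here is a minimal sketch that splits them apart. It assumes the rows are already sorted by user and then by decreasing time, and it uses a hard-coded copy of the example data (the names `dt`, `moves`, `grid` and `counts` are only illustrative) so the result does not depend on the random number generator.

```r
library(data.table)

# Hard-coded copy of the sorted example data (assumption: oldest move first per user)
dt <- data.table(
  user = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  destination = c("rural", "city", "town", "city", "rural",
                  "town", "town", "city", "city", "city")
)

# Step 1: pair each location with the next one for the same user.
# type = "lead" is equivalent to shift(destination, -1); each user's last row gets NA.
moves <- dt[, .(from = destination, to = shift(destination, type = "lead")), by = user]

# Step 2: build every from/to combination and count the matching rows for each one.
# With by = .EACHI, combinations without a match get a count of 0.
cases <- unique(dt$destination)
grid <- CJ(from = cases, to = cases)
counts <- moves[grid, .(count = .N), by = .EACHI, on = c("from", "to")]
counts
```

The rows where `to` is `NA` (each user's most recent location) never match a row of `grid`, so they drop out of the counts without an explicit filter.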
Here is a data.table option:
setDT(df)[
,
setNames(
rev(data.frame(embed(as.character(destination), 2))),
c("from", "to")
), user
][, count := .N, .(from, to)][]
This gives:
user from to count
1: 1 rural city 1
2: 1 city town 1
3: 1 town city 2
4: 1 city rural 1
5: 2 town town 1
6: 2 town city 2
7: 3 city city 1
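In case the `embed()` step is unclear, here is a small base R sketch of what it does for a single user. The vector `x` is user 1's locations from the example data; the other names are only illustrative.

```r
# User 1's locations in chronological order (oldest first)
x <- c("rural", "city", "town", "city", "rural")

# embed(x, 2) places each element next to its predecessor:
# column 1 holds x[t], column 2 holds x[t - 1]
pairs <- embed(x, 2)

# Reversing the columns puts the earlier location first, giving from/to pairs
moves <- setNames(rev(data.frame(pairs)), c("from", "to"))
moves
```

Note that the result above keeps one row per move and repeats the count on each row, so combinations that never occur (such as rural to town) do not appear with a count of 0.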
Here is a tidyverse solution:
library(dplyr)
library(purrr)
df %>%
group_split(user) %>%
map_dfr(~ bind_cols(as.character(.x[["destination"]][-nrow(.x)]),
as.character(.x[["destination"]][-1])) %>%
set_names("from", "to")) %>%
group_by(from, to) %>%
count()
# A tibble: 6 x 3
# Groups: from, to [6]
from to n
<chr> <chr> <int>
1 city city 1
2 city rural 1
3 city town 1
4 rural city 1
5 town city 2
6 town town 1
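The output above only lists the combinations that actually occur. If the zero-count rows from the question are needed as well, one possible extension (not part of the answer itself) is `tidyr::complete()`. The sketch below assumes the from/to pairs are already available; `moves` is a hard-coded stand-in for them and `locations` is an assumed level order.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the from/to pairs produced by the pipeline above
moves <- tibble(
  from = c("rural", "city", "town", "city", "town", "town", "city"),
  to   = c("city", "town", "city", "rural", "town", "city", "city")
)
locations <- c("city", "town", "rural")  # assumed order, taken from the question

counts <- moves %>%
  # Factors with explicit levels tell complete() about combinations that never occur
  mutate(from = factor(from, levels = locations),
         to   = factor(to, levels = locations)) %>%
  count(from, to, name = "count") %>%
  # Add the missing from/to combinations with a count of 0
  complete(from, to, fill = list(count = 0L))
counts
```

Using factors rather than plain character columns is a deliberate choice here: with character columns, `complete()` can only expand over values that appear in the data.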
Here is a dplyr-only solution:
`lead` identifies the `to` location for each `from`, and `paste0` combines the pair into a `helper` column.
`add_count` adds an `n` column.
df %>%
group_by(user) %>%
rename(from = destination) %>%
mutate(to = lead(from), .before=3) %>%
mutate(helper = paste0(from, to)) %>%
filter(!is.na(to)) %>%
group_by(helper) %>%
add_count(helper, from, to) %>%
ungroup() %>%
select(user, from, to, n)
Output:
user from to n
<int> <fct> <fct> <int>
1 1 rural city 1
2 1 city town 1
3 1 town city 2
4 1 city rural 1
5 2 town town 1
6 2 town city 2
7 3 city city 1
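Since the question asks for a matrix that includes the zero counts, here is a rough base R sketch of one way to get there with `table()`. It is not part of the answer above. It assumes the rows are already sorted by user with the oldest move first, and uses a hard-coded copy of the example data; the names `same_user` and `move_matrix` are only illustrative.

```r
# Hard-coded copy of the sorted example data (assumption: oldest move first per user)
df <- data.frame(
  user = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  destination = factor(c("rural", "city", "town", "city", "rural",
                         "town", "town", "city", "city", "city"),
                       levels = c("city", "town", "rural"))
)

# Pair every row with the row that follows it, keeping only pairs within one user
n <- nrow(df)
same_user <- df$user[-n] == df$user[-1]
from <- df$destination[-n][same_user]
to   <- df$destination[-1][same_user]

# table() keeps unused factor levels, so moves that never occur show up as 0
move_matrix <- table(from = from, to = to)
move_matrix

# Long format, as in the question
as.data.frame(move_matrix, responseName = "count")
```

Because the pairing relies on row adjacency, the data must be ordered before this is run; unsorted input would silently produce wrong pairs.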