我目前的数据如下:
Person Team
10 100
11 100
12 100
10 200
11 200
14 200
15 200
Run Code Online (Sandbox Code Playgroud)
我想根据他们在一起的队伍来推断出彼此认识的人.我还想要计算一个团队在一个团队中的次数,我想跟踪链接每对人的团队识别码.换句话说,我想创建一个如下所示的数据集:
Person1 Person2 Count Team1 Team2 Team3
10 11 2 100 200 NA
10 12 1 100 NA NA
11 12 1 100 NA NA
10 14 1 200 NA NA
10 15 1 200 NA NA
11 14 1 200 NA NA
11 15 1 200 NA NA
Run Code Online (Sandbox Code Playgroud)
生成的数据集捕获可以根据原始数据集中概述的团队推断出的关系."Count"变量反映了一对人在一起的实例数量."Team1","Team2"和"Team3"变量列出了将每对人员彼此链接的团队ID.首先列出哪个人/团队ID与第二名相比没有区别.团队规模从2名成员到8名成员.
这是一个"data.table"解决方案,似乎可以到达你想要的地方(虽然有很多代码):
library(data.table)
dcast.data.table(
dcast.data.table(
as.data.table(d)[, combn(Person, 2), by = Team][
, ind := paste0("Person", c(1, 2))][
, time := sequence(.N), by = list(Team, ind)],
time + Team ~ ind, value.var = "V1")[
, c("count", "time") := list(.N, sequence(.N)), by = list(Person1, Person2)],
Person1 + Person2 + count ~ time, value.var = "Team")
# Person1 Person2 count 1 2
# 1: 10 11 2 100 200
# 2: 10 12 1 100 NA
# 3: 10 14 1 200 NA
# 4: 10 15 1 200 NA
# 5: 11 12 1 100 NA
# 6: 11 14 1 200 NA
# 7: 11 15 1 200 NA
# 8: 14 15 1 200 NA
Run Code Online (Sandbox Code Playgroud)
要了解上面发生的事情,这是一个循序渐进的方法:
## The following would be a long data.table with 4 columns:
## Team, V1, ind, and time
step1 <- as.data.table(d)[
, combn(Person, 2), by = Team][
, ind := paste0("Person", c(1, 2))][
, time := sequence(.N), by = list(Team, ind)]
head(step1)
# Team V1 ind time
# 1: 100 10 Person1 1
# 2: 100 11 Person2 1
# 3: 100 10 Person1 2
# 4: 100 12 Person2 2
# 5: 100 11 Person1 3
# 6: 100 12 Person2 3
## Here, we make the data "wide"
step2 <- dcast.data.table(step1, time + Team ~ ind, value.var = "V1")
step2
# time Team Person1 Person2
# 1: 1 100 10 11
# 2: 1 200 10 11
# 3: 2 100 10 12
# 4: 2 200 10 14
# 5: 3 100 11 12
# 6: 3 200 10 15
# 7: 4 200 11 14
# 8: 5 200 11 15
# 9: 6 200 14 15
## Create a "count" column and a "time" column,
## grouped by "Person1" and "Person2".
## Count is for the count column.
## Time is for going to a wide format
step3 <- step2[, c("count", "time") := list(.N, sequence(.N)),
by = list(Person1, Person2)]
step3
# time Team Person1 Person2 count
# 1: 1 100 10 11 2
# 2: 2 200 10 11 2
# 3: 1 100 10 12 1
# 4: 1 200 10 14 1
# 5: 1 100 11 12 1
# 6: 1 200 10 15 1
# 7: 1 200 11 14 1
# 8: 1 200 11 15 1
# 9: 1 200 14 15 1
## The final step of going wide
out <- dcast.data.table(step3, Person1 + Person2 + count ~ time,
value.var = "Team")
out
# Person1 Person2 count 1 2
# 1: 10 11 2 100 200
# 2: 10 12 1 100 NA
# 3: 10 14 1 200 NA
# 4: 10 15 1 200 NA
# 5: 11 12 1 100 NA
# 6: 11 14 1 200 NA
# 7: 11 15 1 200 NA
# 8: 14 15 1 200 NA
Run Code Online (Sandbox Code Playgroud)