Select the first new value in a column

Bra*_*ino 0 r unique

I have a dataset of 351080 observations that (transposed) looks like this:

Subject     1 1 1 2 2 3 3 3 3  
nationality G G G D D S S S S  

With:

table(dat$Nationality)

R only returns the total number of observations. How can I tell R to pick just one nationality per subject?

Ben*_*ker 5

Construct the data:

dat <- data.frame(Subject = rep(1:3, each=3),
                  Nationality = rep(c("G","D","S"), each=3))

Try this:

with(dat,table(tapply(as.character(Nationality),
                      list(Subject),head,n=1)))
## D G S 
## 1 1 1 
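  • with() evaluates the expression within the scope of the data frame, to avoid typing dat$ all the time.
  • tapply() runs the specified function (head) on each chunk of the vector (Nationality), split by the groups (list(Subject)), with optional arguments (n=1 takes only the first element).
  • as.character() is ugly, but it stops R from converting the factor to its numeric codes.
  • table() computes the table.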
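To make the intermediate steps visible, here is a minimal sketch on the unbalanced example from the question (3, 2 and 4 rows per subject) rather than the balanced data constructed above. The name `first_nat` is introduced here purely for illustration.

```r
# Example data as listed in the question
dat <- data.frame(Subject = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                  Nationality = c("G", "G", "G", "D", "D", "S", "S", "S", "S"))

# Tabulating the column directly counts rows (observations), not subjects
table(dat$Nationality)

# tapply() splits Nationality by Subject and keeps the first value of each group,
# giving one value per subject, labelled by subject
first_nat <- with(dat, tapply(as.character(Nationality), list(Subject), head, n = 1))
first_nat

# Tabulating that result counts each subject exactly once
table(first_nat)
```

This assumes nationality is constant within a subject, so taking the first row loses nothing.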

You could also try this:

library("dplyr")
d2 <- dat %>% group_by(Subject) %>%
              summarise(Nationality=head(Nationality,1))
table(d2$Nationality)

Testing speed:

n <- 351078 ## divisible by 3, for convenience
set.seed(101)
nat <- sample(c("G","D","S"),size=n/3,replace=TRUE)
dat <- data.frame(Subject = rep(1:(n/3),each=3),
                  Nationality = rep(nat,each=3))
system.time(tab <- with(dat,table(tapply(as.character(Nationality),
                      list(Subject),head,n=1))))

This takes about 1.9 seconds on my machine ...

On the other hand,

system.time(tab2 <- with(dat,table(Nationality[!duplicated(Subject)])))

takes about 0.02(!) seconds ...
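The gap is presumably because `duplicated()` makes one vectorized pass over `Subject`, while `tapply()` calls `head()` once per subject. A minimal sketch on the question's example data shows what the logical index looks like and checks that both approaches agree. The name `keep` is introduced here for illustration, and the approach assumes the first row for each subject carries the nationality you want.

```r
# Example data as listed in the question
dat <- data.frame(Subject = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                  Nationality = c("G", "G", "G", "D", "D", "S", "S", "S", "S"))

# duplicated() flags every repeat of a value already seen, so its negation
# marks the first row of each subject
keep <- !duplicated(dat$Subject)
keep

# Index Nationality with that logical vector, then tabulate
tab2 <- table(dat$Nationality[keep])
tab2

# Cross-check against the tapply() approach
tab <- with(dat, table(tapply(as.character(Nationality), list(Subject), head, n = 1)))
stopifnot(all(names(tab) == names(tab2)),
          all(as.vector(tab) == as.vector(tab2)))
```

Note that `!duplicated()` does not require the rows to be sorted by subject; it keeps the first occurrence wherever it appears.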