涉及行特定元素和整个组元素的特定组计算

Question

涉及行特定元素和整个组元素的特定组计算

and*_*ewH 4 iteration dictionary r dplyr tidyverse

将这个问题的逻辑与的逻辑相匹配时，我遇到了一些麻烦dplyr。通常，如果要将组减少为每个组一个数字，请使用summarise；如果要为每行计算一个单独的数字，请使用mutate。但是，如果要对每一行的组进行计算怎么办？

在下面的示例中，mloc包含一个指向的指针pnum，目标是添加一个新列nm_child，该列针对每一行计算mloc指向组内行（即具有与该行相同的值）的组内值的数量pnum中的索引。如果使用嵌套循环，或者map如果我知道如何对每个组进行迭代1），按每个元素进行2）和3）将映射输出作为组中的列返回，则将很容易做到这一点。

library(tidyverse)

ser    <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
pnum   <- c(1:5, 1:6)
mloc   <- c(0, 2, 2, 0, 3, 1, 1, 0, 0, 3, 4)

tb1 <- tibble(ser,pnum,  mloc)
tb2 <- tb1 %>%
group_by(ser) %>%
mutate(nm_child = sum(pnum == mloc))

Run Code Online (Sandbox Code Playgroud)

上面的值nm_child始终=1。我明白了为什么它不起作用，但是我看不出为什么它能起作用。

我也试过

mutate(nm_child = count(pnum == mloc))

Run Code Online (Sandbox Code Playgroud)

（返回

no applicable method for 'groups' applied to an object of class "logical")

Run Code Online (Sandbox Code Playgroud)

以及其他各种东西。我确实通过添加几列中间值并使用一堆嵌套的ifelse（）来完成一件事，但是在我的900万行上花费了20多分钟的时间-与之相反，例如，回归分析和大多数简单的dplyr操作，这些操作在几秒钟到很短的时间内变化不大。

所需的输出：

tb2$nm_child = c(0, 2, 1, 0, 0, 2, 0, 1, 1, 0, 0)

Run Code Online (Sandbox Code Playgroud)

Answer 1

the*_*ail 5

这是的汇总ser + mloc，然后左联接返回原始数据。无需遍历每个值：

tb1 %>%
  group_by(ser, mloc) %>%
  summarise(nm_child=n()) %>%
  left_join(tb1, ., by=c("ser"="ser","pnum"="mloc"))

## A tibble: 11 x 4
#     ser  pnum  mloc nm_child
#   <dbl> <dbl> <dbl>    <int>
# 1  1.00  1.00  0          NA
# 2  1.00  2.00  2.00        2
# 3  1.00  3.00  2.00        1
# 4  1.00  4.00  0          NA
# 5  1.00  5.00  3.00       NA
# 6  2.00  1.00  1.00        2
# 7  2.00  2.00  1.00       NA
# 8  2.00  3.00  0           1
# 9  2.00  4.00  0           1
#10  2.00  5.00  3.00       NA
#11  2.00  6.00  4.00       NA

Run Code Online (Sandbox Code Playgroud)

这将更加有效：

# big example
tb1 <- tb1[rep(1:11,5e4),]
tb1$ser <- rep(1:1e5, rep(5:6,5e4))

system.time({
tb1 %>% 
  group_by(ser) %>% 
  mutate(
    nm_child = sapply(pnum, function(x) sum(x == mloc))
  )
})
#   user  system elapsed 
#   8.83    0.06    8.97     

system.time({
tb1 %>%
  group_by(ser, mloc) %>%
  summarise(nm_child=n()) %>%
  left_join(tb1, ., by=c("ser"="ser","pnum"="mloc"))
})
#   user  system elapsed 
#   0.67    0.02    0.69

Run Code Online (Sandbox Code Playgroud)

在基本R逻辑中，这将类似于：

tabu <- aggregate(cbind(nm_child=mloc) ~ ser + mloc, tb1, FUN=length)
merge(tb1, tabu, by.x=c("ser","pnum"), by.y=c("ser","mloc"), all.x=TRUE)

Run Code Online (Sandbox Code Playgroud)

并将其四舍五入data.table，这将再快一个数量级：

tb1[tb1[, .N, by=.(ser,mloc)], on=c("ser","pnum"="mloc"), nm_child := N]

Run Code Online (Sandbox Code Playgroud)

归档时间：	6 年，9 月前
查看次数：	87 次
最近记录：	6 年，9 月前