Sum values in R based on a column value using dplyr

sta*_*uyz 9 r dataframe dplyr

I have a dataset with the following information:

Subject    Value1    Value2    Value3      UniqueNumber
001        1         0         1           3
002        0         1         1           2
003        1         1         1           1

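If the value of UniqueNumber is > 0, I would like to use dplyr to sum the values for each subject across columns 1 through UniqueNumber (that is, the first UniqueNumber "Value" columns) and calculate the mean. So for Subject 001, sum = 2 and mean = .67.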

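value_cols <- c("Value1", "Value2", "Value3")
total <- rep(NA_real_, nrow(Data))
average <- rep(NA_real_, nrow(Data))
for (i in seq_len(nrow(Data))) {
  n <- Data$UniqueNumber[i]
  if (n > 0) {
    # use only the first n Value columns of row i
    vals <- unlist(Data[i, value_cols[seq_len(n)]])
    total[i] <- sum(vals)
    average[i] <- mean(vals)
  }
}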

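Edit: I only want to look across the number of columns listed in the "UniqueNumber" column. So this loops through each row and stops at the column listed in "UniqueNumber". Example: row 2 with Subject 002 should sum the values in columns "Value1" and "Value2", while row 3 with Subject 003 should only sum the value in column "Value1".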

Dav*_*urg 9

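Not a tidyverse fan/expert, but I would try this using long format. Then just filter by row index within each group and run whatever function you want on a single column (this is much easier).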

library(tidyr)
library(dplyr)

Data %>% 
  gather(variable, value, -Subject, -UniqueNumber) %>% # long format
  group_by(Subject) %>% # group by Subject in order to get row counts
  filter(row_number() <= UniqueNumber) %>% # filter by row index
  summarise(Mean = mean(value), Total = sum(value)) %>% # do the calculations
  ungroup() 

## A tibble: 3 x 3
#  Subject  Mean Total
#     <int> <dbl> <int>
# 1       1 0.667     2
# 2       2 0.5       1
# 3       3 1         1
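As a side note, `gather()` is superseded in recent tidyr releases, so the same long-format idea can be sketched with `pivot_longer()`. This is only an illustration under some assumptions: tidyr >= 1.0 and dplyr >= 1.0 are installed, `Data` is rebuilt here from the table in the question, and `Subject` is stored as character to keep the leading zeros.

```r
library(dplyr)
library(tidyr)

# Example data copied from the question (Subject as character is an assumption)
Data <- data.frame(
  Subject      = c("001", "002", "003"),
  Value1       = c(1, 0, 1),
  Value2       = c(0, 1, 1),
  Value3       = c(1, 1, 1),
  UniqueNumber = c(3, 2, 1)
)

result <- Data %>%
  # long format: one row per Subject/Value column, in column order
  pivot_longer(starts_with("Value"), names_to = "variable", values_to = "value") %>%
  group_by(Subject) %>%
  # keep only the first UniqueNumber values of each subject
  filter(row_number() <= UniqueNumber) %>%
  summarise(Mean = mean(value), Total = sum(value), .groups = "drop")

print(result)
```

Note that a subject with `UniqueNumber` equal to 0 would have all of its rows filtered out and would therefore not appear in the result at all.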

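A very similar way of achieving this is to filter by the integer in the column names. The filter step comes before the group_by, so it could potentially improve performance (or not?), but it is less robust, since I'm assuming that the columns of interest are called "Value#".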

Data %>% 
  gather(variable, value, -Subject, -UniqueNumber) %>% #long format
  filter(as.numeric(gsub("Value", "", variable, fixed = TRUE)) <= UniqueNumber) %>% #filter
  group_by(Subject) %>% # group by Subject
  summarise(Mean = mean(value), Total = sum(value)) %>% # do the calculations
  ungroup()

## A tibble: 3 x 3
#  Subject  Mean Total
#     <int> <dbl> <int>
# 1       1 0.667     2
# 2       2 0.5       1
# 3       3 1         1
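If the columns of interest do not follow the "Value#" pattern, one possible workaround is to take each column's position from an explicit vector of names instead of parsing the name. The sketch below assumes a hypothetical `value_cols` vector listing the columns in the intended order, and rebuilds `Data` from the question so it runs on its own.

```r
library(dplyr)
library(tidyr)

# Example data copied from the question
Data <- data.frame(
  Subject      = c("001", "002", "003"),
  Value1       = c(1, 0, 1),
  Value2       = c(0, 1, 1),
  Value3       = c(1, 1, 1),
  UniqueNumber = c(3, 2, 1)
)

# Hypothetical helper: the columns of interest, in the order they should be counted
value_cols <- c("Value1", "Value2", "Value3")

result <- Data %>%
  pivot_longer(all_of(value_cols), names_to = "variable", values_to = "value") %>%
  # position of each column within value_cols, independent of how it is named
  filter(match(variable, value_cols) <= UniqueNumber) %>%
  group_by(Subject) %>%
  summarise(Mean = mean(value), Total = sum(value), .groups = "drop")

print(result)
```

The filter still runs before the grouping, as in the approach above; only the way the column index is obtained changes.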

Just for fun, adding a data.table solution:

library(data.table)

data.table(Data) %>% 
  melt(id = c("Subject", "UniqueNumber")) %>%
  .[as.numeric(gsub("Value", "", variable, fixed = TRUE)) <= UniqueNumber,
    .(Mean = round(mean(value), 3), Total = sum(value)),
    by = Subject]

#    Subject  Mean Total
# 1:       1 0.667     2
# 2:       2 0.500     1
# 3:       3 1.000     1


Paw*_*ros 2

Check this solution:

df %>%
  gather(key, val, Value1:Value3) %>%
  group_by(Subject) %>%
  mutate(
    Sum = sum(val[c(1:(UniqueNumber[1]))]),
    Mean = mean(val[c(1:(UniqueNumber[1]))]),
  ) %>%
  spread(key, val)

Output:

 Subject UniqueNumber   Sum  Mean Value1 Value2 Value3
  <chr>          <int> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1 001                3     2 0.667      1      0      1
2 002                2     1 0.5        0      1      1
3 003                1     1 1          1      1      1

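  • How exactly does this give the correct result? When I insert random NAs into the data, this gives wrong results. For example, try inserting `NA` into "Value1" in the first row. (3 upvotes)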
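
The NA concern raised in the comment can be explored with a small sketch. It assumes that missing values should simply be skipped (`na.rm = TRUE`), which may or may not be what the asker wants. It also uses `seq_len()` rather than `1:n`, because `1:0` evaluates to `c(1, 0)` and would silently pick the first value when `UniqueNumber` is 0. `Data` is rebuilt from the question, with an `NA` placed in "Value1" of the first row as the comment suggests.

```r
library(dplyr)
library(tidyr)

# Example data from the question, with NA inserted into Value1 of the first row
Data <- data.frame(
  Subject      = c("001", "002", "003"),
  Value1       = c(NA, 0, 1),
  Value2       = c(0, 1, 1),
  Value3       = c(1, 1, 1),
  UniqueNumber = c(3, 2, 1)
)

result <- Data %>%
  pivot_longer(starts_with("Value"), names_to = "key", values_to = "val") %>%
  group_by(Subject) %>%
  summarise(
    # first UniqueNumber values of the subject, ignoring NAs (an assumption)
    Sum  = sum(val[seq_len(first(UniqueNumber))], na.rm = TRUE),
    Mean = mean(val[seq_len(first(UniqueNumber))], na.rm = TRUE),
    .groups = "drop"
  )

print(result)
```

Without `na.rm = TRUE`, the sum and mean for Subject 001 would both be `NA`. With it, the mean is taken over the non-missing values only, so the denominator for Subject 001 is 2 rather than 3.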