Summarizing counts and conditional aggregate functions on the same factors

New*_*uit 39 r dplyr

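The short version is that I'm having problems summarizing counts and aggregate functions with conditions on the same factors.

Suppose I have this dataframe: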

library(dplyr)

df = tbl_df(data.frame(
    company=c("Acme", "Meca", "Emca", "Acme", "Meca", "Emca"), 
    year=c("2011", "2010", "2009", "2011", "2010", "2013"), 
    product=c("Wrench", "Hammer", "Sonic Screwdriver", "Fairy Dust", 
              "Kindness", "Helping Hand"), 
    price=c("5.67", "7.12", "12.99", "10.99", NA, FALSE)))

This creates the following dataframe (essentially):

   company year  product             price
1    Acme  2011  Wrench              5.67
2    Meca  2010  Hammer              7.12
3    Emca  2009  Sonic Screwdriver   12.99
4    Acme  2011  Fairy Dust          10.99
5    Meca  2010  Kindness            NA
...  ...   ...   ...                 ...
n    Emca  2013  Helping Hand        FALSE

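Suppose I want to run `df <- group_by(df, company, year, product)` and get the following information in one set (i.e. a dataframe):

  1. The count of price listings in each group (including NA and FALSE)
  2. The count in each group where the price is 'NA'
  3. The average price, excluding NA and FALSE
  4. The maximum price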

    summarize(df, count = n()) #satisfies first item obviously
    

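I'm having trouble getting the others. I think I need to use the pipe operator? If so, can someone offer some guidance?

Here is what I tried. It is plainly wrong, but I don't know what to do next: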

 summarize(df,
           total.count = n(),
           count = filter(df, is.na(price)),
           avg.price = filter(df, !is.na(price), price != FALSE),
           max.price = max(filter(df, !is.na(price), price != FALSE))

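Yes, I have reviewed the documentation, and I'm sure the answer is in there, but it is probably too advanced for my current understanding. Thanks in advance!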

akr*_*run 53

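Assuming that your original dataset is similar to the one you created (i.e. with `NA` stored as `character`), you could specify `na.strings` when reading the data with `read.table`. That said, I guess `NA` values would be detected automatically.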
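As a minimal sketch of the `na.strings` idea, assuming the data arrives as text (the inline `raw` string below is a made-up stand-in for a real file), `read.table` can be told to treat both `NA` and `FALSE` as missing, so that `price` is read as numeric from the start:

```r
# Hypothetical inline data standing in for a file on disk.
raw <- "company year price
Acme 2011 5.67
Meca 2010 NA
Emca 2013 FALSE"

# na.strings lists the tokens that should be read as missing values.
prices <- read.table(text = raw, header = TRUE,
                     na.strings = c("NA", "FALSE"))

str(prices$price)         # structure of the price column
sum(is.na(prices$price))  # number of missing prices
```

With the column already numeric, the conversion step in the pipeline would no longer be needed.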

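The `price` column is a `factor` (in R versions before 4.0.0, where `data.frame` converts strings to factors by default) and needs to be converted to the `numeric` class. With `as.numeric`, all non-numeric elements (i.e. `"NA"`, `FALSE`) are coerced to `NA` with a warning.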
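To illustrate why `as.character` comes first, here is a small sketch with a made-up vector `x` (not part of the original data). Calling `as.numeric` directly on a factor returns its internal level codes rather than the printed values; for a plain character vector the extra `as.character` step is harmless.

```r
# Made-up example values; "NA" and "FALSE" are ordinary strings here.
x <- factor(c("5.67", "7.12", "NA", "FALSE"))

# Direct conversion yields the integer level codes, not the prices.
codes <- as.numeric(x)

# Converting through character parses the labels; strings that are not
# numbers become NA (the coercion warning is suppressed here).
prices <- suppressWarnings(as.numeric(as.character(x)))

print(codes)
print(prices)
```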

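library(dplyr)
df %>%
     # convert price to numeric; "NA" and "FALSE" become NA
     mutate(price=as.numeric(as.character(price))) %>%  
     group_by(company, year, product) %>%
     summarise(total.count=n(),                   # all rows in the group
               count=sum(is.na(price)),           # rows with a missing price
               avg.price=mean(price,na.rm=TRUE),  # mean of non-missing prices
               max.price=max(price, na.rm=TRUE))  # largest non-missing price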
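One thing to watch for: in the sample data each `company`/`year`/`product` combination is a single row, so a group whose only price is missing has nothing left after `na.rm=TRUE`. In that case `mean` returns `NaN`, and `max` returns `-Inf` with a warning. If `NA` is preferred for such groups, a small helper could be used instead; `safe_max` below is a hypothetical name, not part of dplyr or base R.

```r
# Hypothetical helper: maximum that returns NA when nothing is left
# after dropping missing values.
safe_max <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else max(x)
}

print(safe_max(c(5.67, NA, 10.99)))     # group with some valid prices
print(safe_max(c(NA_real_, NA_real_)))  # group with only missing prices
```

Inside `summarise`, `max.price=safe_max(price)` would then take the place of `max(price, na.rm=TRUE)`.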

data

I'm using the same dataset (except for the `...` row).

df = tbl_df(data.frame(company=c("Acme", "Meca", "Emca", "Acme", "Meca","Emca"),
 year=c("2011", "2010", "2009", "2011", "2010", "2013"), product=c("Wrench", "Hammer",
 "Sonic Screwdriver", "Fairy Dust", "Kindness", "Helping Hand"), price=c("5.67",
 "7.12", "12.99", "10.99", "NA",FALSE)))