Perform a dplyr operation only if a column exists

Question


Kon*_*rad 10 r function lazy-evaluation dataframe dplyr

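Following the discussion on conditional evaluation in dplyr, I would like to conditionally execute a step in a pipeline depending on whether the referenced column exists in the data frame passed in.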

Example

The results produced by 1) and 2) should be identical.

Existing column

# 1)
mtcars %>% 
  filter(am == 1) %>%
  filter(cyl == 4)

# 2)
mtcars %>%
  filter(am == 1) %>%
  {
    if("cyl" %in% names(.)) filter(cyl == 4) else .
  }


Missing column

# 1)
mtcars %>% 
  filter(am == 1)

# 2)    
mtcars %>%
  filter(am == 1) %>%
  {
    if("absent_column" %in% names(.)) filter(absent_column == 4) else .
  }


Problem

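For the existing column, the object being passed does not correspond to the initial data frame. The original code returns the error message: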

Error in filter(cyl == 4) : object 'cyl' not found

I tried an alternative syntax (with no luck):

mtcars %>%
  filter(am == 1) %>%
  {
    if("cyl" %in% names(.)) filter(.$cyl == 4) else .
  }
Error in UseMethod("filter_") : 
  no applicable method for 'filter_' applied to an object of class "logical"


Follow-up

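I would like to extend this question to also cover the evaluation of the right-hand side of == in the filter call. For example, the syntax below attempts to filter on the first available value.

mtcars %>%
  filter({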
    if ("does_not_ex" %in% names(.))
      does_not_ex
    else
      NULL
  } == {
    if ("does_not_ex" %in% names(.))
      unique(.[['does_not_ex']])
    else
      NULL
  })


As expected, the call evaluates to an error message:

Error in filter_impl(.data, quo) : Result must have length 32, not 0

When applied to an existing column:

mtcars %>%
  filter({
    if ("mpg" %in% names(.))
      mpg
    else
      NULL
  } == {
    if ("mpg" %in% names(.))
      unique(.[['mpg']])
    else
      NULL
  })


it runs, but with a warning message:

  mpg cyl disp  hp drat   wt  qsec vs am gear carb
1  21   6  160 110  3.9 2.62 16.46  0  1    4    4


Warning message: In { : longer object length is not a multiple of shorter object length

Follow-up question

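Is there a neat way to extend the existing syntax to get conditional evaluation on the right-hand side of the filter call, ideally while staying within the dplyr workflow?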

Answer 1

Eum*_*ies 14

Because of the way scoping works here, you cannot access the data frame from within the if statement. Fortunately, you don't need to.

Try:

mtcars %>%
  filter(am == 1) %>%
  filter({if("cyl" %in% names(.)) cyl else NULL} == 4)


Here, you can use the '.' object in the condition to check whether the column exists and, if it does, return the column to the filter function.

EDIT: As per docendo discimus' comment on the question, you can access the data frame, but not implicitly, i.e. you have to reference it explicitly using .

This solution no longer works (try changing the string "cyl" to something that does not exist). Felipe Gerard's answer is the correct one. (2 upvotes)
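
A minimal sketch of why this approach fails for a missing column, assuming a reasonably current dplyr release: the braces evaluate to NULL, NULL == 4 is a zero-length logical vector, and filter() rejects a condition of that length. The tryCatch wrapper and the "error" marker string are illustrative only; they let the failure be observed without stopping the script.

```r
library(dplyr)

# With a missing column the braces yield NULL, and NULL == 4 has length 0.
length(NULL == 4)

# filter() expects one logical value per row (or a single value),
# so a zero-length condition is an error rather than a no-op.
result <- tryCatch(
  mtcars %>%
    filter(am == 1) %>%
    filter({if ("absent_column" %in% names(.)) absent_column else NULL} == 4),
  error = function(e) "error"
)
result
```

This matches the length error reported in the question's follow-up, so the inline form is only safe when the column is known to exist.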

Answer 2

Fel*_*ard 10

I know I'm late to the party, but here is an answer that is more in line with what you originally had in mind:

mtcars %>%
  filter(am == 1) %>%
  {
    if("cyl" %in% names(.)) filter(., cyl == 4) else .
  }


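Basically, you were missing the . in filter. Note that this is needed because the pipe does not add . as the first argument of filter(expr) when the call sits inside a block delimited by {}.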
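
If the check is needed in several places, the same pattern could be wrapped in a small function. The sketch below is only an illustration: filter_if_present is a hypothetical helper name, not part of dplyr, and it assumes the column name is passed as a string and looked up through the .data pronoun.

```r
library(dplyr)

# Hypothetical helper: apply the filter only when `col` is present.
filter_if_present <- function(data, col, value) {
  if (col %in% names(data)) {
    # .data[[col]] looks the column up by its name given as a string
    filter(data, .data[[col]] == value)
  } else {
    # Column absent: return the data unchanged
    data
  }
}

with_col <- mtcars %>%
  filter(am == 1) %>%
  filter_if_present("cyl", 4)

without_col <- mtcars %>%
  filter(am == 1) %>%
  filter_if_present("absent_column", 4)
```

Because the data frame is an explicit argument here, the function also works as an ordinary pipeline step without the braces.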

Answer 3

s_p*_*ike 10

With across() in dplyr > 1.0.0, you can now use any_of when filtering. Compare with the original, which has all the columns:

mtcars %>% 
  filter(am == 1) %>% 
  filter(cyl == 4)


With cyl removed, it throws an error:

mtcars %>% 
  select(!cyl) %>% 
  filter(am == 1) %>% 
  filter(cyl == 4)


Using any_of (note that you have to write "cyl" rather than cyl):

mtcars %>% 
  select(!cyl) %>% 
  filter(am == 1) %>% 
  filter(across(any_of("cyl"), ~.x == 4))
#N.B. this is equivalent to just filtering by `am == 1`.
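
Newer dplyr releases also provide if_all() and if_any() for use inside filter(), and recent versions warn that across() inside filter() is deprecated. A sketch of the same idea with if_all(), assuming that an empty selection makes if_all() keep every row:

```r
library(dplyr)

# Column present: keeps rows where am == 1 and cyl == 4.
present <- mtcars %>%
  filter(am == 1) %>%
  filter(if_all(any_of("cyl"), ~ .x == 4))

# Column absent: any_of() selects nothing, so no rows are dropped.
absent <- mtcars %>%
  select(!cyl) %>%
  filter(am == 1) %>%
  filter(if_all(any_of("cyl"), ~ .x == 4))
```

It is worth checking the behaviour for an empty selection against the installed dplyr version before relying on it.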


Answer 4

bio*_*man 7

Avoid this pitfall:

On a busy day, one might do this:

library(dplyr)
df <- data.frame(A = 1:3, B = letters[1:3], stringsAsFactors = FALSE)
df %>% mutate(C = ifelse("D" %in% colnames(.), D, B))
# Note the values in column "C": no error is thrown, but the logic and the result are wrong.
  A B C
1 1 a a
2 2 b a
3 3 c a


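Why? Because "D" %in% colnames(.) returns only a single TRUE or FALSE value, so ifelse is evaluated only once and returns a single value. That value is then broadcast to the whole column!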

The correct way:

df %>% mutate(C = if ("D" %in% colnames(.)) D else B)
  A B C
1 1 a a
2 2 b b
3 3 c c
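
The difference can be seen in base R without any data frame. This is a small sketch with made-up vectors; short and full are illustrative names only.

```r
# ifelse() returns a result shaped like its test, so a length-one test
# yields a length-one result even when the branches are longer vectors.
short <- ifelse(FALSE, 1:3, 4:6)

# if/else returns the chosen branch unchanged, whatever its length.
full <- if (FALSE) 1:3 else 4:6

length(short)  # 1
length(full)   # 3
```

Inside mutate(), the length-one result is recycled to every row, which is why column C contains the same value throughout in the first example.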


Archived:	8 years, 7 months ago
Views:	5355
Last activity:	6 years, 11 months ago