Pass a data.frame column name to a function

kmm*_*kmm 105 r dataframe r-faq

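I'm trying to write a function that accepts a data.frame (x) and a column from it. The function performs some calculations on x and later returns another data.frame. I'm stuck on the best-practice way to pass the column name to the function.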

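The two minimal examples fun1 and fun2 below produce the desired result: they can perform operations on x$column, using max() as an example. However, both rely on the seemingly (at least to me) inelegant: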

call to substitute() and possibly eval()
need to pass the column name as a character vector.

fun1 <- function(x, column){
  do.call("max", list(substitute(x[a], list(a = column))))
}

fun2 <- function(x, column){
  max(eval((substitute(x[a], list(a = column)))))
}

df <- data.frame(B = rnorm(10))
fun1(df, "B")
fun2(df, "B")


I would like to be able to call the function as fun(df, B), for example. Other options that I have considered but not yet tried:

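Pass column as an integer giving the column number. I think this would avoid substitute(). Ideally, the function could accept either.
with(x, get(column)), but, even if it works, I think this would still require substitute().
Make use of formula() and match.call(), neither of which I have much experience with.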

Subquestion: Is do.call() preferred over eval()?

You can just use the column name directly:

df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
  max(x[,column])
}
fun1(df, "B")
fun1(df, c("B","A"))


There's no need to use substitute(), eval(), etc.

You can even pass the desired function as a parameter:

fun1 <- function(x, column, fn) {
  fn(x[,column])
}
fun1(df, "B", max)


Alternatively, [[ also works for selecting a single column at a time:

df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
  max(x[[column]])
}
fun1(df, "B")

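As a small aside on the question's idea of passing a column number: `[[` indexes a data frame by position as well as by name, so one function can accept either. The sketch below is an illustration rather than part of the answer above; the name `col_max` and its input checks are assumptions made for the example.

```r
# Hypothetical helper: accepts a column name (string) or a column position.
col_max <- function(x, column) {
  stopifnot(is.data.frame(x), length(column) == 1)
  # Fail early with a readable message when a named column is missing.
  if (is.character(column) && !column %in% names(x)) {
    stop("no column named '", column, "'")
  }
  max(x[[column]])
}

df <- data.frame(A = 1:10, B = 2:11, C = 3:12)
by_name  <- col_max(df, "B")  # select by name
by_index <- col_max(df, 2)    # select by position; B is the second column
```

Both calls read the same column, so they return the same value.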

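Is there any way to pass the column name without quoting it as a string? (11 upvotes)
Thanks! I found the `[[` solution was the only one that worked for me. (3 upvotes)
You need to pass either the column name quoted as a character string or the integer index of the column. Just passing B will assume that B is an object itself. (2 upvotes)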

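This answer covers many of the same elements as the existing answers, but this issue (passing column names to functions) comes up often enough that I wanted there to be an answer that covers things a little more comprehensively.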

Suppose we have a very simple data frame:

dat <- data.frame(x = 1:4,
                  y = 5:8)


and we'd like to write a function that creates a new column z that is the sum of columns x and y.

A very common stumbling block here is that a natural (but incorrect) attempt often looks like this:

foo <- function(df,col_name,col1,col2){
      df$col_name <- df$col1 + df$col2
      df
}

#Call foo() like this:    
foo(dat,z,x,y)


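The problem here is that df$col1 doesn't evaluate the expression col1. It simply looks for a column in df literally called col1. This behavior is described in ?Extract under the section "Recursive (list-like) objects".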
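The difference can be seen directly with a variable that holds a column name. This is a minimal sketch; the variable names `literal` and `computed` are made up for the illustration.

```r
dat <- data.frame(x = 1:4, y = 5:8)
col1 <- "x"

literal  <- dat$col1     # looks for a column literally named "col1"
computed <- dat[[col1]]  # evaluates col1 first, then looks up column "x"

is.null(literal)  # TRUE: dat has no column called "col1"
computed          # the contents of column x
```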

The simplest, and most often recommended, solution is to switch from $ to [[ and pass the function arguments as strings:

new_column1 <- function(df,col_name,col1,col2){
    #Create new column col_name as sum of col1 and col2
    df[[col_name]] <- df[[col1]] + df[[col2]]
    df
}

> new_column1(dat,"z","x","y")
  x y  z
1 1 5  6
2 2 6  8
3 3 7 10
4 4 8 12


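This is often considered "best practice" since it is the method that is hardest to screw up. Passing the column names as strings is about as unambiguous as you can get.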
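One practical consequence of the string interface is that column names can be computed or held in variables, which makes the function easy to call from other code. The loop below is only a sketch of that idea; the extra column `w`, the `col_pairs` list, and the `_plus_` naming scheme are assumptions added for the example.

```r
new_column1 <- function(df, col_name, col1, col2) {
  # Create new column col_name as the sum of col1 and col2
  df[[col_name]] <- df[[col1]] + df[[col2]]
  df
}

dat <- data.frame(x = 1:4, y = 5:8, w = 9:12)

# Column names held in variables work the same as literal strings.
col_pairs <- list(c("x", "y"), c("x", "w"), c("y", "w"))
out <- dat
for (p in col_pairs) {
  out <- new_column1(out, paste(p, collapse = "_plus_"), p[1], p[2])
}
names(out)
```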

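The following two options are more advanced. Many popular packages use these kinds of techniques, but using them well requires more care and skill, as they can introduce subtle complexities and unanticipated points of failure. Hadley's Advanced R book is an excellent reference for some of these issues.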

If you really want to save the user from typing all those quotes, one option is to convert bare, unquoted column names to strings using deparse(substitute()):

new_column2 <- function(df,col_name,col1,col2){
    col_name <- deparse(substitute(col_name))
    col1 <- deparse(substitute(col1))
    col2 <- deparse(substitute(col2))

    df[[col_name]] <- df[[col1]] + df[[col2]]
    df
}

> new_column2(dat,z,x,y)
  x y  z
1 1 5  6
2 2 6  8
3 3 7 10
4 4 8 12


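Frankly, this is probably a bit silly, since we're really doing the same thing as in new_column1, just with a bunch of extra work to convert bare names to strings.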
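One of the less obvious failure points of deparse(substitute()) shows up as soon as the function is called from another function. The sketch below isolates just the name-capturing step; `captured_name` and `wrapper` are hypothetical helpers, not part of the answer's code.

```r
# Hypothetical helper isolating the name-capturing step of new_column2.
captured_name <- function(col) deparse(substitute(col))

direct <- captured_name(x)    # "x": the bare name the caller typed

# Wrapping the helper changes what is captured: substitute() sees the
# wrapper's argument name, not the column the end user asked for.
wrapper <- function(some_col) captured_name(some_col)
indirect <- wrapper(x)        # "some_col"
```

A string-based function does not have this problem, because the wrapper simply forwards the value of its argument.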

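Finally, if we want to get really fancy, we might decide that rather than passing in the names of two columns to add, we'd like to be more flexible and allow other combinations of the two variables. In that case we'd likely resort to using eval() on an expression involving the two columns: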

new_column3 <- function(df,col_name,expr){
    col_name <- deparse(substitute(col_name))
    df[[col_name]] <- eval(substitute(expr),df,parent.frame())
    df
}


Just for fun, I'm still using deparse(substitute()) for the name of the new column. Here, all of the following will work:

> new_column3(dat,z,x+y)
  x y  z
1 1 5  6
2 2 6  8
3 3 7 10
4 4 8 12
> new_column3(dat,z,x-y)
  x y  z
1 1 5 -4
2 2 6 -4
3 3 7 -4
4 4 8 -4
> new_column3(dat,z,x*y)
  x y  z
1 1 5  5
2 2 6 12
3 3 7 21
4 4 8 32

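The third argument to eval(), parent.frame(), matters when the expression mentions something that is not a column of the data frame: such names are then looked up in the environment the function was called from. A minimal sketch, in which the wrapper `scale_x` and its argument `k` are assumptions made for the illustration:

```r
new_column3 <- function(df, col_name, expr) {
  col_name <- deparse(substitute(col_name))
  # Columns are found in df first; anything else is looked up in the caller.
  df[[col_name]] <- eval(substitute(expr), df, parent.frame())
  df
}

dat <- data.frame(x = 1:4, y = 5:8)

# k is not a column of dat, so it is found in the frame of scale_x().
scale_x <- function(k) new_column3(dat, z, x * k)
scaled <- scale_x(10)
scaled$z
```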

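So the short answer is basically: pass data.frame column names as strings and use [[ to select single columns. Only start delving into eval(), substitute(), etc. if you really know what you're doing.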

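Not sure why this is not the selected best answer. (2 upvotes)
What if I want to pass columns for tidyselect? I want to create a function that uses `pivot_longer`. My function looks like this `lineplots <- function(df, colname){ ggplot(data = df %>% pivot_longer(-colname), aes(x = colname, y = value)) + geom_point() + facet_grid(rows = vars(name), scales = "free_y") }` and it doesn't work the way I expect. (2 upvotes)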

Personally, I think that passing the column as a string is pretty ugly. I like to do something like:

get.max <- function(column,data=NULL){
    column<-eval(substitute(column),data, parent.frame())
    max(column)
}


which will yield:

> get.max(mpg,mtcars)
[1] 33.9
> get.max(c(1,2,3,4,5))
[1] 5


Notice how the specification of a data.frame is optional. You can even work with functions of your columns:

> get.max(1/mpg,mtcars)
[1] 0.09615385


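I am happy to be shown a better way, but I fail to see the difference between this and qplot(x = mpg, data = mtcars). ggplot2 never passes a column as a string, and I think it is better off for it. Why do you say that this can only be used interactively? Under what situation would it lead to undesirable results? How is it harder to program with? In the body of the post I show how it is more flexible. (27 upvotes)
7 years later: is using quotes still ugly? (13 upvotes)
You need to get out of the habit of thinking that using quotes is ugly. Not using them is ugly! Why? Because you've created a function that can only be used interactively; it's very difficult to program with it. (7 upvotes)
5 years later :-) .. Why do we need parent.frame()? (3 upvotes)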

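Another way is to use the tidy evaluation approach. It is pretty straightforward to pass columns of a data frame either as strings or as bare column names. See the tidyeval documentation for more information.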

library(rlang)
library(tidyverse)

set.seed(123)
df <- data.frame(B = rnorm(10), D = rnorm(10))


Use column names as strings

fun3 <- function(x, ...) {
  # capture strings and create variables
  dots <- ensyms(...)
  # unquote to evaluate inside dplyr verbs
  summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}

fun3(df, "B")
#>          B
#> 1 1.715065

fun3(df, "B", "D")
#>          B        D
#> 1 1.715065 1.786913


Use bare column names

fun4 <- function(x, ...) {
  # capture expressions and create quosures
  dots <- enquos(...)
  # unquote to evaluate inside dplyr verbs
  summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}

fun4(df, B)
#>          B
#> 1 1.715065

fun4(df, B, D)
#>          B        D
#> 1 1.715065 1.786913
#>


Created on 2019-03-01 by the reprex package (v0.2.1.9000)
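In dplyr 1.0.0 and later, summarise_at() is superseded by across(). The following is a sketch of how the string-based version might look with the newer interface, not a rewrite of the answer's code: `fun5` is a hypothetical name, and it takes the column names as a single character vector rather than through `...`.

```r
library(dplyr)

# Hypothetical variant of fun3: column names arrive as a character vector.
fun5 <- function(x, cols) {
  # all_of() tells tidyselect that cols holds column names as strings.
  summarise(x, across(all_of(cols), ~ max(.x, na.rm = TRUE)))
}

set.seed(123)
df <- data.frame(B = rnorm(10), D = rnorm(10))
res <- fun5(df, c("B", "D"))
res
```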

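With dplyr it is now also possible to access a specific column of a data frame by simply wrapping the desired column name in double curly braces {{...}} within the function body, e.g. for col_name: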

library(tidyverse)

fun <- function(df, col_name){
   df %>% 
     filter({{col_name}} == "test_string")
}

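A short usage sketch for the function above, together with a string-based counterpart. The data frame `d` and the function `fun_chr` are assumptions made for the example; `fun_chr` uses the `.data` pronoun, which dplyr provides for looking up a column by a name stored in a string.

```r
library(dplyr)

# Bare column name, forwarded with {{ }} (the function from above).
fun <- function(df, col_name) {
  df %>%
    filter({{ col_name }} == "test_string")
}

# Hypothetical string-based counterpart using the .data pronoun.
fun_chr <- function(df, col_name) {
  df %>%
    filter(.data[[col_name]] == "test_string")
}

d <- data.frame(label = c("test_string", "other", "test_string"), n = 1:3)
bare   <- fun(d, label)        # column given as a bare name
quoted <- fun_chr(d, "label")  # column given as a string
```

Both calls keep the first and third rows of `d`.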

Asked:	15 years, 5 months ago
Viewed:	100610 times
Last active:	6 years, 6 months ago
