Closures as a solution to a data-merging idiom

Ari*_*man 6 closures functional-programming r

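I'm trying to wrap my head around closures, and I think I've found a case where they might be helpful.

I have the following pieces to work with:

  • A set of regular expressions designed to clean state names, housed in a function
  • A data.frame containing state names (in the standardized form that the function above produces) and state ID codes, to link the two (the "merge map")

The idea is, given some data.frame with sloppy state names (is the capital listed as "Washington, D.C.", "Washington DC", "District of Columbia", etc.?), to have a single function return the same data.frame with the state name column removed and only the state ID code remaining. Subsequent merges can then happen consistently.

I can do this in any number of ways, but one way that seems particularly elegant is to house the merge map, the regular expressions, and the code that processes everything inside a closure (following the idea that a closure is a function with data).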

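Question 1: Is this a reasonable idea?

Question 2: If so, how do I do it in R?

Here's a stupidly simple clean-state-names function that works on the example data: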

cleanStateNames <- function(x) {
  x <- tolower(x)
  x[grepl("columbia",x)] <- "DC"
  x
}

Here is some example data that the final function will be run on:

dat <- structure(list(state = c("Alabama", "Alaska", "Arizona", "Arkansas", 
"California", "Colorado", "Connecticut", "Delaware", "District of Columbia", 
"Florida"), pop08 = structure(c(29L, 44L, 40L, 18L, 25L, 30L, 
22L, 48L, 36L, 13L), .Label = c("1,050,788", "1,288,198", "1,315,809", 
"1,316,456", "1,523,816", "1,783,432", "1,814,468", "1,984,356", 
"10,003,422", "11,485,910", "12,448,279", "12,901,563", "18,328,340", 
"19,490,297", "2,600,167", "2,736,424", "2,802,134", "2,855,390", 
"2,938,618", "24,326,974", "3,002,555", "3,501,252", "3,642,361", 
"3,790,060", "36,756,666", "4,269,245", "4,410,796", "4,479,800", 
"4,661,900", "4,939,456", "5,220,393", "5,627,967", "5,633,597", 
"5,911,605", "532,668", "591,833", "6,214,888", "6,376,792", 
"6,497,967", "6,500,180", "6,549,224", "621,270", "641,481", 
"686,293", "7,769,089", "8,682,661", "804,194", "873,092", "9,222,414", 
"9,685,744", "967,440"), class = "factor")), .Names = c("state", 
"pop08"), row.names = c(NA, 10L), class = "data.frame")

An example merge map (the actual one links FIPS codes to states, so it can't be trivially generated):

merge_map <- data.frame(state=dat$state, id=seq(10) )

EDIT: Building on crippledlambda's answer below, here's an attempt at the function:

prepForMerge <- local({
  merge_map <- structure(list(state = c("alabama", "alaska", "arizona", "arkansas",  "california", "colorado", "connecticut", "delaware", "DC", "florida" ), id = 1:10), .Names = c("state", "id"), row.names = c(NA, -10L ), class = "data.frame")
  list(
    replace_merge_map=function(new_merge_map) {
      merge_map <<- new_merge_map
    },
    show_merge_map=function() {
      merge_map
    },
    return_prepped_data.frame=function(dat) {
      dat$state <- cleanStateNames(dat$state)
      dat <- merge(dat,merge_map)
      dat <- subset(dat,select=c(-state))
      dat
    }
  )
})

> prepForMerge$return_prepped_data.frame(dat)
        pop08 id
1   4,661,900  1
2     686,293  2
3   6,500,180  3
4   2,855,390  4
5  36,756,666  5
6   4,939,456  6
7   3,501,252  7
8     591,833  9
9     873,092  8
10 18,328,340 10

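Two problems remain before I would consider this question solved:

  1. Calling prepForMerge$return_prepped_data.frame(dat) every time is painful. Is there any way to have a default function so that I could just call prepForMerge(dat)? I'm guessing not, given how it's implemented, but maybe there is at least a convention for a default fxn....

  2. How do I avoid mixing data and code in the merge_map definition? Ideally I'd clean merge_map elsewhere, then grab it inside the closure and store it.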

hat*_*rix 4

I may be missing the point of your question, but this is one way you could use a closure:

> replaceStateNames <- local({
+   statenames <- c("Alabama", "Alaska", "Arizona", "Arkansas", 
+                   "California", "Colorado", "Connecticut", "Delaware",
+                   "District of Columbia", "Florida")
+   function(patt,newtext) {
+     statenames <- tolower(statenames)
+     statenames[grepl(patt,statenames)] <- newtext
+     statenames
+   }
+ })
> 
> replaceStateNames("columbia","DC")
 [1] "alabama"     "alaska"      "arizona"     "arkansas"    "california" 
 [6] "colorado"    "connecticut" "delaware"    "DC"          "florida"    
> replaceStateNames("alaska","palincountry")
 [1] "alabama"              "palincountry"         "arizona"             
 [4] "arkansas"             "california"           "colorado"            
 [7] "connecticut"          "delaware"             "district of columbia"
[10] "florida"             
> replaceStateNames("florida","jebbushland")
 [1] "alabama"              "alaska"               "arizona"             
 [4] "arkansas"             "california"           "colorado"            
 [7] "connecticut"          "delaware"             "district of columbia"
[10] "jebbushland"    
> 

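More generally, you can replace statenames with a data frame definition and return a function (or a list of functions) that uses this data frame without having to pass it as an argument to the function call. Example (but note that I've used the argument ignore.case=TRUE in grepl):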

> replaceStateNames <- local({
+   statenames <- c("Alabama", "Alaska", "Arizona", "Arkansas", 
+                   "California", "Colorado", "Connecticut", "Delaware",
+                   "District of Columbia", "Florida")
+   list(justreturn=function(patt,newtext) {
+     statenames[grepl(patt,statenames,ignore.case=TRUE)] <- newtext
+     statenames
+   },reassign=function(patt,newtext) {
+     statenames <<- replace(statenames,grepl(patt,statenames,ignore.case=TRUE),newtext)
+     statenames
+   })
+ })

Just like the first example:

> replaceStateNames$justreturn("columbia","DC")
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California" 
 [6] "Colorado"    "Connecticut" "Delaware"    "DC"          "Florida"    

This just returns the value of the lexically scoped statenames, to check that the original value has not changed:

> replaceStateNames$justreturn("shouldnotmatch","anythinghere")
 [1] "Alabama"              "Alaska"               "Arizona"             
 [4] "Arkansas"             "California"           "Colorado"            
 [7] "Connecticut"          "Delaware"             "District of Columbia"
[10] "Florida"             

This does the same thing, but makes the change "permanent":

> replaceStateNames$reassign("columbia","DC")
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California" 
 [6] "Colorado"    "Connecticut" "Delaware"    "DC"          "Florida"    

Note that the value of statenames attached to these functions has changed:

> replaceStateNames$justreturn("shouldnotmatch","anythinghere")
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California" 
 [6] "Colorado"    "Connecticut" "Delaware"    "DC"          "Florida"    
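To see where the changed value lives, one can inspect the environment attached to the functions. This is only a minimal sketch: it uses a shortened, two-element statenames (an assumption to keep the example small) and the hypothetical name f in place of replaceStateNames.

```r
# Shortened version of the closure above; `f` and the two-element
# `statenames` are illustrative stand-ins, not the original data.
f <- local({
  statenames <- c("Alabama", "District of Columbia")
  list(
    justreturn = function(patt, newtext) {
      # Assignment creates a local copy; the enclosing value is untouched.
      statenames[grepl(patt, statenames, ignore.case = TRUE)] <- newtext
      statenames
    },
    reassign = function(patt, newtext) {
      # <<- rebinds statenames in the enclosing (shared) environment.
      statenames <<- replace(statenames,
                             grepl(patt, statenames, ignore.case = TRUE),
                             newtext)
      statenames
    }
  )
})

env <- environment(f$justreturn)
shared <- identical(env, environment(f$reassign))  # same environment for both?

before <- get("statenames", envir = env)
f$justreturn("columbia", "DC")
unchanged <- get("statenames", envir = env)        # value after justreturn
f$reassign("columbia", "DC")
after <- get("statenames", envir = env)            # value after reassign
```

Both functions are created inside the same local() call, so they share a single environment; that is why a change made through reassign is visible to justreturn afterwards.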

In any case, you can replace statenames with a data frame, and replace these simple functions with a "merge map" or any other mapping you want.
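As a rough sketch of that substitution, the merge map can be handed to a function factory that returns a single function, which would also allow a plain prepForMerge(dat) call and keeps the data out of the closure's definition. The factory name make_prep_for_merge, the toy data, and the made-up IDs are assumptions for illustration; cleanStateNames is copied from the question.

```r
# cleanStateNames() as given in the question.
cleanStateNames <- function(x) {
  x <- tolower(x)
  x[grepl("columbia", x)] <- "DC"
  x
}

# Hypothetical factory: returns one function that remembers merge_map.
make_prep_for_merge <- function(merge_map) {
  force(merge_map)  # evaluate the argument now rather than lazily
  function(dat) {
    dat$state <- cleanStateNames(as.character(dat$state))
    out <- merge(dat, merge_map, by = "state")
    # Drop the state name column, keeping the ID and everything else.
    out[, setdiff(names(out), "state"), drop = FALSE]
  }
}

# Toy merge map built outside the closure (IDs are made up, not FIPS codes).
toy_map <- data.frame(state = c("alabama", "DC"), id = c(1L, 9L),
                      stringsAsFactors = FALSE)
prepForMerge <- make_prep_for_merge(toy_map)

toy_dat <- data.frame(state = c("Alabama", "District of Columbia"),
                      pop = c(10, 20), stringsAsFactors = FALSE)
res <- prepForMerge(toy_dat)
```

The merge map here is prepared separately and passed in, so swapping in a different map means calling the factory again rather than editing the function body.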

Edit

Speaking of "merging", is this what you are looking for? Here is the first example from ?merge, implemented using a closure:

> authors <- data.frame(surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
+                       nationality = c("US", "Australia", "US", "UK", "Australia"),
+                       deceased = c("yes", rep("no", 4)))
> books <- data.frame(name = I(c("Tukey", "Venables", "Tierney",
+                       "Ripley", "Ripley", "McNeil", "R Core")),
+                     title = c("Exploratory Data Analysis",
+                       "Modern Applied Statistics ...",
+                       "LISP-STAT",
+                       "Spatial Statistics", "Stochastic Simulation",
+                       "Interactive Data Analysis",
+                       "An Introduction to R"),
+                     other.author = c(NA, "Ripley", NA, NA, NA, NA,
+                       "Venables & Smith"))
> 
> mergewithauthors <- with(list(authors=authors),function(books) 
+   merge(authors, books, by.x = "surname", by.y = "name"))
> 
> mergewithauthors(books)
   surname nationality deceased                         title other.author
1   McNeil   Australia       no     Interactive Data Analysis         <NA>
2   Ripley          UK       no            Spatial Statistics         <NA>
3   Ripley          UK       no         Stochastic Simulation         <NA>
4  Tierney          US       no                     LISP-STAT         <NA>
5    Tukey          US      yes     Exploratory Data Analysis         <NA>
6 Venables   Australia       no Modern Applied Statistics ...       Ripley

Edit 2

To read a file into an object that will be lexically bound, you can do

fn <- local({
  data <- read.csv("filename.csv")
  function(...) {
    ...
  }
})

or

fn <- with(list(data=read.csv("filename.csv")),
     function(...) {
       ...
     }
   )

or

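## local() returns the value of its last expression, so return the
## environment itself to make `data` visible to the function
fn <- with(local({data <- read.csv("filename.csv"); environment()}),
     function(...) {
       ...
     }
   )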

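and so on. (I'm assuming function(...) will have something to do with your "merge_map".) You could also use evalq in place of local. To "bring in" objects that reside in the global space (or an enclosing environment), you can do the following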

globalobj <- value      ## could be from read.csv()
fn <- local({
  localobj <- globalobj ## if globalobj is not locally defined, 
                        ## R will look in enclosing environment
                        ## in this case, the globalenv()
  function(...) {
    ...
  }
})

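Later modifications to globalobj will then not change the localobj attached to the function (since almost(?) everything in R follows pass-by-value semantics). You could also use with in place of local, as shown in the examples above.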
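A small sketch of that snapshot behaviour, with a named numeric vector standing in for the data and hypothetical function names fn and fn_live. One exception to the by-value rule is environments, which are shared by reference rather than copied.

```r
globalobj <- c(a = 1, b = 2)   # stand-in for data read from a file

fn <- local({
  localobj <- globalobj        # bound when local() runs
  function() localobj
})

# For contrast: this function looks globalobj up each time it is called.
fn_live <- function() globalobj

globalobj["a"] <- 100          # later modification of the global object

snapshot <- fn()               # value captured by the closure
live <- fn_live()              # current value of the global object
```

The closure keeps the value it captured, while fn_live reflects the modification because it resolves globalobj at call time.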