Gather in sparklyr

RPi*_*sco 3 r dplyr apache-spark sparklyr

I'm using sparklyr to manipulate some data. Given the tibble a:

library(dplyr)  # provides tibble() and %>%
library(tidyr)  # provides gather()

a<-tibble(id = rep(c(1,10), each = 10),
          attribute1 = rep(c("This", "That", 'These', 'Those', "The", "Other", "Test", "End", "Start", 'Beginning'), 2),
          value = rep(seq(10,100, by = 10),2),
          average = rep(c(50,100),each = 10),
          upper_bound = rep(c(80, 130), each =10),
          lower_bound = rep(c(20, 70), each =10))

I would like to use gather to reshape the data, like this:

b<- a %>% 
     gather(key = type_data, value = value_data, -c(id:attribute1))

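However, gather is not available in sparklyr. I have seen some people use sdf_pivot to mimic gather (e.g. "How to use sdf_pivot() in sparklyr and concatenate strings?"), but I can't see how to use it in this case.

Does anyone have an idea?

Cheers!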

小智 7

Here is a function that mimics gather in sparklyr. It gathers the given columns while leaving everything else intact, and it can easily be extended if needed.

# Function
sdf_gather <- function(tbl, gather_cols){

  other_cols <- colnames(tbl)[!colnames(tbl) %in% gather_cols]

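  lapply(gather_cols, function(col_nm){
    tbl %>% 
      select(all_of(c(other_cols, col_nm))) %>%  # untouched columns plus one gathered column
      mutate(key = !!col_nm) %>%                 # record which column the values came from
      rename(value = all_of(col_nm))
  }) %>% 
    sdf_bind_rows() %>%                          # stack the per-column tables row-wise
    select(all_of(c(other_cols, 'key', 'value')))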
}

# Example
spark_df %>% 
  select(col_1, col_2, col_3, col_4) %>% 
  sdf_gather(c('col_3', 'col_4'))
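To check what this bind-rows approach produces without a Spark connection, here is a minimal local sketch in base R. The helper name gather_local and the small data frame are hypothetical and only for illustration; the code mirrors the logic of sdf_gather on an ordinary data frame and does not reproduce sparklyr behaviour.

```r
# Hypothetical local analogue of sdf_gather(): build one slice per gathered
# column, then stack the slices row-wise.
gather_local <- function(df, gather_cols) {
  other_cols <- setdiff(colnames(df), gather_cols)

  slices <- lapply(gather_cols, function(col_nm) {
    slice <- df[, other_cols, drop = FALSE]
    slice$key <- col_nm          # name of the column being gathered
    slice$value <- df[[col_nm]]  # values of that column
    slice
  })

  do.call(rbind, slices)
}

# Illustrative data (not taken from the question)
wide <- data.frame(id = c(1, 2), col_3 = c(10, 20), col_4 = c(30, 40))
long_local <- gather_local(wide, c("col_3", "col_4"))
print(long_local)
```

In this local version the rows come out grouped by gathered column, because the slices are bound one after another.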


zer*_*323 5

You can design an equivalent using explode with map:

sdf_gather <- function(data, key = "key", value = "value", ...) {
  cols <- list(...) %>% unlist()

  # Explode with map (same as stack) requires multiple aliases so
  # dplyr mutate won't work for us here.
  expr <- list(paste(
    "explode(map(",
    paste("'", cols, "',`",  cols, "`", sep = "", collapse = ","),
    ")) as (", key, ",", value, ")", sep = ""))

  keys <- data %>% colnames() %>% setdiff(cols) %>% as.list()

  data %>%
    spark_dataframe() %>% 
    sparklyr::invoke("selectExpr", c(keys, expr)) %>% 
    sdf_register()
}

or the Hive stack function:

sdf_gather <- function(data, key = "key", value = "value", ...) {
  cols <- list(...) %>% unlist()
  expr <- list(paste(
    "stack(", length(cols), ", ",
    paste("'", cols, "',`",  cols, "`", sep="", collapse=","),
    ") as (", key, ",", value, ")", sep=""))

  keys <- data %>% colnames() %>% setdiff(cols) %>% as.list()

  data %>%
    spark_dataframe() %>% 
    sparklyr::invoke("selectExpr", c(keys, expr)) %>% 
    sdf_register()
}
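To make the string-building step concrete, the following base R snippet repeats only the paste() calls from the two functions above and prints the expressions that would be handed to selectExpr. It needs no Spark connection; the variable names pairs, explode_expr and stack_expr, and the two-column example, are introduced here for illustration.

```r
cols  <- c("value", "average")
key   <- "my_key"
value <- "my_value"

# 'name',`name` pairs: a string literal followed by a column reference
pairs <- paste("'", cols, "',`", cols, "`", sep = "", collapse = ",")

# Expression built by the explode/map version
explode_expr <- paste(
  "explode(map(", pairs, ")) as (", key, ",", value, ")", sep = "")

# Expression built by the stack version (stack needs the number of rows first)
stack_expr <- paste(
  "stack(", length(cols), ", ", pairs, ") as (", key, ",", value, ")", sep = "")

print(explode_expr)
print(stack_expr)
```

Printing the strings before invoking selectExpr is a cheap way to spot quoting mistakes in column names.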

Both should give the same result:

long <- sdf_gather(
  df, "my_key", "my_value",
  "value", "average", "upper_bound", "lower_bound")
long

# Source:   table<sparklyr_tmp_7b8f5989ba4d> [?? x 4]
# Database: spark_connection
      id attribute1 my_key      my_value
   <dbl> <chr>      <chr>          <dbl>
 1     1 This       value             10
 2     1 This       average           50
 3     1 This       upper_bound       80
 4     1 This       lower_bound       20
 5     1 That       value             20
 6     1 That       average           50
 7     1 That       upper_bound       80
 8     1 That       lower_bound       20
 9     1 These      value             30
10     1 These      average           50
# ... with more rows

and both can be modified to support non-standard evaluation.

Note that both methods require the gathered columns to be of the same type.

Notes

The explode version generates the following query:

SELECT id, attribute1,
       explode(map(
         'value', `value`,
         'average', `average`,
         'upper_bound', `upper_bound`,
         'lower_bound', `lower_bound`)) as (my_key,my_value)
FROM df

which Spark then compiles into an optimized logical execution plan.

The stack version generates:

SELECT id, attribute1,
       stack(4,
         'value', `value`,
         'average', `average`,
         'upper_bound', `upper_bound`,
         'lower_bound', `lower_bound`) as (my_key,my_value)
FROM df

Single-quoted values (e.g. 'value') are literal strings in the generated SQL, while backticked values represent column references.
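Because the backticked entries are ordinary column references, one possible way around the same-type requirement is to wrap each reference in a cast before it goes into map or stack. The sketch below only builds the expression string in base R; cast_pairs and its type argument are hypothetical names, and casting everything to string is an assumption that may not suit every dataset.

```r
# Hypothetical helper: 'name',cast(`name` as <type>) pairs, so that all
# gathered values share one Spark SQL type.
cast_pairs <- function(cols, type = "string") {
  paste("'", cols, "',cast(`", cols, "` as ", type, ")",
        sep = "", collapse = ",")
}

cols <- c("value", "attribute1")  # a numeric and a character column

stack_expr <- paste(
  "stack(", length(cols), ", ", cast_pairs(cols),
  ") as (my_key,my_value)", sep = "")

print(stack_expr)
```

The resulting string could be passed to selectExpr in place of the expression built inside sdf_gather.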