Sparklyr: split string (to string)

hjk*_*kop 3 r apache-spark apache-spark-ml sparklyr

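I am trying to split a string in sparklyr and then use the result for joins/filters.

I tried the suggested approach of tokenizing the string and then separating it into new columns. Here is a reproducible example (note that I have to convert the NA, which becomes the string "NA" after copy_to, back to an actual NA; is there a way to avoid having to do that?):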

library(sparklyr)
library(dplyr)

# sc is assumed to be an existing Spark connection (e.g. from spark_connect())
x <- data.frame(Id=c(1,2,3,4),A=c('A-B','A-C','A-D',NA))
df <- copy_to(sc,x,'df')

df %>%  mutate(A = ifelse(A=='NA',NA,A)) %>% ft_regex_tokenizer(input.col="A", output.col="B", pattern="-",to_lower_case=FALSE) %>% 
    sdf_separate_column("B", into=c("C", "D")) %>% filter(C=='A') 

The problem is that if I try to filter on the newly created columns (e.g. %>% filter(C=='A')) or join on them, I get an error.

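I'm not sure why, since according to sdf_schema the created columns are of type "StringType".

Is there a solution using sparklyr that actually separates the column into columns I can later use as strings, without having to write the frame out to a file or collect it to the driver node?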

zer*_*323 5

Using a Spark ML transformer is not a good choice here. Instead, you should use split:

df %>% 
  mutate(B = split(A, "-")) %>% 
  sdf_separate_column("B", into = c("C", "D")) %>%
  filter(C %IS NOT DISTINCT FROM% "A") 
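
The %IS NOT DISTINCT FROM% operator in the filter is not an R function. dbplyr, which sparklyr uses for SQL translation, passes unknown %...% infix operators through to SQL as written. A minimal sketch of that translation, assuming dbplyr is installed; lazy_frame and simulate_dbi are dbplyr helpers used here only for illustration (they are not part of the answer), so no Spark connection is needed:

```r
library(dplyr)
library(dbplyr)

# A lazy table backed by a simulated connection (assumption: illustration only,
# this is not a Spark table and nothing is executed against a database).
lf <- lazy_frame(Id = c(1, 2), C = c("A", NA), con = simulate_dbi())

# The unknown infix operator is kept verbatim in the generated SQL.
q <- lf %>% filter(C %IS NOT DISTINCT FROM% "A")

sql_text <- as.character(sql_render(q))
print(sql_text)
```

IS NOT DISTINCT FROM is a NULL-safe equality test: it yields true or false even when one side is NULL, whereas a plain = comparison against NULL yields NULL.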

Or regexp_extract:

pattern <- "^(.*)-(.*)$"

df %>% 
   mutate(
     C = regexp_extract(A, pattern, 1),
     D = regexp_extract(A, pattern, 2)
   ) %>%
   filter(C %IS NOT DISTINCT FROM% "A") 

Output with the intermediate array column B, as produced by the split version:

# Source: spark<?> [?? x 5]
     Id A     B          C     D    
  <dbl> <chr> <list>     <chr> <chr>
1     1 A-B   <list [2]> A     B    
2     2 A-C   <list [2]> A     C    
3     3 A-D   <list [2]> A     D  

Nonetheless, if you want to make RegexTokenizer work, you have to handle NULL (NA in external R types) first. This can be done, for example, with coalesce:

tokenizer <- ft_regex_tokenizer(
  sc, input_col = "A", output_col = "B",
  pattern = "-", to_lower_case = FALSE
)

df %>%  
  mutate(A = coalesce(A, "")) %>% 
  ml_transform(tokenizer, .) %>%
  sdf_separate_column("B", into=c("C", "D")) %>%
  filter(C %IS NOT DISTINCT FROM% "A")

Or drop the missing data first:

df %>%  
  # or filter(!is.na(A))
  na.omit(columns=c("A")) %>%                      
  ml_transform(tokenizer, .) %>%
  sdf_separate_column("B", into=c("C", "D")) %>%
  filter(C %IS NOT DISTINCT FROM% "A")

For comparison, output without the intermediate column B, as produced by the regexp_extract version:

# Source: spark<?> [?? x 4]
     Id A     C     D    
  <dbl> <chr> <chr> <chr>
1     1 A-B   A     B    
2     2 A-C   A     C    
3     3 A-D   A     D    