hjk*_*kop 3 r apache-spark apache-spark-ml sparklyr
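Trying to split a string in sparklyr, then use it for joins/filtering
I tried the suggested approach of tokenizing the string and then separating it into new columns. Here is a reproducible example (note that I have to convert the NAs, which turn into the string "NA" after copy_to, back into actual NA; is there a way to avoid having to do that?):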
x <- data.frame(Id = c(1, 2, 3, 4), A = c('A-B', 'A-C', 'A-D', NA))
df <- copy_to(sc, x, 'df')
df %>%
  mutate(A = ifelse(A == 'NA', NA, A)) %>%
  ft_regex_tokenizer(input.col = "A", output.col = "B", pattern = "-", to_lower_case = F) %>%
  sdf_separate_column("B", into = c("C", "D")) %>%
  filter(C == 'A')
The problem is that if I try to filter on the newly created columns (e.g. %>% filter(C=='A')) or join on them, I get an error.
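I am not sure why, because according to sdf_schema the created columns are of type "StringType".
Is there a solution using sparklyr that actually separates the column into columns I can later use as strings, without having to write the frame out to a file or collect it to the driver node?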
Using Spark ML transformers is not a good choice here. Instead, you should use split:
df %>%
  mutate(B = split(A, "-")) %>%
  sdf_separate_column("B", into = c("C", "D")) %>%
  filter(C %IS NOT DISTINCT FROM% "A")
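Neither split nor %IS NOT DISTINCT FROM% is an R function here: dbplyr passes function calls and infix operators it does not recognize through to SQL, where Spark evaluates them (IS NOT DISTINCT FROM being the null-safe form of equality). A minimal sketch for inspecting the generated SQL without a Spark session, assuming the dbplyr package is installed; simulate_dbi() is only a generic stand-in connection, not the Spark backend, so the exact quoting may differ from what sparklyr emits:

```r
library(dbplyr)

# Generic simulated connection (an assumption for illustration; not Spark)
con <- simulate_dbi()

# An unknown infix operator of the form %op% is emitted as the SQL operator op
null_safe <- translate_sql(C %IS NOT DISTINCT FROM% "A", con = con)

# An unknown function call is passed through to the database as-is
split_call <- translate_sql(split(A, "-"), con = con)

print(null_safe)
print(split_call)
```

On a live connection, appending show_query() to the pipeline should give the same kind of view for the whole query.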
or regexp_extract:
pattern <- "^(.*)-(.*)$"
df %>%
  mutate(
    C = regexp_extract(A, pattern, 1),
    D = regexp_extract(A, pattern, 2)
  ) %>%
  filter(C %IS NOT DISTINCT FROM% "A")
For reference, the split version returns the following (B is the intermediate array column):
# Source: spark<?> [?? x 5]
     Id A     B          C     D
  <dbl> <chr> <list>     <chr> <chr>
1     1 A-B   <list [2]> A     B
2     2 A-C   <list [2]> A     C
3     3 A-D   <list [2]> A     D
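Since the question also asks about joins, here is a minimal sketch of joining on one of the extracted columns. It assumes a local Spark installation reachable through spark_connect(master = "local"); the lookup table and its label column are hypothetical and exist only for illustration. The final collect() is there just to display the small result; the join itself runs in Spark.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available
sc <- spark_connect(master = "local")

x <- data.frame(Id = c(1, 2, 3, 4), A = c("A-B", "A-C", "A-D", NA),
                stringsAsFactors = FALSE)
df <- copy_to(sc, x, "df", overwrite = TRUE)

# Hypothetical lookup table keyed on the second part of the string
lookup <- copy_to(
  sc,
  data.frame(D = c("B", "C"), label = c("first", "second"),
             stringsAsFactors = FALSE),
  "lookup", overwrite = TRUE
)

pattern <- "^(.*)-(.*)$"

joined <- df %>%
  mutate(
    C = regexp_extract(A, pattern, 1),
    D = regexp_extract(A, pattern, 2)
  ) %>%
  inner_join(lookup, by = "D") %>%  # join on the extracted string column
  collect()                         # only to show the small result locally

print(joined)

spark_disconnect(sc)
```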
Nonetheless, if you want to make RegexTokenizer work, you have to handle NULL (NA in external R types) first. This can be done, for example, with coalesce:
tokenizer <- ft_regex_tokenizer(
  sc, input_col = "A", output_col = "B",
  pattern = "-", to_lower_case = FALSE
)
df %>%
  mutate(A = coalesce(A, "")) %>%
  ml_transform(tokenizer, .) %>%
  sdf_separate_column("B", into = c("C", "D")) %>%
  filter(C %IS NOT DISTINCT FROM% "A")
or by dropping the missing data first:
df %>%
  # or filter(!is.na(A))
  na.omit(columns = c("A")) %>%
  ml_transform(tokenizer, .) %>%
  sdf_separate_column("B", into = c("C", "D")) %>%
  filter(C %IS NOT DISTINCT FROM% "A")
For reference, the regexp_extract version shown earlier returns the following (there is no intermediate B column):
# Source: spark<?> [?? x 4]
     Id A     C     D
  <dbl> <chr> <chr> <chr>
1     1 A-B   A     B
2     2 A-C   A     C
3     3 A-D   A     D