Cro*_*ops 0 string r strsplit dataframe data.table
I have a data frame, data, made up of character vectors, as follows.
x <- c("kal, Kon, Jor, Kara", "Bruce, Helena, Martha, Terry", "connor, oliver, Roy",
"Alan, Guy, Simon, Kyle")
y <- c("Mon, Cir, John, Jor", "Damian, Terry, Jason", "Mia, Roy", "John, Cary")
data <- data.frame(x,y, stringsAsFactors=FALSE)
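I am trying to concatenate the strings in columns x and y into a new column z. Before concatenating the strings in each row, I want to remove duplicates and sort the comma-separated words. I was able to achieve this as follows.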
x <- strsplit(data$x, split=", ")
y <- strsplit(data$y, split=", ")
data$z <- sapply(1:length(x), function(i) paste(sort(union(x[[i]], y[[i]])),
collapse=", "))
Is there a faster way to do this than creating the intermediate lists, possibly using data.table?
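You could try a regex solution. However, it will not sort the words the way you want, and because sub() replaces only the first match, it removes at most one duplicate per row.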
v1 <- paste(data[,1], data[,2], sep=", ")
data$z <- sub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*),', "", v1, perl=TRUE)
The regular expression can be inspected on regex101.
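To see what the pattern does, and why sub() handles only one duplicate per row, here is a small sketch on a made-up string. The trailing \\s* is my own addition (an assumption, not part of the answer above) so that the space after the removed comma goes as well. Note that it is the last occurrence of each word that survives.

```r
# Pattern from the answer, with \\s* appended (my assumption) to also
# swallow the space that follows the removed comma.
pat <- '(\\b\\S+\\b)(?=.*\\b\\1\\b.*),\\s*'

v <- "Roy, Mia, Roy, Mia, Kon"   # made-up input with two repeated words

first_only <- sub(pat, "", v, perl = TRUE)   # removes the first match only
all_dups   <- gsub(pat, "", v, perl = TRUE)  # removes every earlier duplicate

print(first_only)  # "Mia, Roy, Mia, Kon"
print(all_dups)    # "Roy, Mia, Kon"
```

If a row may contain more than one repeated word, gsub() looks like the safer choice.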
Other options include:
library(splitstackshape)
library(data.table)
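setDT(data)[, indx := 1:.N]  # convert to data.table by reference, add a row index
# split x and y into long format, then collapse the unique words per row
# (first-seen order is kept; the result is not sorted)
cbind(data[, 1:2, with=FALSE],
      cSplit(data, c('x', 'y'), sep=",", 'long')[,
        list(z=toString(unique(na.omit(unlist(.SD))))),
        by=indx][, indx:=NULL])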
#                              x                    y
#1:          kal, Kon, Jor, Kara  Mon, Cir, John, Jor
#2: Bruce, Helena, Martha, Terry Damian, Terry, Jason
#3:          connor, oliver, Roy             Mia, Roy
#4:       Alan, Guy, Simon, Kyle           John, Cary
#                                             z
#1:         kal, Kon, Jor, Kara, Mon, Cir, John
#2: Bruce, Helena, Martha, Terry, Damian, Jason
#3:                    connor, oliver, Roy, Mia
#4:          Alan, Guy, Simon, Kyle, John, Cary
Or use the stringi package:
library(stringi)
data$z <- vapply(stri_extract_all_regex(paste(data$x, data$y), '\\w+'),
function(x) toString(sort(unique(x))), character(1))
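If stringi is not available, much the same idea can be sketched in base R: paste the two columns together, split once, then de-duplicate and sort each piece. This is only a sketch. The helper name combine_sorted is hypothetical, and it assumes the items are always separated by exactly ", ".

```r
x <- c("kal, Kon, Jor, Kara", "Bruce, Helena, Martha, Terry",
       "connor, oliver, Roy", "Alan, Guy, Simon, Kyle")
y <- c("Mon, Cir, John, Jor", "Damian, Terry, Jason", "Mia, Roy", "John, Cary")
data <- data.frame(x, y, stringsAsFactors = FALSE)

# Hypothetical helper: one vectorised paste() and strsplit(), then a
# per-row unique/sort. Assumes items are separated by exactly ", ".
combine_sorted <- function(a, b) {
  parts <- strsplit(paste(a, b, sep = ", "), ", ", fixed = TRUE)
  vapply(parts, function(w) toString(sort(unique(w))), character(1))
}

data$z <- combine_sorted(data$x, data$y)

# Cross-check against the approach from the question
xs <- strsplit(data$x, split = ", ")
ys <- strsplit(data$y, split = ", ")
z_question <- sapply(seq_along(xs), function(i)
  paste(sort(union(xs[[i]], ys[[i]])), collapse = ", "))
stopifnot(identical(data$z, z_question))
```

The sort order of mixed-case words depends on the locale, so the exact ordering of z may differ between machines.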
Benchmarks on a not-so-large dataset (the four rows repeated 30,000 times):
data <- data[rep(1:nrow(data), 3e4),]
row.names(data) <- NULL
cath <- function(){
apply(data,1,function(vec){
paste(sort(unique(strsplit(paste(vec[1],
vec[2],sep=", "),", ")[[1]])),collapse=", ")
})
}
akrun2 <- function(){
vapply(stri_extract_all_regex(paste(data$x, data$y), '\\w+'),
function(x) toString(sort(unique(x))), character(1))
}
akrun3 <- function(){
v1 <- paste(data[,1], data[,2], sep=", ")
sub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*),', "", v1, perl=TRUE)
}
library(microbenchmark)
microbenchmark(cath(), akrun2(), akrun3(), unit='relative', times=10L)
#Unit: relative
#     expr       min        lq      mean   median       uq      max neval cld
#   cath() 11.700071 11.979908 11.700118 11.76762 11.57583 11.40806    10   c
# akrun2()  7.175622  7.225212  7.217322  7.19431  7.09539  7.31929    10  b
# akrun3()  1.000000  1.000000  1.000000  1.00000  1.00000  1.00000    10 a
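The timings are relative to the fastest expression, so at the median cath() took roughly 11.8 times as long as akrun3(), and akrun2() roughly 7.2 times. Keep in mind that akrun3() does not sort, so it is not doing quite the same work. A quick check along these lines (a sketch on one pair of strings from the example data; same_words is a hypothetical helper) makes the difference visible:

```r
a <- "kal, Kon, Jor, Kara"
b <- "Mon, Cir, John, Jor"
v1 <- paste(a, b, sep = ", ")

# Split/unique/sort approach, as in cath()
sorted_z <- paste(sort(unique(strsplit(v1, ", ")[[1]])), collapse = ", ")

# Regex approach, as in akrun3()
regex_z <- sub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*),', "", v1, perl = TRUE)

# Hypothetical helper: compare as sets of words, ignoring order and spacing
same_words <- function(s, t) {
  setequal(strsplit(s, ",\\s*")[[1]], strsplit(t, ",\\s*")[[1]])
}

print(same_words(sorted_z, regex_z))  # TRUE: the same words survive
print(identical(sorted_z, regex_z))   # FALSE: order and spacing differ
```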