将"在字符串列表中设置"优化为"设置为矩阵"操作

Eri*_*ier 4 optimization r string-matching

我有一组包含空格分隔元素的字符串.我想建立一个矩阵,告诉我哪些元素是哪些字符串的一部分.例如:

""
"A B C"
"D"
"B D"
Run Code Online (Sandbox Code Playgroud)

应该给出类似的东西:

  A B C D
1
2 1 1 1
3       1
4   1   1
Run Code Online (Sandbox Code Playgroud)

现在我已经有了一个解决方案,但是它作为磨拉石运行缓慢,而且我已经没有关于如何加快速度的想法:

reverseIn <- function(vector, value) {
    return(value %in% vector)
}

buildCategoryMatrix <- function(valueVector) {
    allClasses <- c()
    for(classVec in unique(valueVector)) {
        allClasses <- unique(c(allClasses,
                               strsplit(classVec, " ", fixed=TRUE)[[1]]))
    }

    resMatrix <- matrix(ncol=0, nrow=length(valueVector))
    splitValues <- strsplit(valueVector, " ", fixed=TRUE)

    for(cat in allClasses) {
        if(cat=="") {
            catIsPart <- (valueVector == "")
        } else {
            catIsPart <- sapply(splitValues, reverseIn, cat)
        }
        resMatrix <- cbind(resMatrix, catIsPart)
    }
    colnames(resMatrix) <- allClasses

    return(resMatrix)
}
Run Code Online (Sandbox Code Playgroud)

分析函数给了我这个:

$by.self
                  self.time self.pct total.time total.pct
"match"               31.20    34.74      31.24     34.79
"FUN"                 30.26    33.70      74.30     82.74
"lapply"              13.56    15.10      87.86     97.84
"%in%"                12.92    14.39      44.10     49.11
Run Code Online (Sandbox Code Playgroud)

所以我的实际问题是: - "FUN"花费的33%来自哪里? - 有没有办法加快%in%的召唤?

我尝试在进入循环之前将字符串转换为因子,以便我匹配数字而不是字符串,但这实际上使R崩溃.我也尝试过部分矩阵赋值(IE,resMatrix [i,x] < - 1),其中i是字符串的编号,x是因子的矢量.也没有骰子,因为它似乎继续无限运行.

A5C*_*2T1 5

在我的"splitstackshape"包的开发版本中,有一个调用的辅助函数charBinaryMat可以用于这样的事情:

这是函数(因为CRAN上的软件包版本还没有):

charBinaryMat <- function(listOfValues, fill = NA) {
  lev <- sort(unique(unlist(listOfValues, use.names = FALSE)))
  m <- matrix(fill, nrow = length(listOfValues), ncol = length(lev))
  colnames(m) <- lev
  for (i in 1:nrow(m)) {
    m[i, listOfValues[[i]]] <- 1
  }
  m
}
Run Code Online (Sandbox Code Playgroud)

输入预计是以下输出strsplit:

在这里它正在使用中:

str <- c("" , "A B C" , "D" , "B D" )

## Fill is `NA` by default
charBinaryMat(strsplit(str, " ", fixed=TRUE))
#       A  B  C  D
# [1,] NA NA NA NA
# [2,]  1  1  1 NA
# [3,] NA NA NA  1
# [4,] NA  1 NA  1

## Can easily be set to another value
charBinaryMat(strsplit(str, " ", fixed=TRUE), fill = 0)
#      A B C D
# [1,] 0 0 0 0
# [2,] 1 1 1 0
# [3,] 0 0 0 1
# [4,] 0 1 0 1
Run Code Online (Sandbox Code Playgroud)

标杆

既然你的问题是关于更快的方法,那么让我们进行基准测试.

  1. 基准测试的功能:

    CBM <- function() {
      charBinaryMat(strsplit(str, " ", fixed=TRUE), fill = 0)
    }
    BCM <- function() {
      buildCategoryMatrix(str)*1L
    }
    Sapply <- function() {
      y <- unique( unlist( strsplit( str , " " ) ) )
      out <- t(sapply(str, function(x) y %in% unlist(strsplit(x , " " )),
                      USE.NAMES = FALSE )) * 1L
      colnames(out) <- y
      out
    }
    
    Run Code Online (Sandbox Code Playgroud)
  2. 一些样本数据:

    set.seed(1)
    A = sample(10, 100000, replace = TRUE)
    str <- sapply(seq_along(A), function(x)
      paste(sample(LETTERS[1:10], A[x]), collapse = " "))
    head(str)
    # [1] "H G C"               "F H J G"             "H D J A I B"        
    # [4] "A C F H J B E G D I" "F C H"               "I C G B J D F A E" 
    
    Run Code Online (Sandbox Code Playgroud)
  3. 一些示例输出:

    ## Automatically sorted
    head(CBM())
    #      A B C D E F G H I J
    # [1,] 0 0 1 0 0 0 1 1 0 0
    # [2,] 0 0 0 0 0 1 1 1 0 1
    # [3,] 1 1 0 1 0 0 0 1 1 1
    # [4,] 1 1 1 1 1 1 1 1 1 1
    # [5,] 0 0 1 0 0 1 0 1 0 0
    # [6,] 1 1 1 1 1 1 1 0 1 1
    
    ## Sorting just for comparison
    head(BCM())[, LETTERS[1:10]]
    #      A B C D E F G H I J
    # [1,] 0 0 1 0 0 0 1 1 0 0
    # [2,] 0 0 0 0 0 1 1 1 0 1
    # [3,] 1 1 0 1 0 0 0 1 1 1
    # [4,] 1 1 1 1 1 1 1 1 1 1
    # [5,] 0 0 1 0 0 1 0 1 0 0
    # [6,] 1 1 1 1 1 1 1 0 1 1
    
    ## Sorting just for comparison
    head(Sapply())[, LETTERS[1:10]]
    #      A B C D E F G H I J
    # [1,] 0 0 1 0 0 0 1 1 0 0
    # [2,] 0 0 0 0 0 1 1 1 0 1
    # [3,] 1 1 0 1 0 0 0 1 1 1
    # [4,] 1 1 1 1 1 1 1 1 1 1
    # [5,] 0 0 1 0 0 1 0 1 0 0
    # [6,] 1 1 1 1 1 1 1 0 1 1
    
    Run Code Online (Sandbox Code Playgroud)
  4. 标杆:

    library(microbenchmark)
    microbenchmark(CBM(), BCM(), Sapply(), times=20)
    # Unit: milliseconds
    #      expr        min         lq     median         uq        max neval
    #     CBM()   675.0929   718.3454   777.2423   805.3872   858.6609    20
    #     BCM() 11059.6305 11267.9888 11367.3283 11595.1758 11792.5950    20
    #  Sapply()  3536.7755  3687.0308  3759.7388  3813.4233  3968.3192    20
    
    Run Code Online (Sandbox Code Playgroud)