优化R代码,根据自定义距离函数创建距离矩阵

Gau*_*hal 5 string performance r edit-distance levenshtein-distance

我正在尝试基于自定义距离函数为字​​符串创建距离矩阵(用于聚类).我在6000字的列表上运行代码,并且自上次90分钟后它仍在运行.我有8 GB RAM和Intel-i5,所以问题只在于代码.这是我的代码:

library(stringdist)
#Calculate distance between two monograms/bigrams
stringdist2 <- function(word1, word2)
{
    #for bigrams - phrases with two words
    if (grepl(" ",word1)==TRUE) {
        #"Hello World" and "World Hello" are not so different for me
        d=min(stringdist(word1, word2),
        stringdist(word1, gsub(word2, 
                          pattern = "(.*) (.*)", 
                          repl="\\2,\\1")))
    }
    #for monograms(words)
    else{
        #add penalty of 5 points if first character is not same
        #brave and crave are more different than brave and bravery
        d=ifelse(substr(word1,1,1)==substr(word2,1,1),
                            stringdist(word1,word2),
                            stringdist(word1,word2)+5)
    }   
    d
}
#create distance matrix
stringdistmat2 = function(arr)
{
    mat = matrix(nrow = length(arr), ncol= length(arr))
    for (k in 1:(length(arr)-1))
    {
        for (j in k:(length(arr)-1))
        {           
            mat[j+1,k]  = stringdist2(arr[k],arr[j+1])      
        }
    }
    as.dist(mat)    
}

test = c("Hello World","World Hello", "Hello Word", "Cello Word")
mydmat = stringdistmat2(test)
> mydmat
  1 2 3
2 1    
3 1 2  
4 2 3 1
Run Code Online (Sandbox Code Playgroud)

我认为问题可能是我使用循环而不是应用 - 但后来我发现循环不是那么低效.更重要的是,我不熟练使用申请我的循环是嵌套循环是k in 1:nj in k:n.我想知道是否还有其他可以优化的东西.

Col*_*vel 4

有趣的问题。所以一步一步来:

1 -stringdist函数已经向量化:

#> stringdist("byye", c('bzyte','byte'))
#[1] 2 1

#> stringdist(c('doggy','gadgy'), 'dodgy')
#[1] 1 2
Run Code Online (Sandbox Code Playgroud)

但是给出两个具有相同长度的向量,stringdist将导致在每个向量上并行循环(不会产生具有交叉结果的矩阵),如下Map所示:

#> stringdist(c("byye","alllla"), c('bzyte','byte'))
#[1] 2 6
Run Code Online (Sandbox Code Playgroud)

2 -重写您的函数,以便您的新函数保留此矢量化特征

stringdistFast <- function(word1, word2)
{
    d1 = stringdist(word1, word2)
    d2 = stringdist(word1, gsub("(.+) (.+)", "\\2 \\1", word2))

    ifelse(d1==d2,d1+5*(substr(d1,1,1)!=substr(d2,1,1)),pmin(d1,d2))
}
Run Code Online (Sandbox Code Playgroud)

它确实以同样的方式工作:

#> stringdistFast("byye", c('bzyte','byte'))
#[1] 2 1

#> stringdistFast("by ye", c('bzyte','byte','ye by'))
#[1] 3 2 0
Run Code Online (Sandbox Code Playgroud)

3 -重写 dismatrix 函数,仅使用一个循环并且仅在三角形部分上(不outer,它很慢!):

stringdistmatFast <- function(test)
{
    m = diag(0, length(test))
    sapply(1:(length(test)-1), function(i)
    {
        m[,i] <<- c(rep(0,i), stringdistFast(test[i],test[(i+1):length(test)]))
    }) 

    `dimnames<-`(m + t(m), list(test,test))
}
Run Code Online (Sandbox Code Playgroud)

4 -使用该功能:

#> stringdistmatFast(test)
#            Hello World World Hello Hello Word Cello Word
#Hello World           0           0          1          2
#World Hello           0           0          1          2
#Hello Word            1           1          0          1
#Cello Word            2           2          1          0
Run Code Online (Sandbox Code Playgroud)