Vectorizing a loop to speed up a program in R

mar*_*abe 4 r

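I am looking for a simple way to vectorize a for loop in R. I have the following data frame of sentences and two dictionaries, one of positive and one of negative words: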

# Create data.frame with sentences
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop",
                         "wouldnt bad notebook", "very good quality", "orgtop",
                         "great improvement for that bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7),
               stringsAsFactors=F)

# Create pos/negWords
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
          "extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
          "wouldnt bad")
negWords <- c("hate","bad","not good","horrible")

Now I replicate the original data frame to simulate a large data set:

# Replicate original data.frame - big data simulation (700.000 rows of sentences)
df.expanded <- as.data.frame(replicate(100000,sent$words))
library(zoo)  # provides coredata(), used on the next line
sent <- coredata(sent)[rep(seq(nrow(sent)),100000),]
rownames(sent) <- NULL

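For my next step, I have to sort the dictionary words in descending order (by character length, as the code below does), keeping each word's sentiment score (positive word = 1, negative word = -1).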

# Ordering words in pos/negWords
wordsDF <- data.frame(words = posWords, value = 1,stringsAsFactors=F)
wordsDF <- rbind(wordsDF,data.frame(words = negWords, value = -1))
wordsDF$lengths <- unlist(lapply(wordsDF$words, nchar))
wordsDF <- wordsDF[order(-wordsDF[,3]),]
rownames(wordsDF) <- NULL

Then I define the following function, which uses a for loop:

# Sentiment score function
scoreSentence2 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
    count <- length(grep(matchWords,sentence)) # count them
    if(count){
      score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
      sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), ' ', sentence) # remove matched words from wordsDF
      # rm_white() comes from qdapRegex: run library(qdapRegex) before calling this function
      sentence <- rm_white(sentence)
    }
  }
  score
}

I call this function on the sentences in the data frame:

# Apply scoreSentence function to sentences
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
# Time consumption for 700.000 sentences in sent data.frame:
# user       system    elapsed
# 1054.19    0.09      1056.17
# Add sentiment score to origin sent data.frame
sent <- cbind(sent, SentimentScore2)

The expected output is:

Words                                             user      SentimentScore2
just right size and i love this notebook          1         2
benefits great laptop                             2         1
wouldnt bad notebook                              3         1
very good quality                                 4         1
orgtop                                            5         0
  .
  .
  .

and so on...

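Could anyone please help me reduce the computing time of my original approach? With my beginner programming skills I am at my wits' end :-) Any help or advice would be much appreciated. Thank you very much in advance.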

cmb*_*rbu 5

In the spirit of "teaching a man to fish is better than giving him a fish", I will walk you through it:

  1. Make a copy of your code: you are going to mess it up!

  2. Find the bottleneck:

    1a: Make the problem smaller:

    nRep <- 100
    df.expanded <- as.data.frame(replicate(nRep,sent$words))
    library(zoo)
    sent <- coredata(sent)[rep(seq(nrow(sent)),nRep),]
    

    1b: Keep a reference solution: you are going to change your code, and few activities are as good at introducing bugs as optimizing code!

    sentRef <- sent
    

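    and add the same line again, commented out, at the end of your code so you remember where your reference comes from. To make it easier to check that you have not broken your code, you can test it automatically at the end: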

    library("testthat")
    expect_equal(sent,sentRef)
    

    1c: Start the profiler around the code you want to look at:

    Rprof()
    SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
    Rprof(NULL)
    

    1d: Look at the results, with base R:

    summaryRprof()
    

    There are also better tools: you can check the packages profileR or lineprof.

    lineprof is my tool of choice, and it adds real value here: it narrows the problem down to these two lines:

    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
    count <- length(grep(matchWords,sentence)) # count them
    
  3. Fix it.

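    3.1 Fortunately, the main problem is fairly easy: the first line does not need to be inside the function, so move it before the function. By the way, the same applies to your paste0(). Your code becomes: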

    matchWords <- paste("\\<",wordsDF[,1],'\\>', sep="") # matching exact words
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')
    
    # Sentiment score function
    scoreSentence2 <- function(sentence){
        score <- 0
        for(x in 1:nrow(wordsDF)){
            count <- length(grep(matchWords[x],sentence)) # count them
            if(count){
                score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
                sentence <- gsub(matchedWords[x],' ', sentence) # remove matched words from wordsDF
                require(qdapRegex)
                # sentence <- rm_white(sentence)
            }
        }
        score
    }
    

    This changes the execution time for 1000 reps from 5.64s to 2.32s. Not a bad investment!

    3.2 The next bottleneck is the "count <-" line, but I think shadow's answer got that exactly right :-) Combining the two, we get:

    matchWords <- paste("\\<",wordsDF[,1],'\\>', sep="") # matching exact words
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')
    
    # Sentiment score function
    scoreSentence2 <- function(sentence){
        score <- 0
        for(x in 1:nrow(wordsDF)){
            count <- grepl(matchWords[x],sentence) # count them
            score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
            sentence <- gsub(matchedWords[x],' ', sentence) # remove matched words from wordsDF
            require(qdapRegex)
            # sentence <- rm_white(sentence)
        }
        score
    }
    

This brings it down to 0.18s, or 31 times faster...
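
The steps above still call the scoring function once per sentence. As a possible next step toward the vectorization the question asks about, here is a minimal sketch (not part of either answer, and not benchmarked) that loops over the dictionary entries and lets `grepl()` and `gsub()` work on the whole vector of sentences at once. The function name `scoreSentences` and its `dict` argument are hypothetical, introduced only for illustration. It assumes, as in the original code, that each dictionary entry counts at most once per sentence and that longer entries are matched and removed first.

```r
# Data from the question
sentences <- c("just right size and i love this notebook", "benefits great laptop",
               "wouldnt bad notebook", "very good quality", "orgtop",
               "great improvement for that bad product but overall is not good",
               "notebook is not good but i love batterytop")
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
              "extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
              "wouldnt bad")
negWords <- c("hate","bad","not good","horrible")

# Dictionary with scores, longest entries first (same ordering idea as wordsDF)
dict <- data.frame(words = c(posWords, negWords),
                   value = c(rep(1, length(posWords)), rep(-1, length(negWords))),
                   stringsAsFactors = FALSE)
dict <- dict[order(-nchar(dict$words)), ]

# Hypothetical helper: loop over dictionary entries, vectorized over sentences
scoreSentences <- function(sentences, dict) {
  score <- numeric(length(sentences))
  for (i in seq_len(nrow(dict))) {
    pattern <- paste0("\\b", dict$words[i], "\\b")  # match whole words only
    hit <- grepl(pattern, sentences)                # one logical per sentence
    score <- score + hit * dict$value[i]            # add +1 / -1 where matched
    # remove the matched entry so shorter entries cannot match it again
    sentences[hit] <- gsub(pattern, " ", sentences[hit])
  }
  score
}

scores <- scoreSentences(sentences, dict)
```

With this layout the loop runs once per dictionary entry (23 iterations here) rather than once per sentence and entry, so the per-call overhead no longer grows with the number of sentences. Checking the result against a reference such as `sentRef`, as suggested in step 1b, is still advisable before trusting it.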