从字符向量中提取和计算常见的单词对

law*_*yeR 5 r tm regex-lookarounds qdap

如何在角色向量中找到频繁的相邻单词对?例如,使用原油数据集,一些常见的货币对是"原油","石油市场"和"百万桶".

下面的小例子的代码试图识别频繁的术语,然后使用正向前瞻断言,计算频繁术语立即跟随这些频繁术语的次数.但是这次尝试坠毁并烧毁了.

任何指导都将被理解为如何创建在第一列("对")中显示公共对的数据帧以及在第二列("计数")中显示它们在文本中出现的次数.

   library(qdap)
   library(tm)

# from the crude data set, create a text file from the first three documents, then clean it

text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", "") # replace double spaces with single space
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))

# pick the top 10 individual words by frequency, since they will likely form the most common pairs
freq.terms <- head(freq_terms(text.var = text), 10) 

# create a pattern from the top words for the regex expression below
freq.terms.pat <- str_c(freq.terms$WORD, collapse = "|")

# match frequent terms that are followed by a frequent term
library(stringr)
pairs <- str_extract_all(string = text, pattern = "freq.terms.pat(?= freq.terms.pat)")
Run Code Online (Sandbox Code Playgroud)

这是努力步履蹒跚的地方.

不知道Java或Python,这些没有帮助Java计算单词对 Python计数单词对,但它们可能是其他人的有用参考.

谢谢.

ags*_*udy 1

这里的一个想法是创建一个带有二元组的新语料库:

二元组或二元组是标记字符串中两个相邻元素的每个序列

提取二元组的递归函数:

bigram <- 
  function(xs){
    if (length(xs) >= 2) 
       c(paste(xs[seq(2)],collapse='_'),bigram(tail(xs,-1)))

  }
Run Code Online (Sandbox Code Playgroud)

然后将其应用于tm包中的原始数据。(我在这里做了一些文本清理,但是这个步骤取决于文本)。

res <- unlist(lapply(crude,function(x){

  x <- tm::removeNumbers(tolower(x))
  x <- gsub('\n|[[:punct:]]',' ',x)
  x <- gsub('  +','',x)
  ## after cleaning a compute frequency using table 
  freqs <- table(bigram(strsplit(x," ")[[1]]))
  freqs[freqs>1]
}))


 as.data.frame(tail(sort(res),5))
                          tail(sort(res), 5)
reut-00022.xml.hold_a                      3
reut-00022.xml.in_the                      3
reut-00011.xml.of_the                      4
reut-00022.xml.a_futures                   4
reut-00010.xml.abdul_aziz                  5
Run Code Online (Sandbox Code Playgroud)

二元词“abdul aziz”和“a futures”是最常见的。您应该重新清理数据以删除 (of, the,..)。但这应该是一个好的开始。

OP评论后编辑:

如果您想获得所有语料库的二元组频率,一个想法是计算循环中的二元组,然后计算循环结果的频率。我受益于添加更好的文本处理清理。

res <- unlist(lapply(crude,function(x){
  x <- removeNumbers(tolower(x))
  x <- removeWords(x, words=c("the","of"))
  x <- removePunctuation(x)
  x <- gsub('\n|[[:punct:]]',' ',x)
  x <- gsub('  +','',x)
  ## after cleaning a compute frequency using table 
  words <- strsplit(x," ")[[1]]
  bigrams <- bigram(words[nchar(words)>2])
}))

xx <- as.data.frame(table(res))
setDT(xx)[order(Freq)]


#                 res Freq
#    1: abdulaziz_bin    1
#    2:  ability_hold    1
#    3:  ability_keep    1
#    4:  ability_sell    1
#    5:    able_hedge    1
# ---                   
# 2177:    last_month    6
# 2178:     crude_oil    7
# 2179:  oil_minister    7
# 2180:     world_oil    7
# 2181:    oil_prices   14
Run Code Online (Sandbox Code Playgroud)