Creating N-Grams with tm and RWeka - works with VCorpus but not with Corpus

Pau*_*l_J 6 r n-gram tm term-document-matrix rweka

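After following many guides to creating bigrams with the 'tm' and 'RWeka' packages, I was frustrated that only 1-grams were being returned in the TDM. Through a lot of trial and error I found that it works properly with 'VCorpus' but not with 'Corpus'. By the way, I am fairly sure this worked with 'Corpus' about a month ago, but it no longer does.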

R (3.3.3), RTools (3.4), RStudio (1.0.136) and all packages (tm 0.7-1, RWeka 0.4-31) have been updated to the latest versions.

I would appreciate knowing whether this is simply not meant to work with Corpus, and whether anyone else has the same problem.

#A Reproducible example
#
#Weka bi-gram test
#

library(tm)
library(RWeka)

someCleanText <- c("Congress shall make no law respecting an establishment of",
                    "religion, or prohibiting the free exercise thereof or",
                    "abridging the freedom of speech or of the press or the",
                    "right of the people peaceably to assemble and to petition",
                    "the Government for a redress of grievances")

aCorpus <- Corpus(VectorSource(someCleanText))   #With this, only 1-Grams are created
#aCorpus <- VCorpus(VectorSource(someCleanText)) #With this, biGrams are created as desired

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))

aTDM <- TermDocumentMatrix(aCorpus, control=list(tokenize=BigramTokenizer))

print(aTDM$dimnames$Terms)

Results with 'Corpus'

 [1] "congress"      "establishment" "law"           "make"         
 [5] "respecting"    "shall"         "exercise"      "free"         
 [9] "prohibiting"   "religion"      "the"           "thereof"      
[13] "abridging"     "freedom"       "press"         "speech"       
[17] "and"           "assemble"      "peaceably"     "people"       
[21] "petition"      "right"         "for"           "government"   
[25] "grievances"    "redress"

Results with 'VCorpus'

 [1] "a redress"        "abridging the"    "an establishment" "and to"          
 [5] "assemble and"     "congress shall"   "establishment of" "exercise thereof"
 [9] "for a"            "free exercise"    "freedom of"       "government for"  
[13] "law respecting"   "make no"          "no law"           "of grievances"   
[17] "of speech"        "of the"           "or of"            "or prohibiting"  
[21] "or the"           "peaceably to"     "people peaceably" "press or"        
[25] "prohibiting the"  "redress of"       "religion or"      "respecting an"   
[29] "right of"         "shall make"       "speech or"        "the free"        
[33] "the freedom"      "the government"   "the people"       "the press"       
[37] "thereof or"       "to assemble"      "to petition"
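One way to narrow down the difference between the two results is to look at the class of the object each constructor returns. The sketch below is only a diagnostic and assumes tm 0.7 or later, where, as far as the tm documentation describes it, `Corpus()` is a convenience alias that may return a `SimpleCorpus` for a `VectorSource`, and `TermDocumentMatrix()` on a `SimpleCorpus` does not accept custom functions such as a tokenizer. The variable names `txt`, `viaCorpus` and `viaVCorpus` are made up for illustration.

```r
library(tm)

# Two short example documents (illustrative text only)
txt <- c("Congress shall make no law",
         "respecting an establishment of religion")

# Build the same corpus with both constructors
viaCorpus  <- Corpus(VectorSource(txt))
viaVCorpus <- VCorpus(VectorSource(txt))

# Print the class vectors; if the first one contains "SimpleCorpus",
# that would be consistent with the custom tokenizer being ignored
print(class(viaCorpus))
print(class(viaVCorpus))
```

If the first class vector does not include "SimpleCorpus" on your installation, the explanation above does not apply and the cause lies elsewhere.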

小智 0

I was previously using R 3.4.1 and then changed to R 3.3.3; the VCorpus solution now works for me. tm and RWeka both create the bigrams correctly.

sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
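For anyone who wants to try the VCorpus route without RWeka (and so without its Java dependency), the following is a minimal sketch that builds the bigram tokenizer from `ngrams()` and `words()` in the NLP package, which tm attaches. The helper name `NlpBigramTokenizer` and the two sample documents are hypothetical and not part of the original question; this is a substitute for `NGramTokenizer`, not the approach used above.

```r
library(tm)  # attaches NLP, which provides ngrams() and words()

# Hypothetical helper: split a document into words, form all adjacent
# pairs, and paste each pair back together with a single space
NlpBigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "),
         use.names = FALSE)
}

# Illustrative input; VCorpus is used because it honours the tokenizer
docs <- c("no law respecting", "the free exercise")
vc   <- VCorpus(VectorSource(docs))

tdm <- TermDocumentMatrix(vc, control = list(tokenize = NlpBigramTokenizer))
print(Terms(tdm))
```

Swapping `2L` for another integer should give n-grams of that length, under the same assumption that the corpus is a `VCorpus`.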