Snowball Stemmer only stems the last word

Chr*_*ian 7 r stemming tm

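I want to use the tm package in R to stem the documents in a corpus of plain-text documents. When I apply the SnowballStemmer function to every document in the corpus, only the last word of each document gets stemmed.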

library(tm)
library(Snowball)
library(RWeka)
library(rJava)
path <- c("C:/path/to/directory")
corp <- Corpus(DirSource(path),
               readerControl = list(reader = readPlain, language = "en_US",
                                    load = TRUE))
tm_map(corp,SnowballStemmer) #stemDocument has the same problem

I think it has to do with the way the documents are read into the corpus. A few simple examples illustrate this:

> vec<-c("running runner runs","happyness happies")
> stemDocument(vec) 
   [1] "running runner run" "happyness happi" 

> vec2<-c("running","runner","runs","happyness","happies")
> stemDocument(vec2)
   [1] "run"    "runner" "run"    "happy"  "happi"

> corp<-Corpus(VectorSource(vec))
> corp<-tm_map(corp, stemDocument)
> inspect(corp)
   A corpus with 2 text documents

   The metadata consists of 2 tag-value pairs and a data frame
   Available tags are:
     create_date creator 
   Available variables in the data frame are:
     MetaID 

   [[1]]
   run runner run

   [[2]]
   happy happi

> corp2<-Corpus(DirSource(path),readerControl=list(reader=readPlain,language="en_US" ,  load=T))
> corp2<-tm_map(corp2, stemDocument)
> inspect(corp2)
   A corpus with 2 text documents

   The metadata consists of 2 tag-value pairs and a data frame
     Available tags are:
     create_date creator 
   Available variables in the data frame are:
     MetaID 

   $`1.txt`
   running runner runs

   $`2.txt`
   happyness happies

App*_*eue 3

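The problem I see is that wordStem takes a vector of words, but the Corpus plainTextReader assumes that each word in the documents it reads is on its own line. In other words, a file like the following confuses plainTextReader, because you end up with 3 "words" in your document (one per line):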

From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean.
From forth the fatal loins of these two foes

Instead, the file should be

From
ancient
grudge
break
to
new
mutiny
where 
civil
...etc...

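Also note that punctuation confuses wordStem as well, so you have to strip that out too.

Another way to do this without modifying your actual documents is to define a function that does the splitting and removes non-alphanumeric characters that appear before or after a word. Here is a simple one: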

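# Split a document into words, strip leading/trailing non-word
# characters, drop empty strings, then stem each word separately.
wordStem2 <- function(x) {
    mywords <- unlist(strsplit(x, " "))
    mycleanwords <- gsub("^\\W+|\\W+$", "", mywords, perl = TRUE)
    mycleanwords <- mycleanwords[mycleanwords != ""]
    wordStem(mycleanwords)
}

# mycorpus is the corpus read from your files
corpA <- tm_map(mycorpus, wordStem2)
corpB <- Corpus(VectorSource(corpA))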

Now just use corpB as your usual corpus.
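
If you would rather keep each document as a single string of stemmed words instead of a vector of separate words, the same split, clean, and stem steps can be followed by pasting the words back together. The following is only a rough sketch, not tested against tm itself: `stem_each_word` and its `stem_fun` argument are names made up for this example, and the stemmer is passed in as an argument so that the helper runs in base R. In practice you would pass `wordStem` (for example `SnowballC::wordStem`) as `stem_fun`.

```r
# Hypothetical helper: stem every word of a document and return
# the result as one space-separated string.
# stem_fun is any function mapping a character vector of words to
# a character vector of stems; identity is only a placeholder.
stem_each_word <- function(doc, stem_fun = identity) {
  # split on any run of whitespace, across all lines of the document
  words <- unlist(strsplit(doc, "\\s+"))
  # strip non-word characters before or after each word
  words <- gsub("^\\W+|\\W+$", "", words, perl = TRUE)
  # drop empty strings left over from the split or the cleaning
  words <- words[words != ""]
  # stem word by word, then rejoin into a single string
  paste(stem_fun(words), collapse = " ")
}

# Demonstration with a stand-in "stemmer" so the call runs without
# any extra package; replace toupper with a real stemmer.
stem_each_word("From ancient grudge break to new mutiny,", toupper)
```

With the stand-in function the call returns the cleaned words in upper case, which shows that every word, not just the last one, reaches `stem_fun`. Whether the function can be handed to `tm_map` directly depends on your tm version; newer releases may require wrapping it first, so treat that part as an assumption to verify.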