stemCompletion is not working

Sun*_*nil 8 tm

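I am using the tm package for text analysis of repair data: I read the data into a data frame, convert it to a Corpus object, and clean it with functions such as tolower, stripWhitespace, and removeWords with stopwords.

I keep a copy of the Corpus object to use for stemCompletion.

I ran stemDocument through tm_map and the words in my corpus were stemmed as expected.

When I run stemCompletion through tm_map, it does not work and I get the following error:

Error in UseMethod("words") : no applicable method for 'words' applied to an object of class "character"

Running traceback() shows the following steps: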

> traceback()
9: FUN(X[[1L]], ...)
8: lapply(dictionary, words)
7: unlist(lapply(dictionary, words))
6: unique(unlist(lapply(dictionary, words)))
5: FUN(X[[1L]], ...)
4: lapply(X, FUN, ...)
3: mclapply(content(x), FUN, ...)
2: tm_map.VCorpus(c, stemCompletion, dictionary = c_orig)
1: tm_map(c, stemCompletion, dictionary = c_orig)

How can I fix this error?

cdx*_*sza 6

I got the same error with tm v0.6. I suspect this is because stemCompletion is not one of the default transformations in this version of the tm package:

>  getTransformations
function () 
c("removeNumbers", "removePunctuation", "removeWords", "stemDocument", 
    "stripWhitespace")
<environment: namespace:tm>

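The tolower function has the same problem, but it can be made to work by wrapping it in content_transformer. I tried a similar approach with stemCompletion, without success.

Note that even though stemCompletion is not a default transformation, it still works when the stems are passed in manually: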

> stemCompletion("compani",dictCorpus)
    compani 
"companies" 

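So that I could get on with my work, I split each document in the corpus on single spaces, passed the pieces through stemCompletion, and joined them back together with the following (clunky and inelegant!) function: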

stemCompletion_mod <- function(x,dict=dictCorpus) {
  PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")),dictionary=dict, type="shortest"),sep="", collapse=" ")))
}

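Here dictCorpus is just a copy of the cleaned corpus, taken before it was stemmed. The extra stripWhitespace is specific to my corpus, but is probably harmless for a general corpus. You may want to change the type option from "shortest" depending on your needs.

For a complete example, let's set up a dummy corpus using the crude data from the tm package: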

> data("crude")
> docs = Corpus(VectorSource(crude))
> docs <- tm_map(docs, content_transformer(tolower))
> docs <- tm_map(docs, removeNumbers)
> docs <- tm_map(docs, removeWords, stopwords("english"))
> docs <- tm_map(docs, removePunctuation)
> docs <- tm_map(docs, stripWhitespace)
> docs <- tm_map(docs, PlainTextDocument)
> dictCorpus <- docs
> docs <- tm_map(docs, stemDocument)

> # Define modified stemCompletion function
> stemCompletion_mod <- function(x,dict=dictCorpus) {
  PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")),dictionary=dict, type="shortest"),sep="", collapse=" ")))
}

> # Original doc in crude data
> crude[[1]]
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter

> # Stemmed example in crude data
> docs[[1]]
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel 
reduct bring post price west texa intermedi dlrs barrel copani said price reduct today 
made light fall oil product price weak crude oil market compani spokeswoman said diamond 
latest line us oil compani cut contract post price last two day cite weak oil market reuter

> # Stem completed example in crude data
> stemCompletion_mod(docs[[1]],dictCorpus)
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel 
reduction brings posted price west texas intermediate dlrs barrel NA said price reduction today 
made light fall oil product price weak crude oil market companies spokeswoman said diamond 
latest line us oil companies cut contract posted price last two day cited weak oil market reuter

Note: this example is odd because the misspelled word "copany" gets mapped "copany" -> "copani" -> "NA" along the way. I am not sure how to correct this...
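
One possible workaround for the "NA" above, offered only as a sketch: keep the original stem wherever completion finds nothing. The helper name `fill_unmatched` is hypothetical, and it assumes the completions arrive as a character vector aligned with the stems, with NA (or an empty string) marking a failed lookup, which is what the output above suggests for type="shortest". It uses base R only, so the completion result is mocked by hand here.

```r
# Hypothetical helper: fall back to the stem when no completion was found.
fill_unmatched <- function(stems, completed) {
  completed <- unname(completed)
  missing <- is.na(completed) | !nzchar(completed)
  completed[missing] <- stems[missing]
  completed
}

# Hand-written stand-in for a stemCompletion() result (an assumption).
stems     <- c("copani", "compani", "reduct")
completed <- c(copani = NA, compani = "companies", reduct = "reduction")

print(fill_unmatched(stems, completed))
```

Inside stemCompletion_mod this would sit between the stemCompletion call and paste, so the misspelled stem survives as "copani" instead of turning into the literal text "NA".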

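To run stemCompletion_mod over the whole corpus, I just use sapply (or parSapply from the snow package).

Perhaps someone with more experience than me can suggest a simpler modification that gets stemCompletion working in v0.6 of the tm package.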


dar*_*zig 5

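I have had success with the following workflow:

  1. use content_transformer to apply an anonymous function to each document of the corpus,
  2. split the document into words on spaces,
  3. call stemCompletion on the words with the help of a dictionary,
  4. and join the separate words back into a document with paste.

POC demo code: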

tm_map(c, content_transformer(function(x, d)
  paste(stemCompletion(strsplit(stemDocument(x), ' ')[[1]], d), collapse = ' ')), d)

PS: using c as a variable name to store the corpus is not a good idea, because it masks base::c
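
The same split, complete, rejoin idea can also be written as a named helper, which is easier to read and test than the one-liner. This is a rough sketch, not tm code: `complete_text` and `toy_completer` are hypothetical names, and the toy completer is a base-R stand-in (first dictionary word that starts with the stem) so that the snippet runs without tm installed.

```r
# Hypothetical helper: split a document into stems, complete each one,
# and join the results back into a single string.
complete_text <- function(text, completer) {
  stems <- strsplit(text, " ", fixed = TRUE)[[1]]
  stems <- stems[nzchar(stems)]  # drop empty strings left by extra spaces
  paste(completer(stems), collapse = " ")
}

# Toy stand-in for stemCompletion (an assumption, not tm's implementation):
# return the first dictionary word starting with the stem, else the stem.
toy_completer <- function(stems,
                          dictionary = c("companies", "prices", "reduction")) {
  vapply(stems, function(s) {
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) > 0) hits[1] else s
  }, character(1), USE.NAMES = FALSE)
}

completed <- complete_text(" compani cut price reduct ", toy_completer)
print(completed)
```

With tm loaded, the completer would presumably be something like function(s) stemCompletion(s, dictionary = d), and the helper could be passed to tm_map through content_transformer in the same way as the anonymous function above.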


小智 5

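Thanks, cdxsza. Your approach works for me.

A note to everyone who is going to use stemCompletion:

The function completes an empty string with a word from the dictionary, which is unexpected. See the example below, where the first "monday" is produced for the blank at the beginning of the string.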

stemCompletion(unlist(strsplit(" mond tues ", " ")), dict=c("monday", "tuesday"))


[1]   "monday"  "monday" "tuesday" 
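
To see where the extra element comes from, it helps to look at what strsplit returns for the same input. The base-R sketch below shows that the leading space produces an empty string as the first piece. The second check assumes that stemCompletion matches stems as prefixes of dictionary words (the output above suggests this, but it is an assumption about tm's internals); an empty prefix matches every entry.

```r
# The leading space yields an empty first element; the trailing one does not.
parts <- strsplit(" mond tues ", " ", fixed = TRUE)[[1]]
print(parts)

# An empty string is a prefix of every dictionary word, so a prefix-based
# lookup (assumed here) will always find a "completion" for it.
empty_matches <- startsWith(c("monday", "tuesday"), "")
print(empty_matches)
```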

This can easily be fixed by removing the empty strings "" before calling stemCompletion, as follows.

stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary=dictionary)
  x <- paste(x, sep="", collapse=" ")
  PlainTextDocument(stripWhitespace(x))
}

myCorpus <- lapply(myCorpus, stemCompletion2, dictionary=myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))

See the detailed example on slide 12 of http://www.rdatamining.com/docs/RDataMining-slides-text-mining.pdf.

Regards

Yanchang Zhao

RdataMining.com