Removing stopwords from a user-defined corpus in R

Question

I have a set of documents:

documents = c("She had toast for breakfast",
 "The coffee this morning was excellent", 
 "For lunch let's all have pancakes", 
 "Later in the day, there will be more talks", 
 "The talks on the first day were great", 
 "The second day should have good presentations too")

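I would like to remove stopwords from this set of documents. I have already removed punctuation and converted the text to lowercase, using: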

documents = tolower(documents) #make it lower case
documents = gsub('[[:punct:]]', '', documents) #remove punctuation

First I convert to a Corpus object:

library(tm) # provides Corpus, VectorSource, tm_map, removeWords, stopwords
documents <- Corpus(VectorSource(documents))

Then I try to remove the stopwords:

documents = tm_map(documents, removeWords, stopwords('english')) #remove stopwords

But this last line results in the following error:

Break on THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC() to debug.

This has already been asked here, but no answer was given. What does this error mean?

EDIT

Yes, I am using the tm package.

Here is the output of sessionInfo():

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

Answer 1

Mha*_*ill 10

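When I run into problems with tm, I often end up just editing the original text.

For removing words this is a little awkward, but you can paste together a regular expression from tm's stopword list and apply it to the character vector directly.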

stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
documents = stringr::str_replace_all(documents, stopwords_regex, '')

> documents
[1] "     toast  breakfast"             " coffee  morning  excellent"      
[3] " lunch lets   pancakes"            "later   day  will   talks"        
[5] " talks   first day  great"         " second day   good presentations "
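
As the output above shows, deleting words in place leaves stray and doubled spaces behind. A minimal sketch of the same regex idea in base R, with a whitespace cleanup step added, might look like the following. The names `my_stopwords`, `docs`, and `remove_stopwords` are made up for illustration, and the short stopword vector is a hand-picked stand-in so the snippet runs without tm; with tm loaded, `stopwords('en')` could be used in its place.

```r
# Hypothetical, hand-picked stopword list (an assumption for this sketch);
# with tm loaded, stopwords('en') could be substituted.
my_stopwords <- c("she", "had", "for", "the", "this", "was", "all", "have",
                  "in", "there", "will", "be", "more", "on", "were",
                  "should", "too")

# Two of the question's documents, already lowercased
docs <- c("she had toast for breakfast",
          "the coffee this morning was excellent")

# One alternation wrapped in word boundaries: \b(?:w1|w2|...)\b
pattern <- paste0("\\b(?:", paste(my_stopwords, collapse = "|"), ")\\b")

remove_stopwords <- function(x, pattern) {
  x <- gsub(pattern, "", x, perl = TRUE)  # delete whole-word matches
  x <- gsub("[[:space:]]+", " ", x)       # collapse runs of whitespace
  trimws(x)                               # strip leading/trailing spaces
}

cleaned <- remove_stopwords(docs, pattern)
print(cleaned)
```

Note that the order of the preprocessing steps matters: in the output above, "lets" survives because the apostrophe was stripped before the stopword pass, so it no longer matches the contracted form in the stopword list.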
