Sta*_*ess 4 r topic-modeling tm
I have a set of documents:
documents = c("She had toast for breakfast",
"The coffee this morning was excellent",
"For lunch let's all have pancakes",
"Later in the day, there will be more talks",
"The talks on the first day were great",
"The second day should have good presentations too")
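I want to remove the stop words from this set of documents. I have already removed punctuation and converted everything to lowercase, using: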
documents = tolower(documents) #make it lower case
documents = gsub('[[:punct:]]', '', documents) #remove punctuation
First I convert to a Corpus object:
documents <- Corpus(VectorSource(documents))
Then I try to remove the stop words:
documents = tm_map(documents, removeWords, stopwords('english')) #remove stopwords
But this last line results in the following error:
Break on THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC() to debug.
Edit
Yes, I am using the tm package.
Here is the output of sessionInfo():
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
Mha*_*ill 10
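When I run into problems with tm, I often end up just editing the original text.
For removing words it is a little awkward, but you can paste together a regex from tm's stop word list.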
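library(tm)  # provides stopwords()
# join the stop words into one alternation, each wrapped in word boundaries
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
# documents is the plain character vector here, not the Corpus object
documents = stringr::str_replace_all(documents, stopwords_regex, '')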
> documents
[1] " toast breakfast" " coffee morning excellent"
[3] " lunch lets pancakes" "later day will talks"
[5] " talks first day great" " second day good presentations "
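The output above still carries leading and doubled spaces where words were removed. Here is a minimal base-R sketch of the same regex idea that also tidies the whitespace. The names `my_stopwords` and `remove_stopwords` are hypothetical, and the tiny hand-picked word list is only a stand-in so the example runs without tm; in practice you would presumably pass `tm::stopwords('en')` instead.

```r
# Hypothetical, hand-picked stop word list so the sketch runs without tm;
# swap in tm::stopwords('en') for real use.
my_stopwords <- c("she", "had", "for", "the", "this", "was")

# Hypothetical helper: strip whole-word matches, then clean up spacing.
remove_stopwords <- function(text, words) {
  # One alternation wrapped in word boundaries: \b(?:w1|w2|...)\b
  pattern <- paste0("\\b(?:", paste(words, collapse = "|"), ")\\b")
  text <- gsub(pattern, "", text, perl = TRUE)
  # Collapse the runs of whitespace left behind by the removals
  text <- gsub("\\s+", " ", text)
  # Trim leading and trailing whitespace
  gsub("^\\s+|\\s+$", "", text)
}

docs <- c("she had toast for breakfast",
          "the coffee this morning was excellent")
cleaned <- remove_stopwords(docs, my_stopwords)
print(cleaned)
```

Like the answer's code, this works on a plain character vector, so it would be applied before (or instead of) building the Corpus.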