Rya*_*ase 0 r corpus text-mining tm
documents <- c("This is document number one", "document two is the second element of the vector")
The data frame I am trying to create is:
idealdf <- c("this", "is", "document", "number", "one", "document", "two", "is", "the", "second", "element", "of", "the", "vector")
I have been using the tm package to convert my documents into a corpus and to remove punctuation, convert to lowercase, and so on, with the following functions:
library(tm)
# create a corpus:
myCorpus <- Corpus(VectorSource(documents))
#convert to lowercase:
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
#remove punctuation:
myCorpus <- tm_map(myCorpus, removePunctuation)
...but I am having trouble getting this into a data frame in which each word has its own row (I would prefer one row per word, even if the same word then appears on several rows).
Thanks.
How about:
library(stringi)
data.frame(words = unlist(stri_extract_all_words(stri_trans_tolower(documents))))
#       words
# 1      this
# 2        is
# 3  document
# 4    number
# 5       one
# 6  document
# 7       two
# 8        is
# 9       the
# 10   second
# 11  element
# 12       of
# 13      the
# 14   vector
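If you would rather keep the tm preprocessing from the question, the cleaned text can be pulled back out of the corpus and split on whitespace. The following is only a sketch, under some assumptions that are not in the original post: it uses `VCorpus` instead of `Corpus`, the names `texts`, `tokens` and `word_df` are made up for illustration, and the extra `doc` column (which document each word came from) is optional.

```r
library(tm)

documents <- c("This is document number one",
               "document two is the second element of the vector")

# Same cleaning steps as in the question (VCorpus is an assumption here)
myCorpus <- VCorpus(VectorSource(documents))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)

# Pull the cleaned text of each document back out as one string
texts <- vapply(myCorpus,
                function(d) paste(as.character(d), collapse = " "),
                character(1))

# Split each document on runs of whitespace: one character vector per document
tokens <- strsplit(trimws(texts), "\\s+")

# One row per word; `doc` records which document the word belongs to
word_df <- data.frame(
  doc   = rep(seq_along(tokens), lengths(tokens)),
  words = unlist(tokens, use.names = FALSE),
  stringsAsFactors = FALSE
)
```

Drop the `doc` column if you only need the words. Repeated words stay on separate rows, as asked for in the question.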