How can I "split" a text document or text string in R so that each word gets its own row in a data frame?

Question

How can I "split" text documents or text strings in R so that each word has its own row in a data frame? For example, given:

documents <- c("This is document number one", "document two is the second element of the vector")

The data frame I am trying to create would hold these words, one per row:

idealdf <- c("this", "is", "document", "number", "one", "document", "two", "is", "the", "second", "element", "of", "the", "vector")

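I have been using the tm package to turn my documents into a corpus, then removing punctuation, converting to lowercase, and so on, with the following functions: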

library(tm)

# create a corpus:
myCorpus <- Corpus(VectorSource(documents))

# convert to lowercase:
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

# remove punctuation:
myCorpus <- tm_map(myCorpus, removePunctuation)

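...but I am having trouble getting the result into a data frame in which each word has its own row. (I would prefer every occurrence to get its own row, even if the same word then appears on several rows.)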

Thanks.

Answer 1

Ric*_*ven 5

How about:

library(stringi)
data.frame(words = unlist(stri_extract_all_words(stri_trans_tolower(documents))))
#       words
# 1      this
# 2        is
# 3  document
# 4    number
# 5       one
# 6  document
# 7       two
# 8        is
# 9       the
# 10   second
# 11  element
# 12       of
# 13      the
# 14   vector
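
If installing stringi is not an option, roughly the same result can be sketched in base R. This is only an illustration, not part of the original answer: the names `cleaned`, `word_list`, and `words_df`, and the extra `doc` column recording which document each word came from, are assumptions added here. It also assumes that words are separated by whitespace and that punctuation can simply be dropped.

```r
documents <- c("This is document number one",
               "document two is the second element of the vector")

# Lowercase the text and drop punctuation (assumption: punctuation carries no meaning here)
cleaned <- gsub("[[:punct:]]", "", tolower(documents))

# Split each document on runs of whitespace; the result is a list with one character vector per document
word_list <- strsplit(trimws(cleaned), "[[:space:]]+")

# One row per word occurrence; `doc` is a hypothetical extra column holding the source document index
words_df <- data.frame(
  doc  = rep(seq_along(word_list), lengths(word_list)),
  word = unlist(word_list),
  stringsAsFactors = FALSE
)

print(words_df)
```

Repeated words stay on separate rows, which matches what the question asks for. The `doc` column can be dropped if only the words are needed.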
