Converting a corpus to a data.frame in R

Cri*_*ira 6 r corpus dataframe tm

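I am using the tm package to apply stemming, and I need to convert the resulting data into a data frame. A solution is given in "R tm package vcorpus: Error in converting corpus to data frame", but in my case the contents of the corpus look like this: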

[[2195]]
i was very impress

instead of

[[2195]]
"i was very impress"

So if I apply

data.frame(text=unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors=FALSE)

the result is

<NA>.

Any help is much appreciated!

Take the following code as an example:

library(tm)  # stemDocument() also needs the SnowballC package installed
sentence <- c("a small thread was loose on the sandals, otherwise it looked good")
mycorpus <- Corpus(VectorSource(sentence))
mycorpus <- tm_map(mycorpus, stemDocument, language = "english")

inspect(mycorpus)

[[1]]
a small thread was loo on the sandals, otherwi it look good

data.frame(text=unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors=FALSE)

  text
1 <NA>
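One plausible reading of the `<NA>` result, assuming each document is stored as a character vector rather than as a list with a `content` element: indexing a character vector by a name it does not have returns NA. The base-R sketch below reproduces that without tm. The objects `doc` and `list_doc` are hypothetical stand-ins for a single corpus element, not something taken from the question.

```r
# Assumption: the document is a bare character vector (hypothetical stand-in
# for one corpus element); it has no element named "content".
doc <- "i was very impress"

# Indexing a character vector by a name it does not have yields NA,
# which is what the one-liner from the question ends up collecting.
by_name <- unlist(sapply(list(doc), `[`, "content"))
is.na(by_name)

# A list-like document that does carry a "content" element behaves
# the way the one-liner expects.
list_doc <- list(content = "i was very impress")
from_list <- unlist(sapply(list(list_doc), `[`, "content"))

# For the character-vector case, as.character() recovers the text.
df <- data.frame(text = vapply(list(doc), as.character, character(1)),
                 stringsAsFactors = FALSE)
df
```

If that reading holds, extracting the text with `as.character()` instead of `[` should sidestep the problem regardless of how the document is stored.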

Cri*_*ira 2

By applying

gsub("http\\w+", "", mycorpus)

the output has class character, so it works in my case.
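As an alternative to the `gsub()` route (not the poster's method), a commonly used pattern is to pull the text out of each document with `as.character()`. The sketch below assumes a recent tm version with the SnowballC package installed, and it uses `VCorpus()` rather than `Corpus()` so that each element is a `PlainTextDocument`; behaviour may differ on older tm releases such as the one in the question.

```r
library(tm)  # assumes tm and SnowballC are installed

sentence <- c("a small thread was loose on the sandals, otherwise it looked good")

# Assumption: VCorpus() is used so every element is a PlainTextDocument.
mycorpus <- VCorpus(VectorSource(sentence))
mycorpus <- tm_map(mycorpus, stemDocument, language = "english")

# as.character() extracts the text of each document, one string per
# single-line document, giving one data frame row per document.
df <- data.frame(text = sapply(mycorpus, as.character),
                 stringsAsFactors = FALSE)
str(df)
```

The resulting `df$text` is a plain character column, so it can be passed to the usual string functions.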