Post by mom*_*obo

tm loses metadata when applying tm_map

I have a (small) problem with the tm R library. Say I have a corpus:

# boilerplate
library(tm)
bcorp <- c("one","two","three","four","five")
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)

Result:

[1] "1" "2" "3" "4" "5"

This works. But when I try to apply a transformation with tm_map():

# this does not work
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
tdm <- TermDocumentMatrix(myCorpus)

Error: inherits(doc, "TextDocument") is not TRUE

The solution usually proposed for this error is to convert the documents to PlainTextDocument.

# this works but erases the metadata
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)

Result:

[1] "character(0)" "character(0)" …
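
One possible way around the lost document IDs, sketched here on the assumption that tm 0.6 or later is installed (where `content_transformer()` is available), is to wrap `tolower` so that it changes only each document's content and leaves the document object and its metadata in place. The use of `VCorpus` instead of `Corpus` is also an assumption made for this sketch, not something from the original post.

```r
library(tm)

bcorp <- c("one", "two", "three", "four", "five")

# Assumption: VCorpus is used here so each element is a full text document
# object that carries its own metadata (including its id).
myCorpus <- VCorpus(VectorSource(bcorp))

# content_transformer() wraps a plain character function so that tm_map()
# modifies only the content of each document, not the document itself.
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

tdm <- TermDocumentMatrix(myCorpus)

# The document names should now come from the retained ids
# rather than showing up as "character(0)".
print(Docs(tdm))
```

If this behaves as expected, the extra `tm_map(myCorpus, PlainTextDocument)` step is no longer needed, since the documents never stop being text documents.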
