tm loses metadata when applying tm_map

mom*_*obo 3 metadata r tm

I have a (small) problem with the tm R library. Say I have a corpus:

# boilerplate
library(tm)
bcorp <- c("one","two","three","four","five")
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)

Result:

[1] "1" "2" "3" "4" "5"

This works. But when I try to apply a transformation with tm_map():

# this does not work
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
tdm <- TermDocumentMatrix(myCorpus)

Error: inherits(doc, "TextDocument") is not TRUE

The fix usually suggested for this error is to convert the documents to PlainTextDocument.

# this works but erases the metadata
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)

Result:

[1] "character(0)" "character(0)" "character(0)" "character(0)" "character(0)"

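Now it runs, but it erases all the metadata (in this case, the document names). Is there a way to preserve the metadata, or to save it and restore it afterwards?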

mom*_*obo 8

I found it.

This line:

myCorpus <- tm_map(myCorpus, PlainTextDocument)

solves the error but deletes the metadata.
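A likely reason for the "character(0)" names: a document built by calling PlainTextDocument() on bare content starts with default, empty metadata unless values are passed in. The snippet below is only a small sketch of that behaviour; the id "doc1" is a made-up value for illustration.

```r
library(tm)

# A document built from content alone gets default (empty) metadata
bare <- PlainTextDocument("one")
bare_id <- meta(bare, "id")      # empty character vector

# Passing id explicitly keeps a name ("doc1" is a hypothetical id)
named <- PlainTextDocument("one", id = "doc1")
named_id <- meta(named, "id")
```

So mapping PlainTextDocument over the corpus rebuilds every document from scratch, and the original ids are not carried over.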

I found this answer, which explains a better way to use tm_map(). I just needed to replace:

myCorpus <- tm_map(myCorpus, tolower)

with:

myCorpus <- tm_map(myCorpus, content_transformer(tolower))

Everything works, and the document names are preserved!
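For completeness, here is a minimal end-to-end sketch of the fix. It assumes a tm version that provides content_transformer(), and it uses VCorpus() explicitly (my assumption, since the class returned by Corpus() may differ between tm versions). The helper lower_keep_meta is a hypothetical hand-written equivalent, included only to show the idea: change the content of each document and return the same document object, so its metadata stays attached.

```r
library(tm)

bcorp <- c("one", "two", "three", "four", "five")

# VCorpus is chosen explicitly (assumption) so each element is a PlainTextDocument
myCorpus <- VCorpus(VectorSource(bcorp), readerControl = list(language = "en_US"))

# Hypothetical hand-written equivalent of content_transformer(tolower):
# modify only the content and return the document itself
lower_keep_meta <- function(doc) {
  content(doc) <- tolower(content(doc))
  doc
}

via_helper  <- tm_map(myCorpus, content_transformer(tolower))
via_wrapper <- tm_map(myCorpus, lower_keep_meta)

# Document names taken from the term-document matrices
ids_helper  <- Docs(TermDocumentMatrix(via_helper))
ids_wrapper <- Docs(TermDocumentMatrix(via_wrapper))
print(ids_helper)
```

Both variants should report the same document names as the untransformed corpus, because neither rebuilds the documents.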