In R, I create a Document-Term-Matrix using the tm package:
dtm <- DocumentTermMatrix(cor, control = list(dictionary=c("someTerm")))
which results in something like this:
A document-term matrix (291 documents, 1 terms)
Non-/sparse entries: 48/243
Sparsity : 84%
Maximal term length: 8
Weighting : term frequency (tf)
Terms
Docs someTerm
doc1 0
doc2 0
doc3 7
doc4 22
doc5 0
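Now I would like to filter this Document-Term-Matrix by the number of occurrences of someTerm in each document. For example, keep only the documents in which someTerm appears at least once, i.e. doc3 and doc4.
How can I achieve this?
It is very similar to how you would subset a regular R matrix. For example, to create a document-term matrix from the example Reuters dataset that keeps only the rows (documents) in which "would" appears more than once: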
reut21578 <- system.file("texts", "crude", package = "tm")
reuters <- VCorpus(DirSource(reut21578),
readerControl = list(reader = readReut21578XMLasPlain))
dtm <- DocumentTermMatrix(reuters)
v <- as.vector(dtm[,"would"]>1)
dtm2 <- dtm[v, ]
> inspect(dtm2[, "would"])
A document-term matrix (3 documents, 1 terms)
Non-/sparse entries: 3/0
Sparsity : 0%
Maximal term length: 5
Weighting : term frequency (tf)
Terms
Docs would
246 2
489 2
502 2
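The same idea carries over to the "at least once" case from the question. The following is only a minimal sketch: the toy corpus, the lower-case term "someterm", and the names `texts`, `corp`, `counts`, `keep` and `dtm_filtered` are made up for illustration and stand in for the asker's `cor` object. It pulls the single column out as a dense matrix (cheap for one term) and uses the resulting logical vector to subset the rows.

```r
library(tm)

# Hypothetical toy corpus standing in for the asker's `cor` object
texts <- c("nothing relevant here",
           "still nothing",
           "someterm appears once",
           "someterm and someterm again")
corp <- VCorpus(VectorSource(texts))

# Restrict the matrix to a single dictionary term, as in the question
dtm <- DocumentTermMatrix(corp, control = list(dictionary = c("someterm")))

# Per-document counts of the term, as a plain numeric vector
counts <- as.vector(as.matrix(dtm[, "someterm"]))

# Keep only the documents in which the term occurs at least once
keep <- counts >= 1
dtm_filtered <- dtm[keep, ]

inspect(dtm_filtered)
```

Changing the threshold (for example `counts >= 5`) filters on any other minimum number of occurrences.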
A tm document-term matrix is a simple triplet matrix from the slam package, so the slam documentation helps in figuring out how to manipulate DTMs.
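As a rough illustration of what slam offers here, the sketch below uses `row_sums()` and `col_sums()` from slam on a small made-up corpus (the texts and variable names are assumptions, not part of the original question) to filter documents by their total term count rather than by a single term.

```r
library(tm)
library(slam)

# Hypothetical three-document corpus, for illustration only
texts <- c("oil prices rise", "oil oil oil", "markets calm")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# A DocumentTermMatrix also carries the slam class
is_triplet <- inherits(dtm, "simple_triplet_matrix")

# slam helpers work on the sparse representation directly
doc_totals  <- row_sums(dtm)   # total number of term occurrences per document
term_totals <- col_sums(dtm)   # total occurrences of each term across documents

# Keep only documents containing at least three term occurrences overall
dtm_long <- dtm[as.vector(doc_totals >= 3), ]
```

Because these helpers never convert the whole matrix to a dense one, they stay practical on corpora much larger than this example.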