DocumentTermMatrix in the tm package does not return all words

Question

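I am using the tm package in R to create a document-term matrix, but some of the words in my corpus get lost in the process.

Let me explain with an example. Suppose I have this small corpus: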

library(tm)
crps <- " more hours to my next class bout to go home and go night night"
crps <- VCorpus(VectorSource(crps))

When I call DocumentTermMatrix() from the tm package on it, I get the following result:

dm <- DocumentTermMatrix(crps)
dm_matrix <- as.matrix(dm)
dm_matrix
# Terms
# Docs and bout class home hours more next night
# 1   1    1     1    1     1    1    1     2

However, what I want (and expected) is:

# Docs and bout class home hours more next night my  go to
#  1   1    1     1    1     1    1    1     2   1   2  1

Why does DocumentTermMatrix() skip the words "my", "go" and "to"? Is there a way to control this behavior and fix it?

Answer 1

Ken*_*HBS 5

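By default, DocumentTermMatrix() discards words shorter than three characters. That is why the words "to", "my" and "go" are not considered when the document-term matrix is built.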

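The help page ?DocumentTermMatrix shows that there is an optional argument named control. This argument has a number of default settings (see the help page ?termFreq for details). One of those defaults is a minimum word length of three characters, i.e. wordLengths = c(3, Inf). You can change this setting so that all words are kept regardless of their length: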

# Lower the minimum word length from the default of 3 to 1 so short terms are kept
dm <- DocumentTermMatrix(crps, control = list(wordLengths = c(1, Inf)))

inspect(dm)
# <<DocumentTermMatrix (documents: 1, terms: 11)>>
# Non-/sparse entries: 11/0
# Sparsity           : 0%
# Maximal term length: 5
# Weighting          : term frequency (tf)
#
#    Terms
# Docs and bout class go home hours more my next night to
#    1   1    1     1  2    1     1    1  1    1     2  2

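To check which terms the default length filter removes, one option is to build the matrix twice and compare the term lists. The following is only a minimal sketch, assuming the same one-document corpus as in the question; the variable names txt, dtm_default, dtm_all and dropped are made up for illustration.

```r
library(tm)

# Same one-document corpus as in the question.
txt  <- " more hours to my next class bout to go home and go night night"
crps <- VCorpus(VectorSource(txt))

# Default control settings: terms shorter than three characters are dropped.
dtm_default <- DocumentTermMatrix(crps)

# A lower bound of 1 keeps every term regardless of its length.
dtm_all <- DocumentTermMatrix(crps, control = list(wordLengths = c(1, Inf)))

# Terms that only appear once the length filter is relaxed.
dropped <- setdiff(Terms(dtm_all), Terms(dtm_default))
print(dropped)
```

For this corpus the difference should consist of exactly the short words the question asks about. The second element of wordLengths is the upper bound, so the same setting can also be used to exclude very long tokens if needed.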
