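I am trying to build two document-term matrices for a corpus, one with unigrams and one with bigrams. However, the bigram matrix currently comes out identical to the unigram matrix, and I am not sure why.
Code: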
library(tm)     # Corpus, DirSource, DocumentTermMatrix, stopwords, inspect
library(RWeka)  # NGramTokenizer, Weka_control

docs <- Corpus(DirSource("data", recursive = TRUE))

# Tokenizer that emits only two-word n-grams
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Get the document term matrices
dtm_unigram <- DocumentTermMatrix(docs, control = list(tokenize = "words",
                                                       removePunctuation = TRUE,
                                                       stopwords = stopwords("english"),
                                                       stemming = TRUE))
dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer,
                                                      removePunctuation = TRUE,
                                                      stopwords = stopwords("english"),
                                                      stemming = TRUE))

inspect(dtm_unigram)
inspect(dtm_bigram)
I also tried using ngram(x, n=2) from the ngram package as the tokenizer, but that did not work either. How can I fix the bigram tokenization?
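To make concrete what the bigram matrix is expected to contain, here is a minimal sketch in plain Python. It does not use tm or RWeka and is not a fix for the R code above. The helper names bigrams and term_counts and the two sample sentences are made up for illustration. It only shows that a bigram tokenizer should yield adjacent word pairs, so the two matrices should have different terms.

```python
from collections import Counter


def bigrams(text):
    """Return adjacent word pairs from lowercased, whitespace-split text."""
    words = text.lower().split()
    # zip(words, words[1:]) pairs each word with its successor
    return [" ".join(pair) for pair in zip(words, words[1:])]


def term_counts(docs, tokenizer):
    """Build one term-count row per document (a toy document-term matrix)."""
    return [Counter(tokenizer(doc)) for doc in docs]


# Hypothetical sample documents, not taken from the question
docs = ["the quick brown fox", "the quick red fox"]

unigram_rows = term_counts(docs, str.split)
bigram_rows = term_counts(docs, bigrams)

print(sorted(unigram_rows[0]))
print(sorted(bigram_rows[0]))
```

If the bigram rows list single words, as in the problem described above, the custom tokenizer is most likely not being applied at all.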
Why does overriding _generate_next_value_ only take effect when it is done in the LAST enum in the inheritance list?
For example:
from enum import Enum, auto
from inspect import getmro


class AutoEnum(Enum):
    def _generate_next_value_(name, start, count, last_values):
        return 'overriding _generate_next_value'


class OtherEnum(Enum):
    def some_other_method(self):
        return


class AutoOther(AutoEnum, OtherEnum):
    TEST = auto()


class OtherAuto(OtherEnum, AutoEnum):
    TEST = auto()


print(f'{AutoOther.TEST}: mro={getmro(AutoOther)}\n'
      f'name: {AutoOther.TEST.name}, value: {AutoOther.TEST.value}')
print(f'{OtherAuto.TEST}: mro={getmro(OtherAuto)}\n'
      f'name: {OtherAuto.TEST.name}, value: {OtherAuto.TEST.value}')
Output:
AutoOther.TEST: mro=(<enum 'AutoOther'>, <enum 'AutoEnum'>, <enum 'OtherEnum'>, <enum 'Enum'>, <class 'object'>)
name: TEST, value: 1
OtherAuto.TEST: mro=(<enum 'OtherAuto'>, <enum 'OtherEnum'>, <enum 'AutoEnum'>, <enum 'Enum'>, <class 'object'>)
name: TEST, value: overriding _generate_next_value
If _generate_next_value_ is somehow reset to the default every time (whenever it is not explicitly overridden), then when inheriting from Enum …
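Given the output above, two arrangements should give the overridden value. This is a sketch under the assumption that the behavior shown in the question holds on your Python version, not an explanation of the enum internals. The first lists the overriding enum last among the bases, as OtherAuto does. The second keeps the AutoOther base order but defines _generate_next_value_ in the final class body before any member, which is the placement the enum documentation requires for this hook.

```python
from enum import Enum, auto


class AutoEnum(Enum):
    def _generate_next_value_(name, start, count, last_values):
        return 'overriding _generate_next_value'


class OtherEnum(Enum):
    def some_other_method(self):
        return


# Option 1: list the enum that carries the override last among the bases.
class OtherAuto(OtherEnum, AutoEnum):
    TEST = auto()


# Option 2: keep the base order, but define the hook in the class that
# declares the members, before the first member, and delegate to AutoEnum.
class AutoOther(AutoEnum, OtherEnum):
    def _generate_next_value_(name, start, count, last_values):
        return AutoEnum._generate_next_value_(name, start, count, last_values)

    TEST = auto()


print(OtherAuto.TEST.value)
print(AutoOther.TEST.value)
```

Option 2 costs a few lines of delegation per class, but it does not depend on the order in which the bases are written.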