CountVectorizer converts words to lowercase

Question

Gha*_*nem 5 python scikit-learn countvectorizer

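In my classification model I need to preserve uppercase letters, but when I build the vocabulary with sklearn's CountVectorizer, the uppercase letters are converted to lowercase!

To rule out implicit tokenization, I wrote a tokenizer that only splits the text on whitespace and does nothing else.

My code: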

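import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

co = dict()

def tokenizeManu(txt):
    # Split on whitespace only; no other processing.
    return txt.split()

def corpDict(x):
    print('1: ', x)
    count = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenizeManu)
    countFit = count.fit_transform(x)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    vocab = count.get_feature_names_out()
    # Sum each column to get the total count per vocabulary term.
    dist = np.sum(countFit.toarray(), axis=0)
    for tag, n in zip(vocab, dist):
        co[str(tag)] = int(n)

x = ['I\'m John Dev', 'We are the only']

corpDict(x)
print(co)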

Output:

1:  ["I'm John Dev", 'We are the only'] #<- before building the vocab.
{'john': 1, 'the': 1, 'we': 1, 'only': 1, 'dev': 1, "i'm": 1, 'are': 1} #<- after


Answer 1

Moh*_*OUI 6

As explained in the documentation, CountVectorizer has a lowercase parameter that defaults to True. To disable this behavior, set lowercase=False as follows:

count  = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenizeManu, lowercase=False)
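
The custom tokenizer alone does not help because CountVectorizer lowercases the text in its preprocessing step, before the tokenizer is called. As a minimal sketch of the difference, the snippet below fits the vectorizer twice on the two sentences from the question and compares the vocabularies. The names `whitespace_tokenizer` and `build_vocab` are made up for this illustration and are not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I'm John Dev", "We are the only"]

def whitespace_tokenizer(text):
    # Hypothetical helper: split on whitespace only, like tokenizeManu in the question.
    return text.split()

def build_vocab(lowercase):
    # Hypothetical helper: fit a vectorizer and return its vocabulary, sorted.
    vec = CountVectorizer(tokenizer=whitespace_tokenizer, lowercase=lowercase)
    vec.fit(docs)
    return sorted(vec.vocabulary_)

lowered = build_vocab(lowercase=True)     # default behavior
preserved = build_vocab(lowercase=False)  # case is kept

print(lowered)    # ['are', 'dev', "i'm", 'john', 'only', 'the', 'we']
print(preserved)  # ['Dev', "I'm", 'John', 'We', 'are', 'only', 'the']
```

Recent scikit-learn versions may emit a warning that token_pattern is unused when a tokenizer is supplied; it does not affect the result.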

