Gha*_*nem 5 python scikit-learn countvectorizer
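In my classification model I need to preserve uppercase letters, but when I build the vocabulary with sklearn's CountVectorizer, uppercase letters are converted to lowercase!
To rule out implicit tokenization, I wrote a tokenizer that only splits the text on whitespace and does nothing else.
My code: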
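import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

co = dict()

def tokenizeManu(txt):
    # Split on whitespace only; no other processing.
    return txt.split()

def corpDict(x):
    print('1: ', x)
    vectorizer = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenizeManu)
    countFit = vectorizer.fit_transform(x)
    # get_feature_names() on scikit-learn < 1.0
    vocab = vectorizer.get_feature_names_out()
    dist = np.sum(countFit.toarray(), axis=0)
    # Map each vocabulary term to its total count over the corpus.
    for tag, n in zip(vocab, dist):
        co[str(tag)] = int(n)

x = ['I\'m John Dev', 'We are the only']
corpDict(x)
print(co)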
Output:
1: ["I'm John Dev", 'We are the only'] #<- before building the vocab.
{'john': 1, 'the': 1, 'we': 1, 'only': 1, 'dev': 1, "i'm": 1, 'are': 1} #<- after
As explained in the documentation, CountVectorizer has a lowercase parameter that defaults to True. To disable this behavior, set lowercase=False, like this:
count = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenizeManu, lowercase=False)
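To see the effect side by side, here is a minimal sketch that fits the same two sentences with and without lowercasing. The names whitespace_tokenizer and vocabulary are hypothetical helpers introduced only for this illustration. The vocabulary is read from the fitted vocabulary_ attribute, so the snippet should not depend on which feature-name method your scikit-learn version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

def whitespace_tokenizer(text):
    # Hypothetical helper: split on whitespace only, leaving case untouched.
    return text.split()

docs = ["I'm John Dev", "We are the only"]

def vocabulary(lowercase):
    # Hypothetical helper: fit a vectorizer and return its sorted vocabulary terms.
    vectorizer = CountVectorizer(tokenizer=whitespace_tokenizer, lowercase=lowercase)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)

lowered = vocabulary(lowercase=True)     # default behavior
preserved = vocabulary(lowercase=False)  # case is kept

print(lowered)
print(preserved)
```

With the default setting the text is lowercased before it reaches the tokenizer, which is why a custom tokenizer alone does not keep the capitals. Depending on the scikit-learn version, passing a tokenizer may also emit a warning that token_pattern is unused; it is harmless here.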