How do I vectorize and de-vectorize text with sklearn's CountVectorizer?

Eka*_*Eka 1 python scikit-learn sklearn-pandas

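I want to vectorize some text into corresponding integers, convert that text into its mapped integers, and then build new sentences from a new sequence of input integers, e.g. [2,9,39,46,56,12,89,9].

I have seen some custom functions that can be used for this, but I would like to know whether sklearn itself provides such functionality.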

from sklearn.feature_extraction.text import CountVectorizer

a=["""Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Morbi imperdiet mauris posuere, condimentum odio et, volutpat orci.
Curabitur sodales vulputate eros eu gravida. Sed pharetra imperdiet nunc et tempor.
Nullam lectus est, rhoncus vitae lacus at, fermentum aliquam metus.
Phasellus a sollicitudin tortor, non tempor nulla.
Etiam mattis felis enim, a malesuada ligula dignissim at.
Integer congue dolor ut magna blandit, lobortis consequat ante aliquam.
Nulla imperdiet libero eget lorem sagittis, eget iaculis orci dignissim. 
Phasellus sit amet sodales odio. Pellentesque commodo tempor risus, et tincidunt neque. 
Praesent et sem velit. Maecenas id risus sit amet ex convallis ultrices vel sed purus. 
Sed fringilla, leo quis congue sollicitudin, mauris nunc vehicula mi, et laoreet ligula 
urna et nulla. Nam sollicitudin urna sed dolor vehicula euismod. Mauris bibendum pulvinar
ornare. In suscipit sed mi ut posuere.
Proin egestas, nibh ut egestas mattis, ipsum nulla bibendum enim, ac suscipit nisl justo 
id metus. Nam est dui, elementum eget suscipit nec, aliquam in mi. Integer tortor erat,
aliquet at sapien et, fringilla posuere leo. Praesent non congue est. Vivamus tincidunt
tellus eu placerat tincidunt. Phasellus convallis lacus vitae ex congue efficitur.
Sed ut bibendum massa, vitae molestie ligula. Phasellus purus felis, fermentum vitae 
hendrerit vel, vulputate quis metus."""]


vec = CountVectorizer()
dtm = vec.fit_transform(a)
print(vec.vocabulary_)

# convert text to corresponding vectors
# mapped_a = ?

# new sentence using below mapped values
# input [2,9,39,46,56,12,89,9]
# creating sentence using specific sequence

# new_sentence = ?

Jak*_*ina 5

To vectorize a sentence into integers, use the transform method. Its output is a feature vector holding the count of each term.

vec = CountVectorizer()
vec.fit(a)
print(vec.vocabulary_)

new_sentence = "dolor nulla enim"
mapped_a = vec.transform([new_sentence])
print(mapped_a.toarray())  # dense view of the sparse count vector

# build_tokenizer() only splits the text; it does not lowercase it
tokenizer = vec.build_tokenizer()
# print the vocabulary id of each word (None if the word is unknown)
for token in tokenizer(new_sentence):
    print(vec.vocabulary_.get(token))
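
The token loop above uses build_tokenizer, which splits text but does not apply the vectorizer's lowercasing, so a capitalized word would come back as None. A minimal sketch of an alternative using build_analyzer, which runs the full preprocessing pipeline. The two-sentence corpus and the helper name `encode` are assumptions made for illustration, not part of the original answer:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in corpus (an assumption; use your own documents here).
corpus = ["Lorem ipsum dolor sit amet", "Dolor nulla enim"]

vec = CountVectorizer()
vec.fit(corpus)

def encode(sentence, vectorizer):
    """Map a sentence to the list of vocabulary ids of its known tokens."""
    # build_analyzer() applies preprocessing (lowercasing) and tokenization,
    # i.e. the same steps the vectorizer used while fitting.
    analyzer = vectorizer.build_analyzer()
    vocab = vectorizer.vocabulary_
    # Tokens missing from the vocabulary are skipped (a design choice).
    return [vocab[token] for token in analyzer(sentence) if token in vocab]

ids = encode("Dolor NULLA unknownword enim", vec)
print(ids)
```

Unlike transform, this keeps word order and repeated words, which is what is needed to rebuild a sentence later.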

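The second part of the question is less straightforward. CountVectorizer has an inverse_transform method for this purpose, which takes a sparse feature vector as input. In your example, however, you want to build a sentence in which the same term may occur more than once, and that is not possible with this method: it returns each term at most once and does not preserve word order.

The solution is to take the vocabulary (word to id) and build an inverse vocabulary (id to word) from it. CountVectorizer has no inverse_vocabulary attribute by default, so you have to create it from vocabulary_.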

import numpy as np

ids = [2, 9, 9]

# 1. inverse_transform function
# build a 0/1 indicator row over the vocabulary (order and repeats are lost)
indicator = [1 if i in ids else 0 for i in range(len(vec.vocabulary_))]
# inverse_transform expects a 2-D input: one row per document
print(vec.inverse_transform([indicator]))
# yields the terms 'aliquam' and 'commodo' (once each)


# 2. Inverse vocabulary - custom solution
terms = np.array(list(vec.vocabulary_.keys()))
indices = np.array(list(vec.vocabulary_.values()))
inverse_vocabulary = terms[np.argsort(indices)]

for i in ids:
    print(inverse_vocabulary[i])
# prints aliquam, commodo, commodo (one per line)
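
The inverse vocabulary can also be kept as a plain dict instead of a NumPy array. A minimal sketch, assuming `vocabulary` is a hand-written stand-in for `vec.vocabulary_` (it lists only ten terms) and `decode` is a hypothetical helper name, not an sklearn function:

```python
# Stand-in for vec.vocabulary_ (term -> id); in real use, pass vec.vocabulary_.
vocabulary = {
    "ac": 0, "adipiscing": 1, "aliquam": 2, "aliquet": 3, "amet": 4,
    "ante": 5, "at": 6, "bibendum": 7, "blandit": 8, "commodo": 9,
}

def decode(ids, vocabulary):
    """Turn a sequence of ids back into a space-joined sentence."""
    # Flip the mapping: id -> term.
    inverse_vocabulary = {idx: term for term, idx in vocabulary.items()}
    # Ids without a matching term are skipped instead of raising KeyError
    # (an assumption; raising may suit stricter pipelines better).
    return " ".join(
        inverse_vocabulary[i] for i in ids if i in inverse_vocabulary
    )

sentence = decode([2, 9, 9], vocabulary)
print(sentence)
```

The rebuilt sentence contains only lowercased vocabulary terms; punctuation, casing, and any words dropped during tokenization cannot be recovered this way.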