tum*_*eed 1 nlp numpy machine-learning scipy scikit-learn
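I have an NLP task and I am using scikit-learn. From the tutorials I have read, I understand that the text must be vectorized and that the vectorized representation is then fed to a classification algorithm. Suppose I have some text that I want to vectorize as follows: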
from sklearn.feature_extraction.text import CountVectorizer
corpus =['''Computer science is the scientific and
practical approach to computation and its applications.'''
#this is another opinion
'''It is the systematic study of the feasibility, structure,
expression, and mechanization of the methodical
procedures that underlie the acquisition,
representation, processing, storage, communication of,
and access to information, whether such information is encoded
as bits in a computer memory or transcribed in genes and
protein structures in a biological cell.'''
#anotherone
'''A computer scientist specializes in the theory of
computation and the design of computational systems''']
vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(corpus)
print(X)
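The problem is that I don't understand what the output means, and I don't see any relationship between the text and the matrix returned by the vectorizer: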
(0, 12) 3
(0, 33) 1
(0, 20) 3
(0, 45) 7
(0, 34) 1
(0, 2) 6
(0, 28) 1
(0, 4) 1
(0, 47) 2
(0, 10) 2
(0, 22) 1
(0, 3) 1
(0, 21) 1
(0, 42) 1
(0, 40) 1
(0, 26) 5
(0, 16) 1
(0, 38) 1
(0, 15) 1
(0, 23) 1
(0, 25) 1
(0, 29) 1
(0, 44) 1
(0, 49) 1
(0, 1) 1
: :
(0, 30) 1
(0, 37) 1
(0, 9) 1
(0, 0) 1
(0, 19) 2
(0, 50) 1
(0, 41) 1
(0, 14) 1
(0, 5) 1
(0, 7) 1
(0, 18) 4
(0, 24) 1
(0, 27) 1
(0, 48) 1
(0, 17) 1
(0, 31) 1
(0, 39) 1
(0, 6) 1
(0, 8) 1
(0, 35) 1
(0, 36) 1
(0, 46) 1
(0, 13) 1
(0, 11) 1
(0, 43) 1
Also, I don't understand what happens to the output when I use the toarray() method:
print(X.toarray())
What exactly does this output mean, and how does it relate to the corpus?
[[1 1 6 1 1 1 1 1 1 1 2 1 3 1 1 1 1 1 4 2 3 1 1 1 1 1 5 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 7 1 2 1 1 1]]
小智 5
CountVectorizer produces a document-term matrix. For a simple example, consider the following simplified code:
from sklearn.feature_extraction.text import CountVectorizer
corpus =['''computer hardware''',
'''computer data and software data''']
vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(corpus)
print(X)
print(X.toarray())
You have two documents (the elements of the corpus) and five terms (the words). You can count the terms in each document as follows:
      | and  computer  data  hardware  software
      +----------------------------------------
doc 0 |      1               1
doc 1 | 1    1         2               1
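X represents the matrix above in associative form, that is, as a map from (row, column) to term frequency, and X.toarray() shows X as a dense array (a list of lists). Here is the result of running the code: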
(1, 0) 1
(0, 1) 1
(1, 1) 1
(1, 2) 2
(0, 3) 1
(1, 4) 1
[[0 1 0 1 0]
[1 1 2 0 1]]
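To make the link between the column numbers and the words concrete, here is a minimal pure-Python sketch of the same counting. It is not scikit-learn's implementation: the `tokenize` helper and the `column`, `sparse`, and `dense` names are made up for illustration, and the regular expression is assumed to mirror CountVectorizer's default tokenization (lowercased words of two or more characters). The terms are sorted alphabetically, so column 0 is "and", column 1 is "computer", and so on.

```python
import re
from collections import Counter

corpus = ["computer hardware",
          "computer data and software data"]

# Assumed tokenizer: lowercase, then keep words of two or more characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    return TOKEN.findall(doc.lower())

# Vocabulary: every distinct term, sorted, mapped to a column index.
terms = sorted({t for doc in corpus for t in tokenize(doc)})
column = {term: j for j, term in enumerate(terms)}

# Sparse view: (row, column) -> count, stored only for non-zero entries.
sparse = {}
for i, doc in enumerate(corpus):
    for term, n in Counter(tokenize(doc)).items():
        sparse[(i, column[term])] = n

# Dense view: one row per document, one column per term, zeros filled in.
dense = [[sparse.get((i, j), 0) for j in range(len(terms))]
         for i in range(len(corpus))]

print(terms)
print(sparse)
print(dense)
```

With the real vectorizer, the same term-to-column mapping should be available from the fitted object's `vocabulary_` attribute (and from `get_feature_names_out()` in recent scikit-learn releases).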
As @dmcc noted, you omitted the commas between the strings in your list, which leaves corpus with only one document.
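The effect of the missing commas can be checked without scikit-learn at all. The short sketch below (the list names are made up for illustration) shows that Python joins adjacent string literals into a single string, even when a comment line sits between them, so the list ends up with one element instead of two.

```python
# Adjacent string literals are concatenated by Python into one string,
# even across lines and even with a comment in between.
without_commas = ['''computer hardware'''
                  # this comment does not separate the two literals
                  '''computer data and software data''']

# With a comma, each literal becomes its own list element (its own document).
with_commas = ['''computer hardware''',
               '''computer data and software data''']

print(len(without_commas))
print(len(with_commas))
print(without_commas[0])
```

This is why every entry in the question's output has row index 0: the vectorizer only ever saw a single document.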