Converting phrases to vectors in Python using a while loop

Raf*_*nez 6 python scikit-learn

I want to convert the following phrases into sklearn vectors:

Article 1. It is not good to eat pizza after midnight
Article 2. I wouldn't survive a day withouth stackexchange
Article 3. All of these are just random phrases
Article 4. To prove if my experiment works.
Article 5. The red dog jumps over the lazy fox

I have the following code:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)

n=0
while n < 5:
   n = n + 1
   a = ('Article %(number)s' % {'number': n})
   print(a)
   with open("LISR2.txt") as openfile:
     for line in openfile:
       if a in line:
           X=line
           print(vectorizer.fit_transform(X))

This gives me the following error:

ValueError: Iterable over raw text documents expected, string object received.

Why does this happen? I know this should work, because if I enter the phrases by hand:

X=("It is not good to eat pizza","I wouldn't survive a day", "All of these")

print(vectorizer.fit_transform(X))

it gives me the vectors I want.

(0, 8)  1
(0, 2)  1
(0, 11) 1
(0, 3)  1
(0, 6)  1
(0, 4)  1
(0, 5)  1
(1, 1)  1
(1, 9)  1
(1, 12) 1
(2, 10) 1
(2, 7)  1
(2, 0)  1

She*_*xed 10

Take a look at the documentation. It says that CountVectorizer.fit_transform expects an iterable of strings (e.g. a list of strings). You are passing a single string.

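This makes sense, because fit_transform in scikit-learn does two things: 1) it learns the model (fit), and 2) it applies the model to the data (transform). You want to build a matrix whose columns are all the words in the vocabulary and whose rows correspond to the documents. To do that, the vectorizer needs to see the entire vocabulary of your corpus (all the columns) at once.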
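As a minimal sketch of that idea, assuming the five phrases are already held in memory: the list name `documents` is made up here and stands in for the lines read from the file. All documents are collected first, and fit_transform is called once so that the learned vocabulary covers the whole corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical in-memory corpus standing in for the lines of the text file.
documents = [
    "It is not good to eat pizza after midnight",
    "I wouldn't survive a day withouth stackexchange",
    "All of these are just random phrases",
    "To prove if my experiment works.",
    "The red dog jumps over the lazy fox",
]

vectorizer = CountVectorizer()

# One call on the whole list: fit learns the vocabulary from every document,
# transform then builds one row per document over that shared vocabulary.
X = vectorizer.fit_transform(documents)

print(X.shape)                         # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the learned column words
```

Because the vocabulary is shared, column j means the same word in every row, which is what makes the rows comparable.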


小智 7

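This problem occurs when you provide raw data, that is, when you pass a string directly to the extraction function. Instead, set Y = [X] and pass Y as the argument, and it will work correctly. I ran into this problem too.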

  • Change `X = line` to `X = [line]` (2 upvotes)
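A small sketch of the difference, assuming a single hard-coded line in place of one read from the file (the `raised` flag is only there for illustration): passing the bare string raises the ValueError from the question, while wrapping it in a list is accepted.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# Stand-in for one line read from the file.
line = "Article 1. It is not good to eat pizza after midnight"

# A bare string is rejected: fit_transform wants an iterable of documents.
try:
    vectorizer.fit_transform(line)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Wrapping the string in a list makes it a one-document corpus.
X = vectorizer.fit_transform([line])
print(X.shape)  # one row, one column per distinct token in this line
```

Note that calling fit_transform on each line separately relearns the vocabulary every time, so the columns of vectors produced from different lines do not line up with each other.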