Tag: countvectorizer

Sklearn: adding a lemmatizer to CountVectorizer

I added lemmatization to my CountVectorizer, as explained on the Sklearn page.

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer,
                       strip_accents = 'unicode',
                       stop_words = 'english',
                       lowercase = True,
                       token_pattern = r'\b[a-zA-Z]{3,}\b', # keeps words of 3 or more characters
                       max_df = 0.5,
                       min_df = 10)

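However, when I create the DTM with fit_transform, I get the error below (which makes little sense to me). Before I added lemmatization to my vectorizer, the DTM code always worked. I dug into the manual and tried a few changes to the code, but could not find a solution.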

dtm_tf = tf_vectorizer.fit_transform(articles)

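Update:

Following @MaxU's suggestion below, the code runs without errors, but numbers and punctuation are not removed from my output. I ran separate tests to see which of the other parameters still work after adding LemmaTokenizer() and which do not. The results: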

strip_accents = 'unicode', # works
stop_words = 'english', # works
lowercase …
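
A note on the behaviour described in the update: the tokenizer argument has to be a callable that accepts one document, so an instance (LemmaTokenizer()) is needed rather than the class itself. Once a custom tokenizer is supplied, CountVectorizer no longer applies token_pattern, so digits and punctuation have to be filtered inside the tokenizer. A minimal sketch under stated assumptions: FilteringTokenizer and its lemmatize argument are hypothetical names, and an identity function stands in for WordNetLemmatizer().lemmatize so the example runs without NLTK data.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


class FilteringTokenizer:
    """Callable tokenizer that keeps alphabetic tokens of 3+ characters.

    `lemmatize` is a hypothetical hook: pass WordNetLemmatizer().lemmatize
    here in real code. The default is an identity stand-in.
    """

    def __init__(self, lemmatize=lambda token: token):
        self.lemmatize = lemmatize
        self.pattern = re.compile(r"\b[a-zA-Z]{3,}\b")

    def __call__(self, doc):
        # The regex does the job token_pattern would have done.
        return [self.lemmatize(tok) for tok in self.pattern.findall(doc)]


# Made-up documents, for illustration only.
docs = ["The 3 cats sat, quietly!", "Cats and dogs: 42 of them."]

vectorizer = CountVectorizer(
    tokenizer=FilteringTokenizer(),  # an instance, not the class
    token_pattern=None,              # unused once a tokenizer is given
    lowercase=True,
    stop_words="english",
)
dtm = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))
```

Lowercasing and stop word removal still run around the custom tokenizer; only the token pattern is bypassed.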

python lemmatization scikit-learn countvectorizer

Ren*_*ens, 2017-11-24
Votes: 3, Answers: 1, Views: 6882

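Inverse-transforming word count vectors back to the original documents

I am training a simple text classification model (currently with scikit-learn). To convert my document samples into word count vectors with my own vocabulary, I use

CountVectorizer(vocabulary=myDictionaryWords).fit_transform(myDocumentsAsArrays)

from sklearn.feature_extraction.text.

This works well, and I can then train my classifier with these word count vectors as feature vectors. What I cannot figure out is how to inverse-transform the word count vectors into the original documents. CountVectorizer does have a function inverse_transform(X), but it only returns the unique non-zero tokens.

As far as I can tell, CountVectorizer has no implementation that maps back to the original documents.

Does anyone know how to recover the original sequence of tokens from the count-vectorized representation? Is there a TensorFlow module, or any other module, that can do this?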
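
Why the original order cannot be recovered shows up in a small example: a count vector records only how often each vocabulary term occurs, so two documents with the same words in a different order map to the same vector. A minimal sketch with made-up sentences (docs is an illustrative assumption, not data from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same words, different order.
docs = ["dog bites man", "man bites dog"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
counts = X.toarray()

# Both rows are identical, so the order is lost at this point.
identical_rows = bool((counts[0] == counts[1]).all())

# inverse_transform returns the terms with non-zero counts per document,
# without order or multiplicity.
recovered = vec.inverse_transform(X)
print(identical_rows, recovered)
```

If the token sequence matters later on, the usual workaround is to keep the raw documents next to the matrix, since row i of X corresponds to document i.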

nlp tf-idf scikit-learn tensorflow countvectorizer

ant*_*hka
Votes: 2, Answers: 1, Views: 2646

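Is CountVectorizer giving the wrong word counts?

Suppose my text file contains the following text:

The quick brown fox jumped over the lazy dogs. A stitch in time saves nine. The quick brown stitch jumped over the lazy time. A fox in time saves a dog.

I want to use sk-learn's CountVectorizer to get a word count for every word in the file. (I know there are other ways to do this, but for various reasons I want to use CountVectorizer.) Here is my code: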

from sklearn.feature_extraction.text import CountVectorizer

path = input('Please enter the filepath for the text: ')
tokens = CountVectorizer(analyzer = 'word', stop_words = 'english')

# Each line of the file is treated as one document.
with open(path, 'r', encoding = 'utf-8') as text:
    X = tokens.fit_transform(text)
dictionary = tokens.vocabulary_

Except that when I look at dictionary, it gives me the wrong counts:

>>> dictionary
{'time': 9, 'dog': 1, 'stitch': 8, 'quick': 6, 'lazy': 5, 'brown': 0, 'saves': 7, 'jumped': 4, 'fox': 3, 'dogs': 2}

Can someone point out the (no doubt obvious) mistake I am making here?
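
One thing worth checking here: vocabulary_ maps each term to its column index in the matrix returned by fit_transform, not to a count, which is why the values in the output above simply run from 0 to 9 in alphabetical order. The counts live in the matrix itself. A minimal sketch, assuming two made-up lines in place of the file (lines and counts are illustrative names):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-in for the lines of the text file.
lines = [
    "the quick brown fox jumped over the lazy dog",
    "a fox in time saves a dog",
]

vec = CountVectorizer(analyzer="word", stop_words="english")
X = vec.fit_transform(lines)

# Sum each column over all documents to get the total count per term.
totals = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ supplies the term -> column index mapping.
counts = {term: int(totals[idx]) for term, idx in vec.vocabulary_.items()}
print(counts)
```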

python nlp nltk scikit-learn countvectorizer

Lod*_*e66
Votes: 2, Answers: 1, Views: 1378

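Dimension mismatch error with CountVectorizer and MultinomialNB

Before asking this question, I should say that I have thoroughly read more than 15 similar threads on this board. Each offers different advice, but none of it got my code working.

OK, so I split the text data of my corpus (originally in csv format) into a training set and a test set, and use CountVectorizer and its fit_transform function to fit the vocabulary of the corpus and extract word count features from the text. I then apply MultinomialNB() to learn from the training set and predict on the test set. Here is my code (simplified):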

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# loading data
# data contains two columns ('text', 'target')

spam_data = pd.read_csv('spam.csv')
spam_data['target'] = np.where(spam_data['target']=='spam',1,0)

# split data
X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], spam_data['target'], random_state=0)

# fit vocabulary and extract word count features
cv = CountVectorizer()
X_traincv = cv.fit_transform(X_train)
X_testcv = cv.fit_transform(X_test)

# learn and predict using MultinomialNB
clfNB = MultinomialNB(alpha=0.1)
clfNB.fit(X_traincv, y_train)

# so far so good, but when I predict on …
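
A likely cause, judging from the code above: cv.fit_transform(X_test) refits the vocabulary on the test set, so the test matrix ends up with a different number of columns than the matrix the classifier was trained on. Calling transform on the test set reuses the vocabulary learned from the training set. A minimal sketch with a tiny made-up corpus in place of spam.csv (all texts and labels are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up data standing in for the 'text' and 'target' columns.
train_texts = ["win cash now", "meeting at noon", "free cash prize", "lunch at noon"]
train_labels = [1, 0, 1, 0]
test_texts = ["free prize now", "noon meeting"]

cv = CountVectorizer()
X_traincv = cv.fit_transform(train_texts)  # learn the vocabulary here only
X_testcv = cv.transform(test_texts)        # reuse it, do not refit

clfNB = MultinomialNB(alpha=0.1)
clfNB.fit(X_traincv, train_labels)
predictions = clfNB.predict(X_testcv)
print(X_traincv.shape, X_testcv.shape, predictions)
```

Words that appear only in the test set are silently dropped by transform, which is the intended behaviour for a model that never saw them during training.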

python naivebayes train-test-split countvectorizer

Chr*_* T., 2019-03-01
Votes: 2, Answers: 1, Views: 2443

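raise ValueError("np.nan is an invalid document, expected byte or"

I am using CountVectorizer in scikit-learn to vectorize a sequence of features. I got stuck when it raised the following error: ValueError: np.nan is an invalid document, expected byte or unicode string.

I am using a sample csv dataset with two columns, CONTENT and sentiment. My code is as follows: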

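import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv",encoding = "ISO-8859-1")
X, y = df.CONTENT, df.sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train, y_train)

vect = CountVectorizer(ngram_range=(1,3), analyzer='word', encoding = "ISO-8859-1")
print(vect)
X=vect.fit_transform(X_train, y_train)
# transform (not fit) the test set so it reuses the fitted vocabulary
X_test_counts = vect.transform(X_test)
print(vect.get_feature_names())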

The error I get is:

File "C:/Users/HP/cntVect.py", line 28, in <module>
    X=vect.fit_transform(X_train, y_train)

  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
    self.fixed_vocabulary_)

  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 762, in _count_vocab
    for feature in analyze(doc):

  File …
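
This error is raised when one of the documents handed to the vectorizer is np.nan, which is how pandas represents an empty cell in a text column. Dropping those rows (or replacing them with an empty string) before vectorizing avoids it. A minimal sketch, assuming a small made-up DataFrame in place of train.csv:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up data with one missing CONTENT value.
df = pd.DataFrame({
    "CONTENT": ["good movie", np.nan, "bad plot"],
    "sentiment": [1, 0, 0],
})

# Option 1: drop rows whose text is missing (labels stay aligned).
clean = df.dropna(subset=["CONTENT"])

# Option 2 (alternative): keep the rows but use an empty document.
filled = df["CONTENT"].fillna("")

vect = CountVectorizer(ngram_range=(1, 3), analyzer="word")
X_counts = vect.fit_transform(clean["CONTENT"])
print(X_counts.shape, len(filled))
```

Dropping from the DataFrame before the train/test split keeps the features and the sentiment labels the same length.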

python pandas scikit-learn countvectorizer

Sad*_*ngh, 2018-03-13
Votes: 2, Answers: 1, Views: 9429

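CountVectorizer output used as input to TfidfTransformer vs. TfidfTransformer()

Recently I started reading more about NLP and following Python tutorials to learn more about the topic. While working through one of them, I noticed that they use the sparse matrix of word counts per tweet (created with CountVectorizer) as input to TfidfTransformer, which processes the data and feeds it to a classifier for training and prediction.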

pipeline = Pipeline([
     ('vect', CountVectorizer()),
     ('tfidf', TfidfTransformer()),
     ('clf', LogisticRegression())
])

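Since no explanation was given, I cannot follow the reasoning behind this... Isn't this just a plain bag of words? Couldn't this be done with only one of the functions, for example Tfidf?

Any clarification would be appreciated.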
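
For context, scikit-learn documents TfidfVectorizer as equivalent to CountVectorizer followed by TfidfTransformer: the first step counts terms, the second reweights those counts by inverse document frequency. A minimal sketch comparing the two routes with default parameters, on made-up tweets (tweets is an illustrative assumption):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Made-up tweets, for illustration only.
tweets = [
    "great flight great crew",
    "delayed flight again",
    "crew was friendly",
]

# Two steps: raw counts, then tf-idf weighting of those counts.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(tweets)

# One step: both stages combined in a single estimator.
one_step = TfidfVectorizer().fit_transform(tweets)

same_result = bool(np.allclose(two_step.toarray(), one_step.toarray()))
print(same_result)
```

Keeping the steps separate is mostly useful when the raw counts are needed as well, or when the tf-idf step should be switched on and off during a parameter search.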

python pipeline scikit-learn countvectorizer tfidfvectorizer

pat*_*tri, 2019-02-19
Votes: 2, Answers: 1, Views: 776

Combining CountVectorizer and ngrams in Python

The task is to classify male and female names using ngrams. So there is a dataframe like:

name        is_male
Dorian      1
Jerzy       1
Deane       1
Doti        0
Betteann    0
Donella     0

The specific requirement is to use

from nltk.util import ngrams

for this task, creating ngrams (n=2,3,4).

I made a list of the names and then used ngrams:

from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

test_ngrams = []
for name in name_list:
    test_ngrams.append(list(ngrams(name,3)))

Now I need to vectorize all of this somehow for use in classification. I tried

X_train = count_vect.fit_transform(test_ngrams)

and got:

AttributeError: 'list' object has no attribute 'lower'

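I understand that a list is the wrong input type here. Can someone explain what I should do so that I can later use, for example, MultinomialNB? Am I going about this correctly? Thanks in advance!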
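
CountVectorizer expects an iterable of raw strings, one per document, and by default lowercases each one, which is where the 'lower' error on a list comes from. One way to keep custom ngram logic is to pass the names themselves and supply a callable analyzer that turns each name into its ngrams. A minimal sketch under stated assumptions: char_ngrams is a hypothetical stand-in that yields tuples of consecutive characters (the same shape nltk.util.ngrams produces for a string), used here so the example runs without NLTK; name_analyzer is also an illustrative name.

```python
from sklearn.feature_extraction.text import CountVectorizer


def char_ngrams(name, n):
    """Hypothetical stand-in for nltk.util.ngrams(name, n) on a string."""
    return zip(*(name[i:] for i in range(n)))


def name_analyzer(name):
    """Turn one name into a list of character 2-, 3- and 4-grams."""
    name = name.lower()
    return ["".join(gram) for n in (2, 3, 4) for gram in char_ngrams(name, n)]


name_list = ["Dorian", "Jerzy", "Deane", "Doti", "Betteann", "Donella"]

# The analyzer is called once per raw document (here: once per name).
count_vect = CountVectorizer(analyzer=name_analyzer)
X_train = count_vect.fit_transform(name_list)
print(X_train.shape)
```

X_train is a sparse count matrix with one row per name, so it can be passed to MultinomialNB together with the is_male labels. The built-in analyzer='char' with ngram_range=(2, 4) is an alternative when NLTK is not a hard requirement.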

python nltk scikit-learn countvectorizer

Ale*_*tin, 2017-12-20
Votes: 1, Answers: 1, Views: 5588

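Creating a word dictionary for the sentences in a list

I have a list of sentences

a = [['i am a testing'],['we are working on project']]

I am trying to create a word dictionary for all the sentences in the list. I tried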

vectorizer = CountVectorizer()
vectorizer.fit_transform(a)
coffee_dict2 = vectorizer.vocabulary_

I get an error: AttributeError: 'list' object has no attribute 'lower'

The result I expect is a dictionary

{'i': 1, 'am': 1, 'testing': 2}
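
The error comes from the nesting: each element of a is itself a list, while CountVectorizer expects each document to be a string. Flattening the list first avoids it. Two further details matter for a word-count dictionary: the default token pattern drops single-character words such as 'i' and 'a', and vocabulary_ holds column indices rather than counts. A minimal sketch, assuming a wider token pattern is acceptable (sentences and word_counts are illustrative names):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

a = [['i am a testing'], ['we are working on project']]

# Flatten the nested list into a flat list of strings.
sentences = [sentence for group in a for sentence in group]

# Keep one-letter tokens too; the default pattern needs 2+ characters.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(sentences)

# Total occurrences per term, looked up through the index mapping.
totals = np.asarray(X.sum(axis=0)).ravel()
word_counts = {word: int(totals[idx]) for word, idx in vectorizer.vocabulary_.items()}
print(word_counts)
```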

python nltk pandas scikit-learn countvectorizer

fun*_*unk, 2019-09-17
Votes: 1, Answers: 1, Views: 39

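A more efficient way to get various token count statistics from arrays and lists

I am classifying spam from a list of email texts (stored in csv format), but before that I want to get some simple count statistics from the output. I use sklearn's CountVectorizer as a first step, implemented with the following code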

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

#import data from csv

spam_data = pd.read_csv('spam.csv')
spam_data['target'] = np.where(spam_data['target']=='spam',1,0)

#split data

X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], spam_data['target'], random_state=0)

#convert 'features' to numeric and then to matrix or list
cv = CountVectorizer()
x_traincv = cv.fit_transform(X_train)
a = x_traincv.toarray()
a_list = cv.inverse_transform(a)

The output is stored either as a matrix (named 'a') or as a list of arrays (named 'a_list'), as shown below

[array(['do', 'I', 'off', 'text', 'where', 'you'], 
       dtype='<U32'),
 array(['ages', 'will', 'did', 'driving', 'have', 'hello', 'hi', …
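
The question is cut off here, so which statistics were wanted is not known. As a general illustration only, several per-document and per-term counts can be read straight from the sparse matrix returned by fit_transform, without converting it to a dense array or to a list of arrays first. A minimal sketch on made-up texts (texts and all derived names are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up email texts, for illustration only.
texts = ["where do you text", "hello hello did you have ages", "hi"]

cv = CountVectorizer()
X = cv.fit_transform(texts)

# Total number of tokens in each document (row sums).
tokens_per_doc = np.asarray(X.sum(axis=1)).ravel()

# Number of distinct vocabulary terms in each document.
unique_per_doc = X.getnnz(axis=1)

# Vocabulary terms ordered by column index, and the longest one.
terms = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
longest_term = max(terms, key=len)
print(tokens_per_doc, unique_per_doc, longest_term)
```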

python arrays scikit-learn countvectorizer

Chr*_* T., 2017-08-21
Votes: 0, Answers: 1, Views: 1454

Tag statistics

countvectorizer ×9

python ×8

scikit-learn ×8

nltk ×3

nlp ×2

pandas ×2

arrays ×1

lemmatization ×1

naivebayes ×1

pipeline ×1

tensorflow ×1

tf-idf ×1

tfidfvectorizer ×1

train-test-split ×1
