Tag: countvectorizer

Vectorizing word combinations in Python

I have a dataset of medical text, and I apply a tf-idf vectorizer to it to compute the tf-idf score of each word, as follows:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer as tf

vect = tf(min_df=60,stop_words='english')

dtm = vect.fit_transform(df) 
l=vect.get_feature_names() 

x=pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())


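So my question is this: when I apply TfidfVectorizer, it splits the text into individual words, for example "pain", "headache", "nausea", and so on. How can I get word combinations in the TfidfVectorizer output, for example "severe pain", "cluster headache", "nausea vomiting"? Thanks.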
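One way this is commonly handled is the `ngram_range` parameter of `TfidfVectorizer`. The sketch below assumes a tiny made-up corpus (`docs` is hypothetical and stands in for the medical text), and it leaves `min_df` at its default because a threshold of 60 would discard everything in a toy example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the medical text column.
docs = [
    "patient reports severe pain and cluster headache",
    "nausea vomiting and severe pain after treatment",
    "cluster headache with nausea vomiting",
]

# ngram_range=(1, 2) keeps single words and adds two-word combinations.
# Stop words are removed before the n-grams are built.
vect = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
dtm = vect.fit_transform(docs)

# vocabulary_ maps every extracted term (unigram or bigram) to its column.
bigrams = sorted(term for term in vect.vocabulary_ if " " in term)
print(bigrams)
```

Setting `ngram_range=(2, 2)` instead would keep only the two-word combinations.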

python nlp tf-idf scikit-learn countvectorizer

Kei*_*thx

2017 08-16

4
votes

1
answer

650
views

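Scala Spark - Split a vector column into separate columns in a Spark DataFrame

I have a Spark DataFrame with a column of Vector values. The vectors are all n-dimensional, that is, they all have the same length. I also have a list of column names, Array("f1", "f2", "f3", ..., "fn"), each corresponding to one element of the vector.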

some_columns... | Features
      ...       | [0,1,0,..., 0]

to

some_columns... | f1 | f2 | f3 | ... | fn

      ...       | 0  | 1  | 0  | ... | 0


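What is the best way to achieve this? One approach I thought of is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join it with the old DataFrame, but that requires the Spark context in order to use createDataFrame. I only want to transform the existing DataFrame. I also know about .withColumn("fi", value), but what do I do if n is large?

I am new to Scala and Spark and have not found any good examples. I would think this is a common task. In my particular case I used CountVectorizer and want to recover each column individually for better readability, instead of only having the vector result.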

scala dataframe apache-spark countvectorizer

Log*_*ang

2018 04-19

4
votes

1
answer

3635
views

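get_feature_names not found in CountVectorizer()

I am mining a Stack Overflow data dump for posts about deep learning libraries. I want to identify the stop words in the corpus (for example "python"). I want to get my feature names so that I can identify the words with the highest term frequency.

I create the documents and the corpus as follows: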

import csv
from sklearn.feature_extraction.text import CountVectorizer

with open("StackOverflow_2018_Data.csv") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    pytorch_doc = ''
    tensorflow_doc = ''
    cotag_list = []
    keras_doc = ''
    counte = 0
    for row in csv_reader:
        if row[2] == 'tensorflow':
            tensorflow_doc += row[3] + ' '
        if row[2] == 'keras':
            keras_doc += row[3] + ' '
        if row[2] == 'pytorch':
            pytorch_doc += row[3] + ' '

corpus = [pytorch_doc, tensorflow_doc, keras_doc]
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)
print(x)
x.toarray()
Dict = …
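As a rough sketch of how the feature names and their total counts could be recovered: recent scikit-learn releases expose `get_feature_names_out()` (the older `get_feature_names()` was removed in version 1.2), but the fitted `vocabulary_` mapping works regardless of version, so the example below uses that. The three short documents are made up and only stand in for the per-library documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for pytorch_doc, tensorflow_doc and keras_doc.
corpus = [
    "python tensor graph python session",
    "python layer model python",
    "python tensor autograd",
]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)

# vocabulary_ maps term -> column index; sort by index to get names per column.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Sum each column to get the total frequency of every term in the corpus.
totals = np.asarray(x.sum(axis=0)).ravel()

# The three most frequent terms are candidates for corpus-specific stop words.
top = [(terms[i], int(totals[i])) for i in np.argsort(-totals)[:3]]
print(top)
```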


python pandas sklearn-pandas countvectorizer

mad*_*die

2019 04-05

4
votes

1
answer

20k
views

Computing text similarity between lists using CountVectorizer and TfidfVectorizer

I want to see the similarity between lists using TfidfVectorizer and CountVectorizer.

I have lists like the following:

list1 = [['i','love','machine','learning','its','awesome'],
         ['i', 'love', 'coding', 'in', 'python'],
         ['i', 'love', 'building', 'chatbots']]
list2 = ['i', 'love', 'chatbots']


I want to see the similarity between list1[0] and list2, list1[1] and list2, and list1[2] and list2.

The expected output should look like [0.99, 0.67, 0.54].
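A minimal sketch of one way to do this, assuming cosine similarity is the measure wanted (the question does not name one). Because the inputs are already tokenized, the vectorizers are given a pass-through analyzer; `identity` and `similarities` are helper names introduced here for illustration. The scores this produces will not match the illustrative numbers in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

list1 = [['i', 'love', 'machine', 'learning', 'its', 'awesome'],
         ['i', 'love', 'coding', 'in', 'python'],
         ['i', 'love', 'building', 'chatbots']]
list2 = ['i', 'love', 'chatbots']


def identity(tokens):
    # The documents are already lists of tokens, so use them unchanged.
    return tokens


def similarities(vectorizer_cls):
    # Fit on list1, then project list2 into the same vocabulary.
    vec = vectorizer_cls(analyzer=identity)
    m1 = vec.fit_transform(list1)
    m2 = vec.transform([list2])
    # One cosine score per row of list1.
    return cosine_similarity(m1, m2).ravel()


count_sims = similarities(CountVectorizer)
tfidf_sims = similarities(TfidfVectorizer)
print(count_sims, tfidf_sims)
```

Passing a callable analyzer also keeps one-letter tokens such as 'i', which the default word tokenizer would drop.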

python gensim scikit-learn countvectorizer tfidfvectorizer

Pra*_*een

2020 06-19

4
votes

1
answer

2318
views

Empty vocabulary for single letters in CountVectorizer

I am trying to convert a string into a numeric vector:

import re
from sklearn.feature_extraction.text import CountVectorizer

### Clean the string
def names_to_words(names):
    print('a')
    words = re.sub("[^a-zA-Z]"," ",names).lower().split()
    print('b')

    return words


### Vectorization
def Vectorizer():
    Vectorizer= CountVectorizer(
                analyzer = "word",  
                tokenizer = None,  
                preprocessor = None, 
                stop_words = None,  
                max_features = 5000)
    return Vectorizer  


### Test a string
s = 'abc...'
r = names_to_words(s)
feature = Vectorizer().fit_transform(r).toarray()


But when r is:

 ['g', 'o', 'm', 'd']


I get an error:

ValueError: empty vocabulary; perhaps the documents only contain stop words


There seems to be a problem with single-letter strings like these. What should I do? Thanks.
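The likely cause is that the default `token_pattern` of `CountVectorizer` only matches tokens of two or more word characters, so every single-letter document yields no tokens. A small sketch of overriding the pattern so that one-character tokens are kept:

```python
from sklearn.feature_extraction.text import CountVectorizer

letters = ['g', 'o', 'm', 'd']

# r"(?u)\b\w+\b" accepts tokens of one or more word characters,
# unlike the default pattern, which requires at least two.
vectorizer = CountVectorizer(analyzer="word", token_pattern=r"(?u)\b\w+\b")
features = vectorizer.fit_transform(letters).toarray()

print(sorted(vectorizer.vocabulary_))
print(features)
```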

python nlp vectorization feature-extraction countvectorizer

Loo*_*ast

lucky-day

3
votes

1
answer

2051
views

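sklearn model data transform error: CountVectorizer - Vocabulary wasn't fitted

I have trained a topic classification model. When I then try to convert new data into vectors for prediction, it fails with "NotFittedError: CountVectorizer - Vocabulary wasn't fitted". However, prediction works when I split off part of the training data as test data inside the training script. Here is the code: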

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.feature_extraction.text import CountVectorizer

import pandas as pd
import numpy as np

# read new dataset
testdf = pd.read_csv('C://Users/KW198/Documents/topic_model/training_data/testdata.csv', encoding='cp950')

testdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 2 columns):
keywords    1800 non-null object
topics      1800 non-null int64
dtypes: int64(1), object(1)
memory usage: 28.2+ KB

# read columns
kw = testdf['keywords']
label = testdf['topics']

# convert the keywords to vectors
vectorizer = CountVectorizer(min_df=1, stop_words='english')
x_testkw_vec = vectorizer.transform(kw)


This is the error:

--------------------------------------------------------------------------- …
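The error arises because the code above builds a brand-new `CountVectorizer` and calls `transform` on it without ever calling `fit`. A sketch of the usual remedy, assuming the vectorizer fitted at training time can be saved and reloaded: the keyword strings and the file name `vectorizer.joblib` are made up for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and new data.
train_keywords = ["stock market rally", "football match result", "market crash fears"]
new_keywords = ["football market news"]

# Training time: fit the vectorizer once and persist it next to the model.
vectorizer = CountVectorizer(min_df=1, stop_words="english")
x_train = vectorizer.fit_transform(train_keywords)
path = os.path.join(tempfile.mkdtemp(), "vectorizer.joblib")
joblib.dump(vectorizer, path)

# Prediction time: reload the fitted vectorizer and only call transform,
# so new data is mapped onto the vocabulary learned during training.
loaded = joblib.load(path)
x_new = loaded.transform(new_keywords)
print(x_new.shape)
```

Words that never appeared in the training data (here "news") are simply ignored by `transform`.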


python machine-learning scikit-learn text-classification countvectorizer

Ken*_*ieh

2018 03-29

3
votes

1
answer

3598
views

Using CountVectorizer in Python to count word occurrences for my own vocabulary

Doc1: ['And that was the fallacy. Once I was free to talk with staff members']

Doc2: ['In the new, stripped-down, every-job-counts business climate, these human']

Doc3 : ['Another reality makes emotional intelligence ever more crucial']

Doc4: ['The globalization of the workforce puts a particular premium on emotional']

Doc5: ['As business changes, so do the traits needed to excel. Data tracking']


Here is a sample of my vocabulary:

my_vocabulary = ['was the fallacy', 'free to', 'stripped-down', 'ever more', 'of the workforce', 'the traits needed']


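The point is that every entry in my vocabulary is a bigram or a trigram. My vocabulary contains all possible bigrams and trigrams in my document set; I am only showing a sample here. This is how the vocabulary has to be for my application. I am trying to use CountVectorizer as follows: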

from …
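A sketch of one possible setup, under two assumptions that are not in the question: `ngram_range=(1, 3)` so that bigrams and trigrams are generated, and a custom `token_pattern` that keeps hyphenated words together, since the default pattern would split "stripped-down" into two tokens and the vocabulary entry would never match.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "And that was the fallacy. Once I was free to talk with staff members",
    "In the new, stripped-down, every-job-counts business climate, these human",
    "Another reality makes emotional intelligence ever more crucial",
    "The globalization of the workforce puts a particular premium on emotional",
    "As business changes, so do the traits needed to excel. Data tracking",
]

my_vocabulary = ['was the fallacy', 'free to', 'stripped-down',
                 'ever more', 'of the workforce', 'the traits needed']

vectorizer = CountVectorizer(
    vocabulary=my_vocabulary,            # count only these terms, in this column order
    ngram_range=(1, 3),                  # build 1-, 2- and 3-word sequences
    token_pattern=r"(?u)\b\w[\w-]*\w\b", # assumed pattern: hyphens allowed inside a token
)
counts = vectorizer.fit_transform(docs).toarray()
print(counts)
```

Each row is a document and each column follows the order of `my_vocabulary`.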


python countvectorizer

nig*_*ain

2018 04-03

3
votes

1
answer

6400
views

Lemmatization on CountVectorizer doesn't remove stop words

I am trying to add lemmatization to CountVectorizer from scikit-learn, as follows:

import nltk
from pattern.es import lemma
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer(object):
    def __call__(self, text):
        return [lemma(t) for t in word_tokenize(text)]

vectorizer = CountVectorizer(stop_words=stopwords.words('spanish'),tokenizer=LemmaTokenizer())

sentence = ["EVOLUCIÓN de los sucesos y la EXPANSIÓN, ellos juegan y yo les dije lo que hago","hola, qué tal vas?"]

vectorizer.fit_transform(sentence)


This is the output:

[u',', u'?', u'car', u'decir', u'der', u'evoluci\xf3n', u'expansi\xf3n', u'hacer', u'holar', u'ir', u'jugar', u'lar', u'ler', u'sucesos', u'tal', u'yar']


Update

These are the stop words that show up, already lemmatized:

u'lar', u'ler', u'der'

It lemmatizes all the words and does not remove the stop words. So, any ideas?
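The behaviour follows from the order of operations: `CountVectorizer` applies `stop_words` to the tokens returned by the custom tokenizer, so by then "los" has already become "lar" and no longer matches the list. One possible workaround is to filter stop words inside the tokenizer, before lemmatizing. The sketch below is self-contained, so it uses a hypothetical mini stop list and a toy lemma table in place of `stopwords.words('spanish')` and `pattern.es.lemma`.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the NLTK stop list and the pattern.es lemmatizer.
STOP_WORDS = {"de", "los", "y", "la", "les", "lo", "que", "yo", "ellos"}
TOY_LEMMAS = {"juegan": "jugar", "dije": "decir", "hago": "hacer"}


def toy_lemma(token):
    # Return the lemma if the toy table knows the token, else the token itself.
    return TOY_LEMMAS.get(token, token)


class StopwordAwareLemmaTokenizer:
    def __init__(self, stop_words):
        self.stop_words = set(stop_words)

    def __call__(self, text):
        tokens = re.findall(r"\w+", text.lower())
        # Drop stop words first, then lemmatize what is left.
        return [toy_lemma(t) for t in tokens if t not in self.stop_words]


# token_pattern=None because the custom tokenizer replaces the default pattern.
vectorizer = CountVectorizer(
    tokenizer=StopwordAwareLemmaTokenizer(STOP_WORDS), token_pattern=None
)

sentence = ["EVOLUCIÓN de los sucesos y la EXPANSIÓN, ellos juegan y yo les dije lo que hago",
            "hola, qué tal vas?"]
vectorizer.fit_transform(sentence)
print(sorted(vectorizer.vocabulary_))
```

Using a word-character regex for tokenizing also keeps punctuation such as "," and "?" out of the vocabulary.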

nltk stop-words lemmatization scikit-learn countvectorizer

Amb*_*us9

2018 08-25

3
votes

1
answer

1946
views

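How to subclass a vectorizer in scikit-learn without repeating all the parameters in the constructor

I am trying to create a custom vectorizer by subclassing CountVectorizer. The vectorizer stems all the words in a sentence before counting word frequencies. I then use this vectorizer in a pipeline, and it works fine when I call pipeline.fit(X,y).

However, when I try to set a parameter with pipeline.set_params(rf__verbose=1).fit(X,y), I get the following error: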

RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). <class 'features.extraction.labels.StemmedCountVectorizer'> with constructor (self, *args, **kwargs) doesn't  follow this convention.


Here is my custom vectorizer:

class StemmedCountVectorizer(CountVectorizer):
    def __init__(self, *args, **kwargs):
        self.stemmer = SnowballStemmer("english", ignore_stopwords=True)
        super(StemmedCountVectorizer, self).__init__(*args, **kwargs)

    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([' '.join([self.stemmer.stem(w) for w in word_tokenize(word)]) for word in analyzer(doc)])


I know I could set every parameter of the CountVectorizer class explicitly, but that does not seem to follow the DRY principle.

Thanks for your help!
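One way around the error, sketched here under assumptions, is to not override `__init__` at all: the subclass then inherits the explicit signature of `CountVectorizer`, and `get_params`/`set_params` keep working. To stay self-contained, the example uses a hypothetical `toy_stem` function instead of NLTK's `SnowballStemmer`; a real stemmer could be created inside `build_analyzer` in the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


class StemmedCountVectorizer(CountVectorizer):
    # No __init__ override, so the parent's explicit parameter list is reused.
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        # Stem every token produced by the standard analyzer.
        return lambda doc: [toy_stem(token) for token in analyzer(doc)]


vec = StemmedCountVectorizer(stop_words="english")
X = vec.fit_transform(["cats running", "cat runs"])

# Parameter introspection works because the signature has no *args/**kwargs.
params = vec.get_params()
vec.set_params(min_df=1)
print(sorted(vec.vocabulary_))
```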

python subclass python-3.x scikit-learn countvectorizer

nbe*_*hat

2018 07-20

3
votes

1
answer

928
views

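PySpark: updating values in a feature vector

I am building a text classifier and using Spark's CountVectorizer to create the feature vectors.

Now, to use this vector with the BigDL library, I need to convert all the 0s in the feature vector to 1.

Here is my feature vector, which is a sparse vector: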

vectorizer_df.select('features').show(2)
+--------------------+
|            features|
+--------------------+
|(1000,[4,6,11,13,...|
|(1000,[0,1,2,3,4,...|
+--------------------+
only showing top 2 rows


I am trying to update the values as shown below. First, I convert the sparse vector to a dense vector:

from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.sql.types import ArrayType, FloatType
from pyspark.sql.functions import udf

update_vector = udf(lambda vector: Vectors.dense(vector), VectorUDT())


df = vectorizer_df.withColumn('features',update_vector(vectorizer_df.features))

df.select('features').show(2)
+--------------------+
|            features|
+--------------------+
|[0.0,0.0,0.0,0.0,...|
|[5571.0,4688.0,24...|
+--------------------+
only showing top 2 rows


Once I have the dense vector, I try to add 1 to every element:

def add1(x):
    return x+1
def array_for(x):
    return np.array([add1(xi) for xi in x])

add_udf_one = udf(lambda z: array_for(z), …


feature-selection apache-spark pyspark countvectorizer

Pra*_*een

lucky-day

3
votes

1
answer

1810
views

Tag statistics

countvectorizer ×10

python ×7

scikit-learn ×5

apache-spark ×2

nlp ×2

dataframe ×1

feature-extraction ×1

feature-selection ×1

gensim ×1

lemmatization ×1

machine-learning ×1

nltk ×1

pandas ×1

pyspark ×1

python-3.x ×1

scala ×1

sklearn-pandas ×1

stop-words ×1

subclass ×1

text-classification ×1

tf-idf ×1

tfidfvectorizer ×1

vectorization ×1
