Understanding TfidfVectorizer for text feature extraction in Python scikit-learn

Question

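Reading the scikit-learn documentation on text feature extraction, I am not sure how the different arguments available for TfidfVectorizer (and possibly other vectorizers) affect the outcome.

Here are the arguments I am unsure about: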

TfidfVectorizer(stop_words='english',  ngram_range=(1, 2), max_df=0.5, min_df=20, use_idf=True)

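The documentation is clear on the use of stop_words/max_df (both have a similar effect, and one may be used instead of the other). However, I am not sure whether these options should be used together with ngrams. Which one is handled first, ngrams or stop_words? Why? Based on my experiment, stop words are removed first, but the purpose of ngrams is to extract phrases and the like. I am not sure about the effect of this sequence (stop words removed, then ngrams built).

Second, does it make sense to use the max_df/min_df arguments together with the use_idf argument? Aren't their purposes similar?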

Answer 1

Jar*_*rad 19

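I see several questions in this post.

How do the different arguments in TfidfVectorizer interact with one another?

You really have to use it quite a bit to develop a sense of intuition (that has been my experience, anyway).

TfidfVectorizer is a bag-of-words approach. In NLP, sequences of words and their window matter; a bag of words destroys some of that context.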
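
To make the loss of context concrete, here is a minimal sketch using two toy sentences that contain the same words in a different order. With the default settings (word unigrams), both should end up with the same vector.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy sentences with the same words in a different order.
docs = ["what is tfidf", "tfidf is what"]

vectorizer = TfidfVectorizer()  # defaults: word analyzer, unigrams only
matrix = vectorizer.fit_transform(docs).toarray()

# Word order is discarded, so both rows carry the same weights.
same_vector = np.allclose(matrix[0], matrix[1])
print(same_vector)
```

Adding bigrams through ngram_range recovers a little of the ordering, at the cost of a larger vocabulary.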

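How do I control what tokens get output?

Set ngram_range to (1,1) to output only one-word tokens, (1,2) for one-word and two-word tokens, (2,3) for two-word and three-word tokens, and so on.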
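
A small sketch of the three settings mentioned above, fitted on a single toy sentence. The helper name vocabulary is made up for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

doc = ["what does tfidf stand for"]  # toy sentence for illustration

def vocabulary(ngram_range):
    """Return the sorted terms learned for the given ngram_range (hypothetical helper)."""
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    vectorizer.fit(doc)
    return sorted(vectorizer.vocabulary_)

unigrams = vocabulary((1, 1))         # single words only
uni_and_bigrams = vocabulary((1, 2))  # single words plus two-word phrases
bi_and_trigrams = vocabulary((2, 3))  # two-word and three-word phrases only

for terms in (unigrams, uni_and_bigrams, bi_and_trigrams):
    print(len(terms), terms)
```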

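ngram_range works hand in hand with analyzer. Set analyzer to "word" to output words and phrases, or set it to "char" to output character ngrams.

If you want your output to have both "word" and "char" features, use sklearn's FeatureUnion. Example here.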
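
As a rough sketch of the FeatureUnion idea, the snippet below stacks a word-level and a character-level vectorizer side by side. The ngram ranges and the step names "word" and "char" are arbitrary choices for this illustration, not recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

docs = ["what is tfidf", "why don't I use tfidf"]  # toy documents

# Step names and ngram ranges are illustrative assumptions.
combined = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
])
matrix = combined.fit_transform(docs)

# Fit each vectorizer on its own to see how wide each half is.
word_width = TfidfVectorizer(analyzer="word", ngram_range=(1, 2)).fit_transform(docs).shape[1]
char_width = TfidfVectorizer(analyzer="char", ngram_range=(2, 3)).fit_transform(docs).shape[1]

print(matrix.shape, word_width, char_width)
```

The combined matrix simply concatenates the columns of the two vectorizers, so its width is the sum of the two individual widths.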

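How do I remove unwanted stuff?

Use stop_words to remove less-meaningful English words.

The stop word list that sklearn uses can be found at:

# Older releases also exposed this list as sklearn.feature_extraction.stop_words; that module has since been removed.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS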

The logic behind removing stop words is that these words don't carry much meaning, and they appear a lot in most text:

[('the', 79808),
 ('of', 40024),
 ('and', 38311),
 ('to', 28765),
 ('in', 22020),
 ('a', 21124),
 ('that', 12512),
 ('he', 12401),
 ('was', 11410),
 ('it', 10681),
 ('his', 10034),
 ('is', 9773),
 ('with', 9739),
 ('as', 8064),
 ('i', 7679),
 ('had', 7383),
 ('for', 6938),
 ('at', 6789),
 ('by', 6735),
 ('on', 6639)]

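Since stop words generally have a high frequency, it might make sense to set max_df to a float such as 0.95, which drops terms that appear in more than 95% of the documents. But then you are assuming that everything above that cutoff is a stop word, which might not be the case. It really depends on your text data. In my line of work, it is very common that the top words or phrases are NOT stop words, because I work with dense text (search query data) on very specific topics.

Use min_df as an integer to remove rarely occurring words. If they only occur once or twice, they won't add much value and are usually really obscure. Furthermore, there are generally a lot of them, so ignoring them with, say, min_df=5 can greatly reduce your memory consumption and data size.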
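
Here is a small sketch of how the two cutoffs prune the vocabulary. The four toy documents and the thresholds (max_df=0.75, min_df=2) are assumptions picked so the effect is visible on a tiny corpus; real values depend on your data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "tfidf" is in all 4 documents, "terms" in 3, every other word in 1.
docs = [
    "tfidf weights terms",
    "tfidf ignores rare terms",
    "tfidf penalises common terms",
    "tfidf is simple",
]

def learned_terms(**kwargs):
    """Fit a vectorizer with the given options and return its sorted vocabulary."""
    return sorted(TfidfVectorizer(**kwargs).fit(docs).vocabulary_)

full = learned_terms()                          # no pruning
capped = learned_terms(max_df=0.75)             # drop terms in more than 75% of documents
floored = learned_terms(min_df=2)               # drop terms in fewer than 2 documents
both = learned_terms(max_df=0.75, min_df=2)     # apply both cutoffs

print(len(full), len(capped), floored, both)
```

A float is read as a proportion of documents and an integer as an absolute document count, for both parameters.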

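How do I include stuff that's being stripped out?

token_pattern defaults to the regex pattern \b\w\w+\b, which means tokens have to be at least 2 characters long, so words like "I" and "a" are removed, as are single-digit numbers like 0 to 9. You'll also notice that it splits words at apostrophes and drops the apostrophe.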
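
The effect of token_pattern can be inspected with build_analyzer(), which returns the function the vectorizer uses to turn a string into tokens. In this sketch the apostrophe-friendly pattern is the one suggested in this answer, while the single-character pattern r"\b\w+\b" is only an assumption added for comparison.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = "Why don't I use tfidf 1 in 10 times"  # toy sentence

# Default pattern: tokens need at least two word characters.
default_tokens = TfidfVectorizer().build_analyzer()(text)

# Pattern that keeps apostrophes inside a token.
apostrophe_tokens = TfidfVectorizer(
    token_pattern=r"\b\w[\w']+\b"
).build_analyzer()(text)

# Assumed pattern for illustration: also keep one-character tokens.
single_char_tokens = TfidfVectorizer(
    token_pattern=r"\b\w+\b"
).build_analyzer()(text)

print(default_tokens)
print(apostrophe_tokens)
print(single_char_tokens)
```

Note that the analyzer lowercases the text before tokenizing, so "I" shows up as "i" when it is kept.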

What happens first, ngram generation or stop word removal?

Let's do a little test.

import numpy as np
import pandas as pd

# ENGLISH_STOP_WORDS now lives in the text module (the old stop_words module was removed).
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, ENGLISH_STOP_WORDS

docs = np.array(['what is tfidf',
        'what does tfidf stand for',
        'what is tfidf and what does it stand for',
        'tfidf is what',
        "why don't I use tfidf",
        '1 in 10 people use tfidf'])

# No idf weighting and no normalisation, so the matrix holds raw counts.
tfidf = TfidfVectorizer(use_idf=False, norm=None, ngram_range=(1, 1))
matrix = tfidf.fit_transform(docs).toarray()

# get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0 and later.
df = pd.DataFrame(matrix, index=docs, columns=tfidf.get_feature_names_out())

# For comparison: strip stop words by hand with a plain whitespace split.
for doc in docs:
    print(' '.join(word for word in doc.split() if word not in ENGLISH_STOP_WORDS))

This prints out:

tfidf
does tfidf stand
tfidf does stand
tfidf
don't I use tfidf
1 10 people use tfidf

Now let's print df:

                                           10  and  does  don  for   in   is  \
what is tfidf                             0.0  0.0   0.0  0.0  0.0  0.0  1.0   
what does tfidf stand for                 0.0  0.0   1.0  0.0  1.0  0.0  0.0   
what is tfidf and what does it stand for  0.0  1.0   1.0  0.0  1.0  0.0  1.0   
tfidf is what                             0.0  0.0   0.0  0.0  0.0  0.0  1.0   
why don't I use tfidf                     0.0  0.0   0.0  1.0  0.0  0.0  0.0   
1 in 10 people use tfidf                  1.0  0.0   0.0  0.0  0.0  1.0  0.0   

                                           it  people  stand  tfidf  use  \
what is tfidf                             0.0     0.0    0.0    1.0  0.0   
what does tfidf stand for                 0.0     0.0    1.0    1.0  0.0   
what is tfidf and what does it stand for  1.0     0.0    1.0    1.0  0.0   
tfidf is what                             0.0     0.0    0.0    1.0  0.0   
why don't I use tfidf                     0.0     0.0    0.0    1.0  1.0   
1 in 10 people use tfidf                  0.0     1.0    0.0    1.0  1.0   

                                          what  why  
what is tfidf                              1.0  0.0  
what does tfidf stand for                  1.0  0.0  
what is tfidf and what does it stand for   2.0  0.0  
tfidf is what                              1.0  0.0  
why don't I use tfidf                      0.0  1.0  
1 in 10 people use tfidf                   0.0  0.0

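Notes:

use_idf=False, norm=None: when these are set, it is equivalent to using sklearn's CountVectorizer. It will just return counts.
Notice that the word "don't" was converted to "don". This is where you would change token_pattern to something like token_pattern=r"\b\w[\w']+\b" to include apostrophes.
We see a lot of stop words.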
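
The first note can be checked directly. This is a minimal sketch on three of the test documents, comparing the two vectorizers value by value.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["what is tfidf",
        "what is tfidf and what does it stand for",
        "1 in 10 people use tfidf"]

# With idf weighting and normalisation switched off, only term counts remain.
tfidf_counts = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs).toarray()
plain_counts = CountVectorizer().fit_transform(docs).toarray()

# Values match; only the dtype differs (floats versus integers).
same_values = np.array_equal(tfidf_counts, plain_counts)
print(same_values, tfidf_counts.dtype, plain_counts.dtype)
```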

Let's remove stop words and look at df again:

tfidf = TfidfVectorizer(use_idf=False, norm=None, stop_words='english', ngram_range=(1, 2))

Outputs:

                                           10  10 people  does  does stand  \
what is tfidf                             0.0        0.0   0.0         0.0   
what does tfidf stand for                 0.0        0.0   1.0         0.0   
what is tfidf and what does it stand for  0.0        0.0   1.0         1.0   
tfidf is what                             0.0        0.0   0.0         0.0   
why don't I use tfidf                     0.0        0.0   0.0         0.0   
1 in 10 people use tfidf                  1.0        1.0   0.0         0.0   

                                          does tfidf  don  don use  people  \
what is tfidf                                    0.0  0.0      0.0     0.0   
what does tfidf stand for                        1.0  0.0      0.0     0.0   
what is tfidf and what does it stand for         0.0  0.0      0.0     0.0   
tfidf is what                                    0.0  0.0      0.0     0.0   
why don't I use tfidf                            0.0  1.0      1.0     0.0   
1 in 10 people use tfidf                         0.0  0.0      0.0     1.0   

                                          people use  stand  tfidf  \
what is tfidf                                    0.0    0.0    1.0   
what does tfidf stand for                        0.0    1.0    1.0   
what is tfidf and what does it stand for         0.0    1.0    1.0   
tfidf is what                                    0.0    0.0    1.0   
why don't I use tfidf                            0.0    0.0    1.0   
1 in 10 people use tfidf                         1.0    0.0    1.0   

                                          tfidf does  tfidf stand  use  \
what is tfidf                                    0.0          0.0  0.0   
what does tfidf stand for                        0.0          1.0  0.0   
what is tfidf and what does it stand for         1.0          0.0  0.0   
tfidf is what                                    0.0          0.0  0.0   
why don't I use tfidf                            0.0          0.0  1.0   
1 in 10 people use tfidf                         0.0          0.0  1.0   

                                          use tfidf  
what is tfidf                                   0.0  
what does tfidf stand for                       0.0  
what is tfidf and what does it stand for        0.0  
tfidf is what                                   0.0  
why don't I use tfidf                           1.0  
1 in 10 people use tfidf                        1.0

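Takeaways:

The token "don use" appears because in don't I use the 't was stripped off, and I was removed for being shorter than two characters, so the remaining words were joined into don use... which was not a phrase in the original text and could change the meaning a bit!
Answer: short tokens are dropped during tokenization, stop words are removed, and only then are ngrams generated from what is left, which can return unexpected results.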
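
The same conclusion can be reproduced without building a full matrix. This sketch calls the analyzer returned by build_analyzer() on two of the test sentences and lists the tokens it produces.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same settings as the second experiment above.
analyzer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2)).build_analyzer()

tokens = analyzer("why don't I use tfidf")
print(tokens)

# "what", "is", "and", "it" and "for" are dropped before bigrams are formed,
# so words that were never adjacent end up paired together.
more_tokens = analyzer("what is tfidf and what does it stand for")
print(more_tokens)
```

If stop words were removed after ngram generation instead, bigrams such as "don use" and "tfidf does" could not appear, because those words are not neighbours in the original sentences.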

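Does it make sense to use the max_df/min_df arguments together with the use_idf argument?

My opinion is that the whole point of term frequency-inverse document frequency is to allow re-weighting of highly frequent words (the words that would appear at the top of a sorted frequency list). This re-weighting takes the highest-frequency ngrams and moves them down the list to a lower position. Therefore, it is supposed to handle the max_df scenarios.

Perhaps it is more of a personal choice whether you want to move them down the list ("re-weight"/de-prioritize them) or remove them completely.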
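
To illustrate the difference between re-weighting and removal, this sketch fits one vectorizer with idf weighting and another with a max_df cutoff. The toy corpus and the 0.75 threshold are assumptions chosen for a tiny example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "tfidf" is in every document, "terms" in three, "rare" in one.
docs = [
    "tfidf weights terms",
    "tfidf ignores rare terms",
    "tfidf penalises common terms",
    "tfidf is simple",
]

# Re-weighting: every term is kept, but common terms get a smaller idf.
weighted = TfidfVectorizer(use_idf=True).fit(docs)
idf = {term: weighted.idf_[col] for term, col in weighted.vocabulary_.items()}
print(idf["tfidf"], idf["terms"], idf["rare"])

# Removal: terms above the document-frequency cutoff disappear entirely.
pruned = TfidfVectorizer(max_df=0.75).fit(docs)
print("tfidf" in pruned.vocabulary_)
```

With idf the most common term is still a feature, only with the lowest weight; with max_df it is no longer a column at all.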

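I use min_df a lot. It makes sense if you are working with a huge dataset, because rare words won't add value and will just cause a lot of processing issues. I don't use max_df much, but I'm sure there are scenarios, such as working with data like all of Wikipedia, where it may make sense to remove the top x%.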
