Interpreting the sum of TF-IDF scores of words across documents

alv*_*vas 18 python statistics nlp tf-idf gensim

First, let's extract the TF-IDF scores per term per document:

from gensim import corpora, models, similarities
documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

To print them out:

for doc in corpus_tfidf:
    print(doc)

[OUT]:

[(0, 0.4301019571350565), (1, 0.4301019571350565), (2, 0.4301019571350565), (3, 0.4301019571350565), (4, 0.2944198962221451), (5, 0.2944198962221451), (6, 0.2944198962221451)]
[(4, 0.3726494271826947), (7, 0.27219160459794917), (8, 0.3726494271826947), (9, 0.27219160459794917), (10, 0.3726494271826947), (11, 0.5443832091958983), (12, 0.3726494271826947)]
[(6, 0.438482464916089), (7, 0.32027755044706185), (9, 0.32027755044706185), (13, 0.6405551008941237), (14, 0.438482464916089)]
[(5, 0.3449874408519962), (7, 0.5039733231394895), (14, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
[(9, 0.21953536176370683), (10, 0.30055933182961736), (12, 0.30055933182961736), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
[(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.48507125007266594), (25, 0.24253562503633297)]
[(25, 0.31622776601683794), (26, 0.31622776601683794), (27, 0.6324555320336759), (28, 0.6324555320336759)]
[(25, 0.20466057569885868), (26, 0.20466057569885868), (29, 0.2801947048062438), (30, 0.40932115139771735), (31, 0.40932115139771735), (32, 0.40932115139771735), (33, 0.40932115139771735), (34, 0.40932115139771735)]
[(8, 0.6282580468670046), (26, 0.45889394536615247), (29, 0.6282580468670046)]

If we want to find the "saliency" or "importance" of the words in this corpus, can we simply sum the TF-IDF scores across all documents and divide by the number of documents?

>>> from collections import Counter
>>> tfidf_saliency = Counter()
>>> for doc in corpus_tfidf:
...     for word, score in doc:
...         tfidf_saliency[word] += score / len(corpus_tfidf)
... 
>>> tfidf_saliency
Counter({7: 0.12182694202050007, 8: 0.11121194156107769, 26: 0.10886469856464989, 29: 0.10093919463036093, 9: 0.09022272408985754, 14: 0.08705221175200946, 25: 0.08482488519466996, 6: 0.08143359568202602, 10: 0.07480097322359022, 12: 0.07480097322359022, 4: 0.07411881371164887, 13: 0.07117278898823597, 5: 0.07104525967490458, 27: 0.07027283689263066, 28: 0.07027283689263066, 11: 0.060487023243988705, 15: 0.055997035904387725, 16: 0.055997035904387725, 21: 0.05389680556362955, 22: 0.05389680556362955, 23: 0.05389680556362955, 24: 0.05389680556362955, 17: 0.048785635947490406, 18: 0.048785635947490406, 19: 0.048785635947490406, 20: 0.048785635947490406, 0: 0.04778910634833961, 1: 0.04778910634833961, 2: 0.04778910634833961, 3: 0.04778910634833961, 30: 0.045480127933079706, 31: 0.045480127933079706, 32: 0.045480127933079706, 33: 0.045480127933079706, 34: 0.045480127933079706})

Looking at the output, could we assume that the most "salient" words in the corpus are:

>>> dictionary[7]
u'system'
>>> dictionary[8]
u'survey'
>>> dictionary[26]
u'graph'

If so, what is the mathematical interpretation of the sum of TF-IDF scores of a word across documents?

sto*_*vfl 6

The interpretation of TF-IDF over a corpus is the highest TF-IDF of a given term anywhere in the corpus.

Find the top words in corpus_tfidf:

topWords = {}
for doc in corpus_tfidf:
    for iWord, tf_idf in doc:
        # keep the highest TF-IDF score seen for each word across all documents
        if iWord not in topWords:
            topWords[iWord] = 0
        if tf_idf > topWords[iWord]:
            topWords[iWord] = tf_idf

# print the six words with the highest TF-IDF scores
for i, item in enumerate(sorted(topWords.items(), key=lambda x: x[1], reverse=True), 1):
    print("%2s: %-13s %s" % (i, dictionary[item[0]], item[1]))
    if i == 6: break

Output comparison:
Note: Could not use gensim to recreate the dictionary used for corpus_tfidf,
so only word indices can be shown.

Question tfidf_saliency   topWords(corpus_tfidf)  Other TF-IDF implementation  
---------------------------------------------------------------------------  
1: Word(7)   0.121        1: Word(13)    0.640    1: paths         0.376019  
2: Word(8)   0.111        2: Word(27)    0.632    2: intersection  0.376019  
3: Word(26)  0.108        3: Word(28)    0.632    3: survey        0.366204  
4: Word(29)  0.100        4: Word(8)     0.628    4: minors        0.366204  
5: Word(9)   0.090        5: Word(29)    0.628    5: binary        0.300815  
6: Word(14)  0.087        6: Word(11)    0.544    6: generation    0.300815  

The computation of TF-IDF always takes the corpus into account.

Tested with Python 3.4.2


mhb*_*ari 2

There are two contexts in which saliency can be computed:

  1. saliency in the corpus
  2. saliency in a single document

Saliency in the corpus can be computed by counting the occurrences of a particular word in the corpus, or by the inverse of the count of documents in which the word appears (IDF = inverse document frequency). This works because words that carry a specific meaning do not appear everywhere.
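The corpus-level idea above can be sketched in plain Python (not gensim's implementation; the three toy documents are invented for illustration). Words that occur in fewer documents get a higher IDF and are thus more "salient" for the corpus:

```python
import math

docs = [
    "human machine interface",
    "survey of user opinion",
    "the user interface",
]

# DF: number of documents containing each word
df = {}
for doc in docs:
    for word in set(doc.split()):
        df[word] = df.get(word, 0) + 1

# IDF = log(N / DF); rarer words score higher
idf = {w: math.log(len(docs) / n) for w, n in df.items()}

print(sorted(idf.items(), key=lambda kv: -kv[1]))
```

Here "user" appears in 2 of 3 documents and so scores lower than "machine", which appears in only one.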

Saliency within a document is computed with TF-IDF, because it combines two kinds of information: global information (corpus-based) and local information (document-based). The claim "a word with a higher within-document frequency is more important in the current document" is not entirely true or false, because it also depends on the word's global saliency. In any particular document you have many words with high frequency, such as "it, is, am, are, ...", but these words are not important in any document and you can remove them as stop words!
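A minimal sketch of that combination (again plain Python with invented documents, not gensim's weighting, which also normalizes vectors): "is" is frequent inside the first document but appears in every document, so its TF-IDF collapses to zero, while a discriminative word like "graph" keeps a positive score.

```python
import math

docs = [
    "this is a graph survey".split(),
    "this is a user survey".split(),
    "graph minors is a topic".split(),
]

def tf_idf(word, doc, docs):
    tf = doc.count(word) / len(doc)            # local information (document-based)
    df = sum(1 for d in docs if word in d)     # global information (corpus-based)
    idf = math.log(len(docs) / df)
    return tf * idf

print(tf_idf("is", docs[0], docs))      # 0.0: occurs in every document
print(tf_idf("graph", docs[0], docs))   # positive: occurs in 2 of 3 documents
```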

---- EDIT ----

The denominator (= len(corpus_tfidf)) is a constant value and can be ignored if you care about ordinal rather than cardinal measurements. On the other hand, we know that IDF means inverse document frequency, so IDF can be restated as 1/DF. We also know that DF is a corpus-level value while TF is a document-level value. Summing TF-IDF turns the document-level TF into a corpus-level TF. In fact, the summation equals this formula:

count(word) / count(documents containing word)

This measurement could be called an inverse-scattering value. When the value rises, it means the word is concentrated in a smaller subset of documents, and vice versa.
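The identity above can be checked in a sketch, under simplifying assumptions that differ from gensim's defaults (raw counts as TF, IDF taken as 1/DF with no log and no vector normalization; the toy documents are invented). Summing TF-IDF over all documents then yields exactly count(word) / count(documents containing word):

```python
docs = [
    "graph survey graph".split(),
    "user survey".split(),
    "graph minors".split(),
]

word = "graph"
df = sum(1 for d in docs if word in d)                 # documents containing word
summed = sum(d.count(word) * (1 / df) for d in docs)   # sum of TF * (1/DF)

total = sum(d.count(word) for d in docs)               # count(word) in corpus
print(summed, total / df)   # both equal 3 / 2 = 1.5
```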

I believe this formula is not that useful a measurement.