How do I compute a word-word co-occurrence matrix with sklearn?

new*_*v14 14 python matrix scikit-learn

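I am looking for a module in sklearn that lets you derive a word-word co-occurrence matrix.

I can get the document-term matrix, but I don't know how to go about obtaining a word-word matrix of co-occurrences.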

tit*_*ata 22

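Here is my example solution using CountVectorizer in scikit-learn. Referring to this post, you can simply use matrix multiplication to get the word-word co-occurrence matrix.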

from sklearn.feature_extraction.text import CountVectorizer
docs = ['this this this book',
        'this cat good',
        'cat good shit']
count_model = CountVectorizer(ngram_range=(1,1)) # default unigram model
X = count_model.fit_transform(docs)
# X[X > 0] = 1 # run this line if you don't want extra within-text co-occurrence (see below)
Xc = (X.T * X) # this is the co-occurrence matrix in sparse CSR format
Xc.setdiag(0) # sometimes you want to set same-word co-occurrence to 0
print(Xc.todense()) # print out the matrix in dense format
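To see why the matrix product gives co-occurrence counts, here is a minimal sketch that recomputes the same numbers with explicit loops. Entry (i, j) of the product is the sum, over documents, of count(word i) times count(word j). The variable names `dense` and `brute` are only illustrative, and `@` is used as the matrix-multiplication operator.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book',
        'this cat good',
        'cat good shit']
X = CountVectorizer().fit_transform(docs)

# Matrix product: entry (i, j) sums X[d, i] * X[d, j] over all documents d.
Xc = (X.T @ X).toarray()

# The same quantity computed with explicit loops (slow, for illustration only).
dense = X.toarray()
n_docs, n_words = dense.shape
brute = np.zeros((n_words, n_words), dtype=dense.dtype)
for d in range(n_docs):
    for i in range(n_words):
        for j in range(n_words):
            brute[i, j] += dense[d, i] * dense[d, j]

print(np.array_equal(Xc, brute))  # True
```

Note that this counts co-occurrence at the document level: two words co-occur whenever they appear in the same document, regardless of distance.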

You can also refer to the vocabulary of count_model,

count_model.vocabulary_
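Since `vocabulary_` maps each word to its column index, it can be used to label the rows and columns of the matrix. The following is a small sketch that assumes pandas is available; `words` and `df` are illustrative names.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book',
        'this cat good',
        'cat good shit']
count_model = CountVectorizer()
X = count_model.fit_transform(docs)
Xc = X.T @ X

# vocabulary_ maps word -> column index, so order the words by that index.
words = sorted(count_model.vocabulary_, key=count_model.vocabulary_.get)
df = pd.DataFrame(Xc.toarray(), index=words, columns=words)

print(df)
print(df.loc['cat', 'good'])  # 2: 'cat' and 'good' share two documents
```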

Or, if you want to normalize by the diagonal component (see the answer in the previous post):

import scipy.sparse as sp
Xc = (X.T * X)
g = sp.diags(1./Xc.diagonal())
Xc_norm = g * Xc # normalized co-occurrence matrix
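As a rough illustration of what this normalization does: multiplying by the inverse diagonal on the left divides row i by `Xc[i, i]`. The sketch below assumes binary counts (`binary=True`), in which case the diagonal holds each word's document frequency and row i can be read as "the fraction of documents containing word i that also contain word j". The result is generally not symmetric.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book',
        'this cat good',
        'cat good shit']
# Assumption for this sketch: binary counts, so Xc[i, i] is a document frequency.
count_model = CountVectorizer(binary=True)
X = count_model.fit_transform(docs)

Xc = X.T @ X
g = sp.diags(1.0 / Xc.diagonal())
Xc_norm = (g @ Xc).toarray()  # row i is divided by Xc[i, i]

cat = count_model.vocabulary_['cat']
good = count_model.vocabulary_['good']
this = count_model.vocabulary_['this']
print(Xc_norm[cat, good])  # 1.0: every document with 'cat' also has 'good'
print(Xc_norm[this, cat])  # 0.5: half of the documents with 'this' have 'cat'
```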

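Also note @Federico Caccia's answer: if you don't want spurious co-occurrences caused by a word being repeated within the same text, set every count greater than 1 to 1, e.g.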

X[X > 0] = 1 # do this line first before computing cooccurrence
Xc = (X.T * X)
...
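A quick comparison shows the effect on the sample documents. This sketch binarizes with `(X > 0).astype(X.dtype)`, which should be equivalent to the in-place `X[X > 0] = 1` but leaves `X` untouched; `raw` and `binarized` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book',
        'this cat good',
        'cat good shit']
count_model = CountVectorizer()
X = count_model.fit_transform(docs)
this = count_model.vocabulary_['this']
book = count_model.vocabulary_['book']

# Raw counts: 'this' occurs 3 times in the first document, so the pair counts 3.
raw = (X.T @ X).toarray()

# Binarized counts: each document contributes at most 1 per word pair.
X_bin = (X > 0).astype(X.dtype)
binarized = (X_bin.T @ X_bin).toarray()

print(raw[this, book])        # 3
print(binarized[this, book])  # 1
```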


Anw*_*vic 11

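None of the provided answers takes the concept of a moving window into account. So I wrote my own function that finds the co-occurrence matrix by applying a moving window of a given size.

This function takes a list of sentences and a window_size number; it returns a pandas.DataFrame object representing the co-occurrence matrix: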

from collections import defaultdict

import numpy as np
import pandas as pd

def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of this sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                key = tuple( sorted([t, token]) )
                d[key] += 1
    
    # formulate the dictionary into dataframe
    vocab = sorted(vocab) # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df

Let's try it out on the following two simple sentences:

>>> text = ["I go to school every day by bus .",
            "i go to theatre every night by bus"]
>>> 
>>> df = co_occurrence(text, 2)
>>> df
         .  bus  by  day  every  go  i  night  school  theatre  to
.        0    1   1    0      0   0  0      0       0        0   0
bus      1    0   2    1      0   0  0      1       0        0   0
by       1    2   0    1      2   0  0      1       0        0   0
day      0    1   1    0      1   0  0      0       1        0   0
every    0    0   2    1      0   0  0      1       1        1   2
go       0    0   0    0      0   0  2      0       1        1   2
i        0    0   0    0      0   2  0      0       0        0   2
night    0    1   1    0      1   0  0      0       0        1   0
school   0    0   0    1      1   1  0      0       0        0   1
theatre  0    0   0    0      1   1  0      1       0        0   1
to       0    0   0    0      2   2  2      0       1        1   0

[11 rows x 11 columns]

Now we have our co-occurrence matrix.
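If you only need the pair counts and not the full DataFrame, the same windowing logic can be sketched more compactly with a `Counter`. The function name `window_pair_counts` is hypothetical, and the tokenizer is the same naive lowercase-and-split step used above.

```python
from collections import Counter

def window_pair_counts(sentences, window_size):
    """Count unordered token pairs at most `window_size` positions apart."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenizer
        for i, token in enumerate(tokens):
            # Only look forward, so each pair is counted once per occurrence.
            for neighbour in tokens[i + 1 : i + 1 + window_size]:
                counts[tuple(sorted((token, neighbour)))] += 1
    return counts

text = ["I go to school every day by bus .",
        "i go to theatre every night by bus"]
counts = window_pair_counts(text, 2)

print(counts[('bus', 'by')])      # 2
print(counts[('day', 'school')])  # 1
print(counts[('bus', 'i')])       # 0: never within two tokens of each other
```

Keys are sorted tuples, so look pairs up in alphabetical order. These values match the corresponding cells of the table above.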


Gui*_*sch 2

You can use the ngram_range parameter of CountVectorizer or TfidfVectorizer.

Code example:

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2)) # (2, 2) means you only want pairs of two consecutive words
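To make it concrete what `ngram_range=(2, 2)` produces, here is a minimal sketch on two made-up sentences. Note that bigrams only capture adjacent words, which is a narrower notion than the document-level or windowed co-occurrence discussed in the other answers.

```python
from sklearn.feature_extraction.text import CountVectorizer

samples = ['awesome unicorns are awesome', 'batman forever and ever']
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X = bigram_vectorizer.fit_transform(samples)

# Each feature is a pair of consecutive tokens; list them alphabetically.
features = sorted(bigram_vectorizer.vocabulary_)
print(features)
# ['and ever', 'are awesome', 'awesome unicorns',
#  'batman forever', 'forever and', 'unicorns are']
print(X.shape)  # (2, 6): two documents, six distinct bigrams
```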

If you want to state explicitly which word co-occurrences to count, use the vocabulary parameter, i.e.: vocabulary = {'awesome unicorns':0, 'batman forever':1}

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Self-explanatory, ready-to-use code with predefined word-word co-occurrences. In this example we are tracking the co-occurrences of awesome unicorns and batman forever.

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
samples = ['awesome unicorns are awesome','batman forever and ever','I love batman forever']
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary = {'awesome unicorns':0, 'batman forever':1}) 
co_occurrences = bigram_vectorizer.fit_transform(samples)
print('Printing sparse matrix:', co_occurrences)
print('Printing dense matrix (cols are vocabulary keys 0-> "awesome unicorns", 1-> "batman forever")', co_occurrences.todense())
sum_occ = np.sum(co_occurrences.todense(),axis=0)
print('Sum of word-word occurrences:', sum_occ)
# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names()
print('Pretty printing of co_occurrences count:', list(zip(bigram_vectorizer.get_feature_names_out().tolist(),np.array(sum_occ)[0].tolist())))

The final output is ('awesome unicorns', 1), ('batman forever', 2), which corresponds exactly to the samples data we provided.