How do I build a PPMI matrix from a text corpus?

Question

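I'm trying to build word embeddings on the Brown corpus with an SVD model. To do that, I first want to generate a word-word co-occurrence matrix and then convert it to a PPMI matrix before applying the SVD factorization.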

I tried creating the co-occurrence matrix with scikit-learn's CountVectorizer:

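from sklearn.feature_extraction.text import CountVectorizer

count_model = CountVectorizer(ngram_range=(1,1))  # unigrams only

X = count_model.fit_transform(corpus)  # corpus: iterable of document strings
X[X > 0] = 1  # binarize: record presence of a term in a document, not its frequency
Xc = (X.T @ X)  # term-term counts: number of documents in which both terms appear
Xc.setdiag(0)  # ignore co-occurrence of a term with itself
print(Xc.todense())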

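But:

(1) I'm not sure how to control the context window with this approach. I'd like to try different context sizes and see how they affect the result.

(2) Given PMI(a, b) = log( p(a, b) / (p(a) p(b)) ), how do I compute PPMI correctly?

Any help with the reasoning and the implementation would be much appreciated!

Thanks (-: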

Answer 1

Anw*_*vic 8

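I tried the code you provided, but I couldn't apply a moving window to it. So I wrote my own function to do that. It takes a list of sentences and a window_size number, and returns a pandas.DataFrame object representing the co-occurrence matrix: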

from collections import defaultdict

import numpy as np
import pandas as pd

def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # naive preprocessing (use a proper tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                key = tuple( sorted([t, token]) )
                d[key] += 1
    
    # formulate the dictionary into dataframe
    vocab = sorted(vocab) # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int64),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df

Let's try it on the following two simple sentences:

>>> text = ["I go to school every day by bus .",
            "i go to theatre every night by bus"]
>>> 
>>> df = co_occurrence(text, 2)
>>> df
         .  bus  by  day  every  go  i  night  school  theatre  to
.        0    1   1    0      0   0  0      0       0        0   0
bus      1    0   2    1      0   0  0      1       0        0   0
by       1    2   0    1      2   0  0      1       0        0   0
day      0    1   1    0      1   0  0      0       1        0   0
every    0    0   2    1      0   0  0      1       1        1   2
go       0    0   0    0      0   0  2      0       1        1   2
i        0    0   0    0      0   2  0      0       0        0   2
night    0    1   1    0      1   0  0      0       0        1   0
school   0    0   0    1      1   1  0      0       0        0   1
theatre  0    0   0    0      1   1  0      1       0        0   1
to       0    0   0    0      2   2  2      0       1        1   0

[11 rows x 11 columns]
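
To see how the window size changes the counts, which is point (1) of the question, here is a minimal standard-library sketch. The helper name window_pair_counts is made up for illustration, and it assumes the same naive whitespace tokenization as co_occurrence above. It only looks to the right of each token, but because every pair is stored under a sorted key, the result is equivalent to a symmetric window of window_size tokens on each side.

```python
from collections import Counter

def window_pair_counts(sentences, window_size):
    """Count unordered token pairs that occur within `window_size` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenization
        for i, token in enumerate(tokens):
            # Looking only to the right is enough: each pair is stored
            # under a sorted key, so the counts come out symmetric.
            for neighbor in tokens[i + 1 : i + 1 + window_size]:
                counts[tuple(sorted((token, neighbor)))] += 1
    return counts

sentences = ["I go to school every day by bus .",
             "i go to theatre every night by bus"]
for size in (1, 2, 3):
    counts = window_pair_counts(sentences, size)
    # window size, total pair count, count of (go, i), count of (i, to)
    print(size, sum(counts.values()), counts[("go", "i")], counts[("i", "to")])
# 1 15 2 0
# 2 28 2 2
# 3 39 2 2
```

With window_size=1 the pair ("i", "to") never co-occurs; it only appears once the window reaches two tokens. A wider window adds more, and more distant, pairs to the matrix.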

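Now we have the co-occurrence matrix. Let's compute the (positive) pointwise mutual information, PPMI for short. I used the code from slides by Stanford professor Christopher Potts, which the original post summarizes in a figure (image not preserved here).

Calling pmi with positive=True gives PPMI: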

def pmi(df, positive=True):
    col_totals = df.sum(axis=0)
    total = col_totals.sum()
    row_totals = df.sum(axis=1)
    expected = np.outer(row_totals, col_totals) / total
    df = df / expected
    # Silence distracting warnings about log(0):
    with np.errstate(divide='ignore'):
        df = np.log(df)
    df[np.isinf(df)] = 0.0  # treat log(0) = -inf (zero co-occurrence) as 0
    if positive:
        df[df < 0] = 0.0
    return df
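
To connect this function to the formula in the question, the derivation below restates PMI in terms of counts. The notation is my own and not from the slides: c(a, b) is a cell of the co-occurrence matrix, c(a, ·) and c(·, b) are the row and column totals, and N is the grand total. The array expected in the code holds c(a, ·) c(·, b) / N, so df / expected is exactly the ratio inside the logarithm. The last line works the ('.', 'bus') cell of the example matrix above, where the count is 1, the row total is 2, the column total is 5 and the grand total is 56.

```latex
\begin{aligned}
\mathrm{PMI}(a, b)
  &= \log \frac{p(a, b)}{p(a)\, p(b)}
   = \log \frac{c(a, b) / N}{\bigl(c(a, \cdot) / N\bigr)\bigl(c(\cdot, b) / N\bigr)}
   = \log \frac{c(a, b)\, N}{c(a, \cdot)\, c(\cdot, b)} \\
\mathrm{PPMI}(a, b)
  &= \max\bigl(0,\ \mathrm{PMI}(a, b)\bigr) \\
\mathrm{PMI}(\text{.}, \text{bus})
  &= \log \frac{1 \cdot 56}{2 \cdot 5} = \log 5.6 \approx 1.7228
\end{aligned}
```

The logarithm is the natural log, matching np.log. Cells with a zero count have PMI of negative infinity, which the function replaces with 0.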

Let's try it:

>>> ppmi = pmi(df, positive=True)
>>> ppmi
                .       bus        by  ...    school   theatre        to
.        0.000000  1.722767  1.386294  ...  0.000000  0.000000  0.000000
bus      1.722767  0.000000  1.163151  ...  0.000000  0.000000  0.000000
by       1.386294  1.163151  0.000000  ...  0.000000  0.000000  0.000000
day      0.000000  1.029619  0.693147  ...  1.252763  0.000000  0.000000
every    0.000000  0.000000  0.693147  ...  0.559616  0.559616  0.559616
go       0.000000  0.000000  0.000000  ...  0.847298  0.847298  0.847298
i        0.000000  0.000000  0.000000  ...  0.000000  0.000000  1.252763
night    0.000000  1.029619  0.693147  ...  0.000000  1.252763  0.000000
school   0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.559616
theatre  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.559616
to       0.000000  0.000000  0.000000  ...  0.559616  0.559616  0.000000

[11 rows x 11 columns]
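
The question's stated goal is to feed the PPMI matrix into SVD. The following is only a rough sketch of that last step, assuming NumPy's np.linalg.svd. The helper name svd_embeddings, the toy matrix values, and the choice to scale the left singular vectors by the singular values are all assumptions for illustration, not part of the answer above.

```python
import numpy as np

def svd_embeddings(matrix, k):
    """Return k-dimensional row embeddings from a truncated SVD (hypothetical helper)."""
    m = np.asarray(matrix, dtype=float)
    # Singular values in s come back sorted in descending order.
    u, s, _ = np.linalg.svd(m, full_matrices=False)
    # Keep the top-k components; scaling by s is one common convention.
    return u[:, :k] * s[:k]

# Toy symmetric PPMI-like matrix; the values are made up for illustration.
toy = np.array([[0.0, 1.2, 0.0],
                [1.2, 0.0, 0.7],
                [0.0, 0.7, 0.0]])
emb = svd_embeddings(toy, k=2)
print(emb.shape)  # (3, 2)
```

With the DataFrame from above, something like svd_embeddings(ppmi.values, k=...) should give one row vector per vocabulary word, in the order of ppmi.index. A dense SVD like this is likely too expensive for a Brown-sized vocabulary, where a sparse truncated solver would be a better fit.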
