dje*_*an9 4 python nlp word-embedding
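I am trying to build word embeddings on the Brown corpus with an SVD-based model. To do that, I first want to generate a word-word co-occurrence matrix and then convert it to a PPMI matrix that serves as input to the SVD step.
I tried to create the co-occurrence counts with scikit-learn's CountVectorizer: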
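from sklearn.feature_extraction.text import CountVectorizer

count_model = CountVectorizer(ngram_range=(1, 1))  # unigrams only
X = count_model.fit_transform(corpus)  # sparse document-term count matrix
X[X > 0] = 1  # binarize: did the word occur in the document at all?
Xc = X.T @ X  # word-word co-occurrence, counted per document
Xc.setdiag(0)  # drop self co-occurrence
print(Xc.todense())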
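But:
(1) I am not sure how to control the context window with this approach. I want to try different context sizes and see how they affect the result.
(2) Given PMI(a, b) = log( p(a, b) / (p(a) p(b)) ), how do I compute PPMI correctly?
Any help with the reasoning and the implementation would be greatly appreciated!
Thanks (-: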
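I tried the provided code, but I could not apply a moving window to it. So I wrote my own function to do that. It takes a list of sentences and a window_size number, and returns a pandas.DataFrame object representing the co-occurrence matrix: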
from collections import defaultdict

import numpy as np
import pandas as pd


def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use a proper tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            # the next window_size tokens form the context of token
            next_token = text[i + 1 : i + 1 + window_size]
            for t in next_token:
                # sort the pair so (a, b) and (b, a) share one key
                key = tuple(sorted([t, token]))
                d[key] += 1

    # turn the dictionary into a dataframe
    vocab = sorted(vocab)  # sort vocab
    # int32 rather than int16: int16 overflows past 32767 on a large corpus
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int32),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df
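A dense DataFrame grows with the square of the vocabulary size, which may become a problem on a corpus the size of Brown. The following is a minimal sketch of the same counting scheme that stores the result in a SciPy sparse matrix instead. The name co_occurrence_sparse and the (matrix, vocab) return value are assumptions made for illustration, not part of the code above.

```python
from collections import defaultdict

import numpy as np
from scipy.sparse import coo_matrix


def co_occurrence_sparse(sentences, window_size):
    """Count pairs like co_occurrence, but return (CSR matrix, sorted vocab)."""
    counts = defaultdict(int)
    vocab = set()
    for text in sentences:
        tokens = text.lower().split()  # naive whitespace tokenization
        vocab.update(tokens)
        for i, token in enumerate(tokens):
            # the next window_size tokens are the context of token
            for t in tokens[i + 1 : i + 1 + window_size]:
                counts[tuple(sorted((t, token)))] += 1

    vocab = sorted(vocab)
    index = {word: k for k, word in enumerate(vocab)}
    rows, cols, vals = [], [], []
    for (a, b), n in counts.items():
        rows.append(index[a])
        cols.append(index[b])
        vals.append(n)
        if a != b:  # mirror off-diagonal entries to keep the matrix symmetric
            rows.append(index[b])
            cols.append(index[a])
            vals.append(n)

    size = len(vocab)
    matrix = coo_matrix((vals, (rows, cols)), shape=(size, size), dtype=np.int64)
    return matrix.tocsr(), vocab


matrix, vocab = co_occurrence_sparse(
    ["I go to school every day by bus .", "i go to theatre every night by bus"], 2
)
print(matrix.shape, matrix.nnz)
```

Row and column k of the matrix correspond to vocab[k], so a single count can be looked up with matrix[vocab.index("by"), vocab.index("bus")].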
Let's try it on the following two simple sentences:
>>> text = ["I go to school every day by bus .",
...         "i go to theatre every night by bus"]
>>>
>>> df = co_occurrence(text, 2)
>>> df
         .  bus  by  day  every  go  i  night  school  theatre  to
.        0    1   1    0      0   0  0      0       0        0   0
bus      1    0   2    1      0   0  0      1       0        0   0
by       1    2   0    1      2   0  0      1       0        0   0
day      0    1   1    0      1   0  0      0       1        0   0
every    0    0   2    1      0   0  0      1       1        1   2
go       0    0   0    0      0   0  2      0       1        1   2
i        0    0   0    0      0   2  0      0       0        0   2
night    0    1   1    0      1   0  0      0       0        1   0
school   0    0   0    1      1   1  0      0       0        0   1
theatre  0    0   0    0      1   1  0      1       0        0   1
to       0    0   0    0      2   2  2      0       1        1   0

[11 rows x 11 columns]
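Now we have the co-occurrence matrix. Next, let's compute the (positive) pointwise mutual information, PPMI for short. I used code found in slides by Christopher Potts, a professor at Stanford, which the following figure summarizes:
[Figure: pmi]
PPMI is the same as pmi with positive=True: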
import numpy as np


def pmi(df, positive=True):
    col_totals = df.sum(axis=0)
    total = col_totals.sum()
    row_totals = df.sum(axis=1)
    # counts expected if the row word and the column word were independent
    expected = np.outer(row_totals, col_totals) / total
    df = df / expected
    # Silence distracting warnings about log(0):
    with np.errstate(divide='ignore'):
        df = np.log(df)
    df[np.isinf(df)] = 0.0  # log(0) = -inf; treat unseen pairs as 0
    if positive:
        df[df < 0] = 0.0  # PPMI: clip negative PMI values to 0
    return df
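To see how this function relates to the definition PMI(a, b) = log( p(a, b) / (p(a) p(b)) ), it may help to write the probabilities as count ratios. The symbols below are my own notation, not taken from the slides: X is the co-occurrence matrix (df), r_a and c_b are the row and column totals, and N is total.

```latex
\mathrm{PMI}(a, b)
  = \log \frac{p(a, b)}{p(a)\, p(b)}
  = \log \frac{X_{ab} / N}{(r_a / N)\,(c_b / N)}
  = \log \frac{X_{ab}}{r_a c_b / N},
\qquad
\mathrm{PPMI}(a, b) = \max\bigl(\mathrm{PMI}(a, b),\, 0\bigr)
```

The denominator r_a c_b / N is the corresponding entry of expected in the code. As a worked check on the two example sentences, the pair (".", "bus") has X = 1, r = 2, c = 5 and N = 56, so PMI = log(1 · 56 / (2 · 5)) = log 5.6 ≈ 1.7228. Note that the function also sets PMI to 0 for pairs that never co-occur, where the logarithm would otherwise be -inf.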
Let's try it:
>>> ppmi = pmi(df, positive=True)
>>> ppmi
                .       bus        by  ...    school   theatre        to
.        0.000000  1.722767  1.386294  ...  0.000000  0.000000  0.000000
bus      1.722767  0.000000  1.163151  ...  0.000000  0.000000  0.000000
by       1.386294  1.163151  0.000000  ...  0.000000  0.000000  0.000000
day      0.000000  1.029619  0.693147  ...  1.252763  0.000000  0.000000
every    0.000000  0.000000  0.693147  ...  0.559616  0.559616  0.559616
go       0.000000  0.000000  0.000000  ...  0.847298  0.847298  0.847298
i        0.000000  0.000000  0.000000  ...  0.000000  0.000000  1.252763
night    0.000000  1.029619  0.693147  ...  0.000000  1.252763  0.000000
school   0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.559616
theatre  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.559616
to       0.000000  0.000000  0.000000  ...  0.559616  0.559616  0.000000

[11 rows x 11 columns]
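The question's final goal is to run SVD on the PPMI matrix, so here is a minimal sketch of that last step using np.linalg.svd. The function name svd_embeddings, the parameter k, and the toy matrix are assumptions for illustration. Scaling the left singular vectors by the square root of the singular values is one common choice, not the only one.

```python
import numpy as np


def svd_embeddings(ppmi, k):
    """Return one k-dimensional row vector per word from a PPMI matrix."""
    M = np.asarray(ppmi, dtype=float)
    # singular values in S come back sorted from largest to smallest
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    # keep the top-k directions; the sqrt(S) weighting is an assumption
    return U[:, :k] * np.sqrt(S[:k])


# toy symmetric matrix standing in for a real PPMI matrix
toy = np.array([[0.0, 1.2, 0.0],
                [1.2, 0.0, 0.7],
                [0.0, 0.7, 0.0]])
vectors = svd_embeddings(toy, k=2)
print(vectors.shape)  # (3, 2)
```

With the DataFrame from above, something like svd_embeddings(ppmi, k=5) should give one row per word in the same order as ppmi.index. The value of k is a hyperparameter to tune alongside the window size.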