Nay*_*raj 3 python nltk gensim word2vec tensorflow
I am currently training a Word2Vec model in Python on sentences that I provide. I then save and load the model to get the word embedding of every word in the training sentences. However, I get the following error:
KeyError: "word 'n1985_chicago_bears' not in vocabulary"
One of the sentences supplied during training is the following:
sportsteam n1985_chicago_bears teamplaysincity city chicago
So I would like to know why some words are missing from the vocabulary even though the model was trained on them as part of this sentence corpus.
Training a word2vec model on my own corpus
import nltk
import numpy as np
from termcolor import colored
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA
#PREPARING DATA
fname = '../data/sentences.txt'
with open(fname) as f:
    content = f.readlines()
# remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]
#TOKENIZING SENTENCES
sentences = []
for x in content:
    nltk_tokens = nltk.word_tokenize(x)
    sentences.append(nltk_tokens)
#TRAINING THE WORD2VEC MODEL
model = Word2Vec(sentences)
words = list(model.wv.vocab)
model.wv.save_word2vec_format('model.bin')
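Before saving, one quick sanity check is to compare the tokens the corpus contains against the vocabulary the model actually kept. A minimal, self-contained sketch (the sentence list and the kept-vocabulary list here are hypothetical stand-ins for `sentences` and `list(model.wv.vocab)` from the script above):

```python
# Stand-ins for the tokenized corpus and for list(model.wv.vocab);
# in the real script these come from the training code above.
sentences = [
    ["sportsteam", "n1985_chicago_bears", "teamplaysincity", "city", "chicago"],
    ["sportsteam", "hawks", "teamplaysincity", "city", "atlanta"],
]
vocab = ["sportsteam", "teamplaysincity", "city"]  # hypothetical kept vocabulary

corpus_tokens = {tok for sent in sentences for tok in sent}
missing = sorted(corpus_tokens - set(vocab))
# Tokens that were trained on but are absent from the model's vocabulary:
print(missing)  # ['atlanta', 'chicago', 'hawks', 'n1985_chicago_bears']
```

Any token that shows up in `missing` will raise exactly this kind of KeyError on lookup.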
Example sentences from sentences.txt
sportsteam hawks teamplaysincity city atlanta
stadiumoreventvenue honda_center stadiumlocatedincity city anaheim
sportsteam ducks teamplaysincity city anaheim
sportsteam n1985_chicago_bears teamplaysincity city chicago
stadiumoreventvenue philips_arena stadiumlocatedincity city atlanta
stadiumoreventvenue united_center stadiumlocatedincity city chicago
...
There are 1860 such lines in the sentences.txt file, each containing 5 words and no stop words.
After saving the model, I try to load model.bin from a different Python file in the same directory as the saved model, as shown below.
Loading the saved model.bin
import nltk
import numpy as np
from gensim import models
w = models.KeyedVectors.load_word2vec_format('model.bin', binary=True)
print(w['n1985_chicago_bears'])
However, I end up with the following error:
KeyError: "word 'n1985_chicago_bears' not in vocabulary"
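Since gensim's KeyedVectors supports Python's `in` operator, the lookup can be guarded instead of crashing. A minimal sketch, with a plain dict standing in for the loaded vectors `w` (the words and vectors here are made up):

```python
# A plain dict standing in for the loaded KeyedVectors object `w`;
# gensim's KeyedVectors supports `in` the same way.
w = {"city": [0.1, 0.2], "chicago": [0.3, 0.4]}  # hypothetical vectors

word = "n1985_chicago_bears"
if word in w:
    print(w[word])
else:
    print(f"'{word}' not in vocabulary")  # reached for this sample
```

This does not fix the missing word, but it makes it easy to see which lookups fail while debugging.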
Is there a way to get the word embedding for every word in the trained sentence corpus using this same approach? Any suggestions in this regard would be much appreciated.
The default min_count=5 in gensim's Word2Vec implementation is the likely cause: it looks like the token you are looking for, n1985_chicago_bears, occurs fewer than 5 times in your corpus, so it is pruned from the vocabulary. Change your min_count appropriately. The relevant signature is:
class gensim.models.word2vec.Word2Vec(sentences=None, corpus_file=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(), max_final_vocab=None)
import nltk
from gensim.models import Word2Vec

content = [
    "sportsteam hawks teamplaysincity city atlanta",
    "stadiumoreventvenue honda_center stadiumlocatedincity city anaheim",
    "sportsteam ducks teamplaysincity city anaheim",
    "sportsteam n1985_chicago_bears teamplaysincity city chicago",
    "stadiumoreventvenue philips_arena stadiumlocatedincity city atlanta",
    "stadiumoreventvenue united_center stadiumlocatedincity city chicago"
]
sentences = []
for x in content:
    nltk_tokens = nltk.word_tokenize(x)
    sentences.append(nltk_tokens)
# min_count=1 keeps every token, including ones that appear only once
model = Word2Vec(sentences, min_count=1)
print(model.wv['n1985_chicago_bears'])
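The pruning is easy to verify with a plain frequency count over the example lines. A minimal sketch using `collections.Counter` (whitespace splitting stands in for the nltk tokenizer, which yields the same tokens for these simple lines):

```python
from collections import Counter

# The example lines from the answer above, split on whitespace.
content = [
    "sportsteam hawks teamplaysincity city atlanta",
    "stadiumoreventvenue honda_center stadiumlocatedincity city anaheim",
    "sportsteam ducks teamplaysincity city anaheim",
    "sportsteam n1985_chicago_bears teamplaysincity city chicago",
    "stadiumoreventvenue philips_arena stadiumlocatedincity city atlanta",
    "stadiumoreventvenue united_center stadiumlocatedincity city chicago",
]

counts = Counter(tok for line in content for tok in line.split())
# With the default min_count=5, only tokens seen at least 5 times survive.
kept = {tok for tok, n in counts.items() if n >= 5}
print(counts["n1985_chicago_bears"])  # 1 -> pruned under the default min_count=5
print(sorted(kept))                   # ['city']
```

On this sample only `city` (6 occurrences) would survive the default threshold, which is exactly why lowering min_count (or providing more data) is needed for rare entity tokens like these.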
Viewed: 2924 times