gla*_*313 asked (11 votes) · tags: python, word2vec, tensorflow
I'm new to TensorFlow and word2vec. I just studied word2vec_basic.py, which trains a model with the Skip-gram algorithm. Now I want to train with the CBOW algorithm instead. Can I achieve that by simply swapping train_inputs and train_labels?
Zic*_*ang answered (14 votes):
I don't think the CBOW model can be implemented simply by flipping the train_inputs and train_labels of Skip-gram, because the CBOW architecture uses the sum of the surrounding words' vectors as a single instance for the classifier to predict from. For example, you should use [the, brown] together to predict quick, rather than using the to predict quick.
To implement CBOW, you have to write a new generate_batch generator function and sum up the vectors of the surrounding words before applying logistic regression. I wrote an example you can refer to: https://github.com/wangz10/tensorflow-playground/blob/master/word2vec.py#L105
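To make that summation concrete, here is a tiny numpy illustration (the vectors are made-up numbers, purely to show the shape of the computation):

import numpy as np

# Made-up 4-dimensional embeddings for the context words of "quick"
v_the = np.array([0.1, -0.2, 0.3, 0.0])
v_brown = np.array([0.2, 0.1, -0.1, 0.4])

# CBOW feeds the classifier ONE combined context vector per training
# instance, instead of one (input, label) pair per context word
context_vector = v_the + v_brown
print(context_vector)  # [ 0.3 -0.1  0.2  0.4]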
For CBOW, you only need to change a few parts of word2vec_basic.py. Overall, the training structure and method stay the same.
Which parts of word2vec_basic.py should I change?
1) The way it generates pairs of training data, because in CBOW you predict the center word from the context words rather than the other way around.
The new version of generate_batch would be:
def generate_batch(batch_size, bag_window):
    global data_index
    span = 2 * bag_window + 1  # [ bag_window target bag_window ]
    batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size):
        # pop the center word as the label; the rest of the window is the context
        buffer_list = list(buffer)
        labels[i, 0] = buffer_list.pop(bag_window)
        batch[i] = buffer_list
        # slide the window to the next position
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels
The new training data for CBOW is then:
data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the']
#with bag_window = 1:
batch: [['anarchism', 'as'], ['originated', 'a'], ['as', 'term'], ['a', 'of']]
labels: ['originated', 'as', 'a', 'term']
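For reference, a printout like the one above can be produced along these lines (assuming data, reverse_dictionary and data_index from word2vec_basic.py are in scope, with data_index starting at 0):

data_index = 0
batch, labels = generate_batch(batch_size=4, bag_window=1)
print('batch:', [[reverse_dictionary[w] for w in row] for row in batch])
print('labels:', [reverse_dictionary[w] for w in labels.reshape(4)])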
Compare this with the data for Skip-gram:
#with num_skips = 2 and skip_window = 1:
batch: ['originated', 'originated', 'as', 'as', 'a', 'a', 'term', 'term', 'of', 'of', 'abuse', 'abuse', 'first', 'first', 'used', 'used']
labels: ['as', 'anarchism', 'originated', 'a', 'term', 'as', 'a', 'of', 'term', 'abuse', 'of', 'first', 'used', 'abuse', 'against', 'first']
2) You therefore also need to change the shape of the input placeholder from
train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
to
train_dataset = tf.placeholder(tf.int32, shape=[batch_size, bag_window * 2])
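With the new shape, the embedding lookup returns one vector per context word, and the sum collapses them into a single context vector per example. A quick shape check (embeddings is the variable from word2vec_basic.py; context_sum is just an illustrative name, and step 3 below inlines the same tf.reduce_sum):

# embeddings: [vocabulary_size, embedding_size]
embed = tf.nn.embedding_lookup(embeddings, train_dataset)
# embed: [batch_size, bag_window * 2, embedding_size]
context_sum = tf.reduce_sum(embed, 1)
# context_sum: [batch_size, embedding_size]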
3) The loss function:
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(
        weights=softmax_weights,
        biases=softmax_biases,
        inputs=tf.reduce_sum(embed, 1),
        labels=train_labels,
        num_sampled=num_sampled,
        num_classes=vocabulary_size))
Note the inputs = tf.reduce_sum(embed, 1): it sums the vectors of the surrounding words, as Zichen Wang mentioned.
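Putting 2) and 3) together, the graph-side changes amount to roughly the following (a sketch, not the exact word2vec_basic.py code; the hyperparameter values and the Adagrad optimizer are illustrative choices):

import math
import tensorflow as tf

batch_size = 128       # illustrative hyperparameters
bag_window = 1
embedding_size = 128
vocabulary_size = 50000
num_sampled = 64

train_dataset = tf.placeholder(tf.int32, shape=[batch_size, bag_window * 2])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
softmax_weights = tf.Variable(tf.truncated_normal(
    [vocabulary_size, embedding_size], stddev=1.0 / math.sqrt(embedding_size)))
softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

embed = tf.nn.embedding_lookup(embeddings, train_dataset)  # one row per context word
loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=softmax_weights, biases=softmax_biases,
    inputs=tf.reduce_sum(embed, 1), labels=train_labels,
    num_sampled=num_sampled, num_classes=vocabulary_size))
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)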
That's it!
Basically, yes:
For the given text the quick brown fox jumped over the lazy dog, the CBOW instances with a window size of 1 would be
([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...
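A minimal pure-Python sketch of how those (context, target) pairs fall out of the text (illustrative only; the real pipeline works on word ids rather than strings):

text = "the quick brown fox jumped over the lazy dog".split()
window = 1

# Slide a window over the text: the center word is the target,
# the words on either side form the context bag
pairs = [(text[i - window:i] + text[i + 1:i + window + 1], text[i])
         for i in range(window, len(text) - window)]

print(pairs[:3])
# [(['the', 'brown'], 'quick'), (['quick', 'fox'], 'brown'), (['brown', 'jumped'], 'fox')]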