How to build a language model with LSTM that assigns the probability of occurrence of a given sentence

Swa*_*amy 5 machine-learning neural-network deep-learning keras

Currently, I am doing this with a trigram model. It assigns the probability of occurrence of a given sentence, but it is limited to a context of only 2 words. An LSTM, however, can exploit a much longer context. So how do I build an LSTM model that assigns the probability of occurrence to a given sentence?

rvi*_*nas 11

I have just coded a very simple example showing how to compute the probability of occurrence of a sentence with an LSTM model. The full code can be found here.

Let's say we want to predict the probability of occurrence of sentences from the following dataset (this rhyme was published in Mother Goose's Melody, London, around 1765):

# Data
data = ["Two little dicky birds",
        "Sat on a wall,",
        "One called Peter,",
        "One called Paul.",
        "Fly away, Peter,",
        "Fly away, Paul!",
        "Come back, Peter,",
        "Come back, Paul."]

First, let's use keras.preprocessing.text.Tokenizer to create a vocabulary and tokenize the sentences:

from keras.preprocessing.text import Tokenizer

# Preprocess data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)
vocab = tokenizer.word_index
seqs = tokenizer.texts_to_sequences(data)
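
If you want to see what the tokenizer produced, a quick check like the one below prints the word-to-index mapping and the integer-encoded verses. The exact indices depend on word frequencies in the data, so the commented output is only illustrative:

# Quick sanity check (illustrative): inspect the word-index mapping and the encoded verses
print(vocab)    # e.g. {'peter': 1, 'paul': 2, 'one': 3, ...}; indices are assigned by frequency
print(seqs[0])  # integer ids for "Two little dicky birds"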

Our model will take a sequence of words as input (the context) and will output the conditional probability distribution of each word in the vocabulary given that context. To this end, we prepare the training data by padding the sequences and sliding windows over them:

import numpy as np
from keras.preprocessing.sequence import pad_sequences

def prepare_sentence(seq, maxlen):
    # Pads seq and slides windows
    x = []
    y = []
    for i, w in enumerate(seq):
        x_padded = pad_sequences([seq[:i]],
                                 maxlen=maxlen - 1,
                                 padding='pre')[0]  # Pads before each sequence
        x.append(x_padded)
        y.append(w)
    return x, y

# Pad sequences and slide windows
maxlen = max([len(seq) for seq in seqs])
x = []
y = []
for seq in seqs:
    x_windows, y_windows = prepare_sentence(seq, maxlen)
    x += x_windows
    y += y_windows
x = np.array(x)
y = np.array(y) - 1  # The word <PAD> does not constitute a class
y = np.eye(len(vocab))[y]  # One hot encoding

I decided to slide windows separately within each verse, but this could be done differently.
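
To make the windowing concrete, here is a small illustrative check (not part of the original code) that decodes the (context, target) pairs generated for the first verse:

# Illustration: decode the (context, target) training pairs for the first verse
inv = {v: k for k, v in vocab.items()}  # index -> word
x_demo, y_demo = prepare_sentence(seqs[0], maxlen)
for ctx, target in zip(x_demo, y_demo):
    print([inv[w] for w in ctx if w != 0], '->', inv[target])
# Expected pattern of output (tokens are lowercased by the tokenizer):
# [] -> two
# ['two'] -> little
# ['two', 'little'] -> dicky
# ['two', 'little', 'dicky'] -> birds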

Next, we define and train a simple LSTM model with Keras. The model consists of an embedding layer, an LSTM layer and a dense layer with a softmax activation (which uses the output at the last timestep of the LSTM to produce the probability of each word in the vocabulary given the context):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Define model
model = Sequential()
model.add(Embedding(input_dim=len(vocab) + 1,  # vocabulary size. Adding an
                                               # extra element for <PAD> word
                    output_dim=5,  # size of embeddings
                    input_length=maxlen - 1))  # length of the padded sequences
model.add(LSTM(10))
model.add(Dense(len(vocab), activation='softmax'))
model.compile('rmsprop', 'categorical_crossentropy')

# Train network
model.fit(x, y, epochs=1000)

The joint probability P(w_1, ..., w_n) of occurrence of the sentence w_1 ... w_n can be computed using the chain rule of conditional probability:

P(w_1, ..., w_n) = P(w_1) * P(w_2|w_1) * ... * P(w_n|w_{n-1}, ..., w_1)

where each of these conditional probabilities is given by the LSTM model. Note that they can be very small, so it is sensible to work in log space to avoid numerical instability issues. Putting it all together:

# Compute probability of occurence of a sentence
sentence = "One called Peter,"
tok = tokenizer.texts_to_sequences([sentence])[0]
x_test, y_test = prepare_sentence(tok, maxlen)
x_test = np.array(x_test)
y_test = np.array(y_test) - 1  # The word <PAD> does not constitute a class
p_pred = model.predict(x_test)  # array of conditional probabilities
vocab_inv = {v: k for k, v in vocab.items()}  # Inverse vocabulary: index -> word

# Compute product
# Efficient version: np.exp(np.sum(np.log(np.diag(p_pred[:, y_test]))))
log_p_sentence = 0
for i, prob in enumerate(p_pred):
    word = vocab_inv[y_test[i]+1]  # Index 0 from vocab is reserved to <PAD>
    history = ' '.join([vocab_inv[w] for w in x_test[i, :] if w != 0])
    prob_word = prob[y_test[i]]
    log_p_sentence += np.log(prob_word)
    print('P(w={}|h={})={}'.format(word, history, prob_word))
print('Prob. sentence: {}'.format(np.exp(log_p_sentence)))

Note: this is a very small toy dataset and we might be overfitting.
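
If you need to score several sentences, the loop above can be wrapped into a small helper. This is just a minimal sketch based on the code above; the name sentence_logprob and the vectorized indexing are additions for illustration, not part of the original answer:

# Minimal sketch: reusable helper that returns the log-probability of a sentence
def sentence_logprob(sentence):
    seq = tokenizer.texts_to_sequences([sentence])[0]
    x_s, y_s = prepare_sentence(seq, maxlen)
    x_s = np.array(x_s)
    y_s = np.array(y_s) - 1  # shift indices: 0 is reserved for <PAD>
    probs = model.predict(x_s)
    # Pick the predicted probability of each actual next word and sum the logs
    return np.sum(np.log(probs[np.arange(len(y_s)), y_s]))

print(np.exp(sentence_logprob("One called Peter,")))  # should match the value printed above
print(np.exp(sentence_logprob("Fly away, Paul!")))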

  • `y_i` is the word that comes after the sentence `x_i`. Therefore, I use the (`x_i`, `y_i`) pairs to train the LSTM model, because I want to model the probability P(`y_i` | `x_i`). (2 upvotes)