Understanding CTC loss for speech recognition in Keras

Question

Bap*_*ier 6 python deep-learning keras tensorflow ctc

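I am trying to understand how CTC loss works for speech recognition and how it can be implemented in Keras.

What I think I understand (please correct me if I'm wrong!)

Broadly, CTC loss is added on top of a classic network so that sequential information is decoded element by element (letter by letter, for text or speech) rather than decoding a whole block of elements (a word, for example) directly.

Let's say we feed in utterances of some sentences as MFCCs.

The goal of using CTC loss is to learn how to match each letter to the MFCC frames at each time step. The Dense+softmax output layer therefore has as many neurons as there are elements needed to compose the sentences:

letters (a, b, ..., z)
a blank token (-)
a space (_) and an end character (>)

The softmax layer then has 29 neurons (26 for the alphabet plus 3 special characters).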
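
As a rough illustration of this symbol set and of how per-time-step predictions turn into text, here is a minimal pure-Python sketch. The symbol order and the names `ALPHABET`, `BLANK_INDEX` and `greedy_ctc_collapse` are assumptions made for illustration; the blank is placed last because, as far as I know, `K.ctc_batch_cost` treats the last class index as the blank.

```python
import string

# Assumed symbol order: 26 letters, space (_), end character (>), then blank (-).
# The blank is put last on the assumption that the loss reserves the last class.
ALPHABET = list(string.ascii_lowercase) + ["_", ">", "-"]
BLANK_INDEX = len(ALPHABET) - 1
ALPHABET_LENGTH = len(ALPHABET)  # 29 output neurons


def greedy_ctc_collapse(frame_indices):
    """Merge consecutive repeats, then drop blanks (best-path decoding)."""
    decoded = []
    previous = None
    for index in frame_indices:
        if index != previous and index != BLANK_INDEX:
            decoded.append(ALPHABET[index])
        previous = index
    return "".join(decoded)


# Hypothetical per-frame argmax output for the word "hello".
frames = [ALPHABET.index(c) for c in "hh-e-ll-lo-"]
print(greedy_ctc_collapse(frames))  # hello
```

The blank between the two "l" runs is what lets a double letter survive the merge step.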

To implement it, I found that I can do something like this:

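# CTC implementation from the Keras example at
# https://github.com/keras-team/keras/blob/master/examples/image_ocr.py
from keras import backend as K
from keras.layers import (Input, Dense, LSTM, Bidirectional,
                          TimeDistributed, Lambda)
from keras.models import Model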

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # the 2 is critical here since the first couple outputs of the RNN
    # tend to be garbage:
    # print "y_pred_shape: ", y_pred.shape
    y_pred = y_pred[:, 2:, :]
    # print "y_pred_shape: ", y_pred.shape
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)



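# let's say each MFCC is (1000 time steps x 20 features)
input_data = Input(shape=(1000, 20), name='input_data')
# Extra inputs needed only to compute the loss inside the graph
# (assumed here, following the image_ocr.py example).
y_true = Input(shape=(None,), name='y_true')
input_length = Input(shape=(1,), dtype='int64', name='input_length')
label_length = Input(shape=(1,), dtype='int64', name='label_length')

x = Bidirectional(LSTM(..., return_sequences=True))(input_data)
x = Bidirectional(LSTM(..., return_sequences=True))(x)
y_pred = TimeDistributed(Dense(units=ALPHABET_LENGTH, activation='softmax'))(x)

loss_out = Lambda(function=ctc_lambda_func, name='ctc', output_shape=(1,))(
    [y_pred, y_true, input_length, label_length])

model = Model(inputs=[input_data, y_true, input_length, label_length],
              outputs=loss_out)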

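ALPHABET_LENGTH = 29 (alphabet length + special characters)

and:

y_true: tensor (samples, max_string_length) containing the ground-truth labels.
y_pred: tensor (samples, time_steps, num_categories) containing the prediction, i.e. the output of the softmax.
input_length: tensor (samples, 1) containing the sequence length for each batch item in y_pred.
label_length: tensor (samples, 1) containing the sequence length for each batch item in y_true.

(source)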
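
To make the two length tensors concrete, here is a small sketch in plain Python. It assumes the setup from the snippet above (1000 frames per utterance, the first 2 frames dropped inside `ctc_lambda_func`); the helper names `usable_input_length` and `ctc_min_frames` are hypothetical. It also checks the usual CTC constraint that an utterance needs at least one frame per label, plus one extra frame for every pair of identical adjacent labels.

```python
TIME_STEPS = 1000      # frames per utterance, as in Input(shape=(1000, 20))
DROPPED_FRAMES = 2     # ctc_lambda_func slices y_pred[:, 2:, :]


def usable_input_length(num_frames, dropped=DROPPED_FRAMES):
    """Frames of y_pred that actually reach the CTC loss for one sample."""
    return num_frames - dropped


def ctc_min_frames(label):
    """Fewest frames CTC needs: one per label, plus a blank between repeats."""
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats


# "hello" and "hi" as character indices (a=0, b=1, ...); values are examples.
labels = [[7, 4, 11, 11, 14], [7, 8]]
label_length = [[len(label)] for label in labels]          # shape (samples, 1)
input_length = [[usable_input_length(TIME_STEPS)] for _ in labels]

for label, (n_frames,) in zip(labels, input_length):
    assert n_frames >= ctc_min_frames(label)

print(label_length)   # [[5], [2]]
print(input_length)   # [[998], [998]]
```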

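Now, I am facing some problems:

What I don't understand
- Is this implementation the right way to code and use CTC loss?
- I don't understand what y_true, input_length and label_length concretely are. Any examples?
- In what form should I give the labels to the network? Again, any examples?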

Answer 1

Dan*_*ler 10

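What are these?

y_true: your ground-truth data, i.e. the data you will compare with the model's outputs during training. (y_pred, on the other hand, is the model's computed output.)
input_length: the length (in steps, or characters in this case) of each sample (sentence) in the y_pred tensor (as stated in the function's documentation).
label_length: the length (in steps, or characters in this case) of each sample (sentence) in the y_true (labels) tensor.

It seems this loss expects your model's outputs (y_pred) to have varying lengths, and likewise your ground-truth data (y_true). This is probably to avoid computing the loss on garbage characters after the end of a sentence (since you need a fixed-size tensor to process many sentences at once).

Form of the labels:

Since the function's documentation asks for shape (samples, length), the format is the character index of each character in each sentence.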
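
For instance, the label tensor for a small batch could be built as follows. This is only a sketch: the character-to-index mapping, the appended end character, the padding value 0 and the helper name `encode_batch` are all assumptions; entries beyond each `label_length` are padding that the loss is not supposed to read.

```python
import string

# Assumed mapping: a..z -> 0..25, "_" (space) -> 26, ">" (end) -> 27.
CHAR_TO_INDEX = {c: i for i, c in enumerate(string.ascii_lowercase + "_>")}


def encode_batch(sentences, pad_value=0):
    """Return (y_true, label_length) as nested lists for a batch of strings."""
    encoded = [[CHAR_TO_INDEX[c] for c in s.replace(" ", "_") + ">"]
               for s in sentences]
    max_len = max(len(row) for row in encoded)
    # Right-pad every row to the longest one so the batch is rectangular.
    y_true = [row + [pad_value] * (max_len - len(row)) for row in encoded]
    label_length = [[len(row)] for row in encoded]
    return y_true, label_length


y_true, label_length = encode_batch(["hi you", "ok"])
print(y_true)        # shape (samples, max_string_length)
print(label_length)  # [[7], [3]]
```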

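How to use this?

There are some possibilities.

1 - If you don't care about lengths:

If all lengths are the same, you can easily use it as a regular loss: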

def ctc_loss(y_true, y_pred):

    return K.ctc_batch_cost(y_true, y_pred, input_length, label_length)
    #where input_length and label_length are constants you created previously
    #the easiest way here is to have a fixed batch size in training 
    #the lengths should have the same batch size (see shapes in the link for ctc_cost)    

model.compile(loss=ctc_loss, ...)   

#here is how you pass the labels for training
model.fit(input_data_X_train, ground_truth_data_Y_train, ....)

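2 - If you care about the lengths.

This is a little more complicated: you need your model to somehow tell you the length of each output sentence.
There are, again, several creative ways of doing this:

Have an "end_of_sentence" character and detect where it is in the sentence.
Have a branch of your model calculate this number and round it to an integer.
(Hardcore) If you are using a stateful manual training loop, get the index of the iteration at which you decided to finish a sentence.

I like the first idea, and will exemplify it here.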

def ctc_find_eos(y_true, y_pred):

    #convert y_pred from one-hot to label indices
    y_pred_ind = K.argmax(y_pred, axis=-1)

    #to make sure y_pred has one end_of_sentence (to avoid errors)
    y_pred_end = K.concatenate([
                                  y_pred_ind[:,:-1], 
                                  eos_index * K.ones_like(y_pred_ind[:,-1:])
                               ], axis = 1)

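    #to make sure the first occurrence of the char is more important than subsequent ones
    #(step=-1 is required so the range counts down from max_length to 1)
    occurrence_weights = K.arange(start=max_length, stop=0, step=-1, dtype=K.floatx())

    #is eos? (K.cast works on tensors; K.cast_to_floatx is meant for NumPy arrays)
    is_eos_true = K.cast(K.equal(y_true, eos_index), K.floatx())
    is_eos_pred = K.cast(K.equal(y_pred_end, eos_index), K.floatx())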

    #lengths
    true_lengths = 1 + K.argmax(occurrence_weights * is_eos_true, axis=1)
    pred_lengths = 1 + K.argmax(occurrence_weights * is_eos_pred, axis=1)

    #reshape
    true_lengths = K.reshape(true_lengths, (-1,1))
    pred_lengths = K.reshape(pred_lengths, (-1,1))

    return K.ctc_batch_cost(y_true, y_pred, pred_lengths, true_lengths)

model.compile(loss=ctc_find_eos, ....)
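
The length computation in `ctc_find_eos` relies on one trick: multiplying the end-of-sentence mask by strictly decreasing weights makes `argmax` return the first occurrence. Here is a plain-Python sketch of the same arithmetic, with `eos_index = 27` and the sample row chosen arbitrarily for illustration (the helper name `first_eos_length` is hypothetical):

```python
def first_eos_length(row, eos_index):
    """Length up to and including the first EOS, via the weighted-argmax trick."""
    max_length = len(row)
    # Weights max_length, max_length - 1, ..., 1: earlier positions weigh more.
    weights = range(max_length, 0, -1)
    is_eos = [1.0 if idx == eos_index else 0.0 for idx in row]
    scores = [w * m for w, m in zip(weights, is_eos)]
    first_eos = scores.index(max(scores))  # argmax, first maximum wins
    return 1 + first_eos


eos_index = 27  # assumed index of the end-of-sentence character
row = [7, 8, 27, 3, 27, 27]
print(first_eos_length(row, eos_index))  # 3
```

If a row contains no EOS at all, every score is 0 and the result is 1, which is why the function above forces an EOS into the last predicted step.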

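If you use the other option, use a model branch to calculate the lengths, concatenate these lengths to the first or last step of the output, and make sure you do the same with the true lengths in your ground-truth data. Then, in the loss function, just take the section that holds the lengths: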

def ctc_concatenated_length(y_true, y_pred):

    #assuming you concatenated the length in the first step
    true_lengths = y_true[:,:1] #may need to cast to int
    y_true = y_true[:, 1:]

    #since y_pred uses one-hot, you will need to concatenate to full size of the last axis, 
    #thus the 0 here
    pred_lengths = K.cast(y_pred[:, :1, 0], "int32")
    y_pred = y_pred[:, 1:]

    return K.ctc_batch_cost(y_true, y_pred, pred_lengths, true_lengths)
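
On the data side, the ground truth for this variant would carry its length in the first column. A minimal sketch with plain lists, where the helper names `pack_length` and `unpack_length` are hypothetical and the label values are arbitrary:

```python
def pack_length(labels):
    """Prepend len(labels) so that row[0] is the length and row[1:] the labels."""
    return [len(labels)] + list(labels)


def unpack_length(row):
    """Inverse of pack_length: mirrors y_true[:, :1] and y_true[:, 1:]."""
    return row[0], row[1:]


packed = pack_length([7, 8, 27])
print(packed)                 # [3, 7, 8, 27]
print(unpack_length(packed))  # (3, [7, 8, 27])
```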
