Transforming MFCC spectrogram input for a CNN (audio recognition)

Question

Rub*_*rto 5 python speech-recognition conv-neural-network keras tensorflow

I have an audio dataset, and I have converted the audio clips into MFCC plots like this one:

Now I want to feed them to my neural network:

import tensorflow as tf
import tensorflow.keras as tfk
import tensorflow.keras.layers as tfkl

cnn_model = tfk.Sequential(name='CNN_model')
cnn_model.add(tfkl.Conv1D(filters= 225, kernel_size= 11, padding='same', activation='relu', input_shape=(4500,9000, 3)))
cnn_model.add(tfkl.BatchNormalization())
cnn_model.add(tfkl.Bidirectional(tfkl.GRU(200, activation='relu', return_sequences=True, implementation=0)))
cnn_model.add(tfkl.Dropout(0.2))
cnn_model.add(tfkl.BatchNormalization())
cnn_model.add(tfkl.TimeDistributed(tfkl.Dense(20)))
cnn_model.add(tfkl.Dropout(0.2))
cnn_model.add(tfkl.Softmax())
cnn_model.compile(loss='mae', optimizer='Adam', metrics=['mae'])

cnn_model.summary()

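I used Conv1D because it is the layer used in this kind of neural network. But I don't know how to turn the data from an image into the input of the CNN. I have tried several transformations on my own, without success.

As you can see in the picture below, I need to feed the first layer, Conv1D, but I can't, because the shape of my image is (4500, 9000, 3). So basically, what I want to do is transform this image into Conv1D input in the same way as in the picture below.

This image represents one audio clip passed to the NN.

Obviously, when I pass an image with this shape to the Conv1D layer, I get a ValueError: Input 0 of layer conv1d_4 is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: [None, 4500, 9000, 3]

I converted the image to grayscale, but that is not the way to go, because I lose valuable information.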

Answer 1

thu*_*v89 5

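I feel that you are not treating this as a typical speech recognition problem, because I found several odd choices in your approach.

Problems I noticed

The output shape of the MFCC operation

If you look at librosa.feature.mfcc, this is what it says,

Returns: M : np.ndarray [shape=(n_mfcc, t)]

As you can see, there are no channels here. There is an input dimension (n_mfcc) and a time dimension (t). So you should be able to use Conv1D directly on the MFCC array, without going through an image at all.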
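To make the shape argument concrete, here is a minimal NumPy sketch. The sizes are illustrative, and a random array stands in for the output of librosa.feature.mfcc, so treat it as an assumption-laden illustration rather than the answerer's code. It shows why a batched MFCC array has ndim=3 (what Conv1D expects) while a batched RGB plot has ndim=4 (the source of the ValueError).

```python
import numpy as np

n_mfcc, t = 20, 44  # illustrative sizes, not taken from the question

# Stand-in for the (n_mfcc, t) array returned by librosa.feature.mfcc
mfcc = np.random.default_rng(0).normal(size=(n_mfcc, t))

# Conv1D expects (batch, steps, features): put time first, then add a batch axis
x = mfcc.T[np.newaxis, ...]
print(x.shape, x.ndim)  # (1, 44, 20) 3

# A plot saved as an RGB image gains a fourth axis once it is batched
image = np.zeros((45, 90, 3))  # scaled-down stand-in for (4500, 9000, 3)
batched_image = image[np.newaxis, ...]
print(batched_image.ndim)  # 4
```

In other words, the MFCC coefficients play the role that channels play for an image, so no separate channel axis is needed.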

Dropout before Softmax

This is what the tail of your model looks like,

cnn_model.add(tfkl.TimeDistributed(tfkl.Dense(20)))
cnn_model.add(tfkl.Dropout(0.2))
cnn_model.add(tfkl.Softmax())

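Personally, I haven't seen anyone use dropout at the last layer, so I would get rid of it. Dropout randomly switches neurons off, but you want all output nodes switched on at all times.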
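A small NumPy sketch of the effect, assuming the usual inverted-dropout formulation (zero a unit with probability `rate`, scale the survivors by 1/(1 - rate)). The logits are made up for illustration. Any class whose logit is dropped gets reset to 0 before the softmax, so the predicted distribution is distorted at random during training.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dropout(x, rate, rng):
    # Inverted dropout: zero each unit with probability `rate`, rescale the rest
    keep = rng.random(x.shape) >= rate
    return np.where(keep, x / (1.0 - rate), 0.0), keep

rng = np.random.default_rng(0)
# Hypothetical logits for 5 time steps and 4 classes
logits = rng.normal(scale=3.0, size=(5, 4))
dropped, keep = dropout(logits, rate=0.2, rng=rng)

print(softmax(logits).round(3))   # distribution without dropout
print(softmax(dropped).round(3))  # distribution after dropout on the logits
```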

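Loss function

Typically, CTC is used to optimize speech recognition models. I (personally) haven't seen anyone use mae as the loss for a speech model. That is because your input data and label data usually have time dimensions that are not aligned, which means there is not always a label corresponding to each time step of the prediction. This is where the CTC loss shines. It is probably what you want to use for this model (unless you are 100% sure that there is a label for each prediction and that they are perfectly aligned).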
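To illustrate how CTC copes with unaligned labels, here is a brute-force sketch of the CTC likelihood for a toy case. It sums the probability of every frame-level path that collapses to the target (merge consecutive repeats, then drop blanks). The symbol set and the uniform probabilities are assumptions chosen for illustration; real implementations use dynamic programming instead of enumeration.

```python
import itertools
import numpy as np

BLANK = "-"

def collapse(path):
    # CTC rule: merge consecutive repeats, then drop blanks
    merged = [s for s, _ in itertools.groupby(path)]
    return "".join(s for s in merged if s != BLANK)

def ctc_likelihood(probs, symbols, target):
    # probs: (timesteps, num_symbols) per-frame probabilities
    timesteps = probs.shape[0]
    total = 0.0
    for path in itertools.product(range(len(symbols)), repeat=timesteps):
        if collapse([symbols[i] for i in path]) == target:
            total += np.prod([probs[t, i] for t, i in enumerate(path)])
    return total

symbols = ["a", "b", BLANK]
probs = np.full((3, 3), 1.0 / 3.0)  # uniform over 3 symbols for 3 frames
p = ctc_likelihood(probs, symbols, "a")
print(p, -np.log(p))  # likelihood of "a" and the corresponding CTC loss
```

Six of the 27 possible paths (for example "aaa", "a--" and "-a-") collapse to "a", so a one-character label can be matched against three frames without any frame-by-frame alignment.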

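That said, the loss depends on the problem you are solving. I will provide an example showing how to use this loss for this problem.

A working example

Dataset

To show a working example, I will use this speech dataset. I chose it because the problem is simple, so I can get good results quickly.

Input: audio
Output: labels 0-9

MFCC transformation

Then you can compute the MFCC of the audio files, and you will get the following heatmap. As I said earlier, this will be a 2D array of size (n_mfcc, timesteps). With the batch dimension, it becomes (batch size, n_mfcc, timesteps).

Here is how to visualize the above. Here, y is the audio loaded with the librosa.core.load() function.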

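import librosa
import librosa.display
import matplotlib.pyplot as plt

# audios[aid][1] appears to hold the (waveform, sample rate) pair for one clip
y = audios[aid][1][0]
sr = audios[aid][1][1]
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
print(mfcc.shape)  # (n_mfcc, timesteps)

plt.figure(figsize=(6, 4))
librosa.display.specshow(mfcc, x_axis='time')
plt.colorbar()
plt.title('MFCC')
plt.tight_layout()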

Creating train/test data

Next you can create the training and test data. Here is what I created.

train_data - an array of size (sample size, timesteps, n_mfcc)
train_labels - an array of size (sample size, timesteps, num_classes)
train_inp_lengths - an array of size (sample size,) (for the CTC loss)
train_seq_lengths - an array of size (sample size,) (for the CTC loss)
test_data - an array of size (sample size, timesteps, n_mfcc)
test_labels - an array of size (sample size, timesteps, num_classes+1)
test_inp_lengths - an array of size (sample size,) (for the CTC loss)
test_seq_lengths - an array of size (sample size,) (for the CTC loss)
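The answer does not show how clips of different durations end up with a fixed number of time steps. One common approach, sketched below purely as an assumption, is to zero-pad or truncate each (time, n_mfcc) sequence and record its original length. The helper name `pad_or_truncate` and the clip sizes are hypothetical. The length array gets a trailing axis here because the later code indexes `test_inp_lengths[:,0]`.

```python
import numpy as np

def pad_or_truncate(seq, timesteps):
    # seq: (time, n_mfcc); returns (timesteps, n_mfcc), zero-padded at the end,
    # together with the number of valid (unpadded) steps
    out = np.zeros((timesteps, seq.shape[1]), dtype=seq.dtype)
    n = min(timesteps, seq.shape[0])
    out[:n] = seq[:n]
    return out, n

rng = np.random.default_rng(0)
# Hypothetical clips with 7, 10 and 13 MFCC frames of 20 coefficients each
clips = [rng.normal(size=(t, 20)) for t in (7, 10, 13)]

padded, lengths = zip(*(pad_or_truncate(c, timesteps=10) for c in clips))
data = np.stack(padded)                   # (sample size, timesteps, n_mfcc)
inp_lengths = np.array(lengths)[:, None]  # (sample size, 1)
print(data.shape, inp_lengths.ravel())    # (3, 10, 20) [ 7 10 10]
```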

I am using the following mapping to convert characters to numbers

alphabet = 'abcdefghijklmnopqrstuvwxyz '
a_map = {} # map letter to number
rev_a_map = {} # map number to letter
for i, a in enumerate(alphabet):
  a_map[a] = i
  rev_a_map[i] = a

label_map = {0:'zero', 1:'one', 2:'two', 3:'three', 4:'four', 5:'five', 6:'six', 7: 'seven', 8: 'eight', 9:'nine'}

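A few things to note.

Note that the mfcc operation returns (n_mfcc, time). You have to permute the axes to get it into (time, n_mfcc) format, so that the convolution happens over the time dimension.
I also had to make sure that the labels have exactly the same number of time steps as the inputs. This is not necessary for ctc_loss, but it is a requirement enforced by the Keras model definition. It is done by appending spaces to the end of each character sequence.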
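The label padding described above could look roughly like this. It is a sketch under assumptions: the helper name `encode_label` is hypothetical, the alphabet is the one from the mapping in the answer, and 10 time steps is taken from the model input shape used later.

```python
import numpy as np

alphabet = 'abcdefghijklmnopqrstuvwxyz '
a_map = {a: i for i, a in enumerate(alphabet)}

def encode_label(word, timesteps):
    # Pad with trailing spaces so the label has as many steps as the input
    padded = word.ljust(timesteps)
    ids = np.array([a_map[c] for c in padded])
    one_hot = np.eye(len(alphabet), dtype=np.float32)[ids]  # (timesteps, 27)
    # Also return the unpadded length, which the CTC loss needs
    return one_hot, len(word)

one_hot, seq_len = encode_label('eight', timesteps=10)
print(one_hot.shape, seq_len)          # (10, 27) 5
print(one_hot.argmax(axis=-1))         # character ids, 26 is the padding space
```

Note that `str.ljust` does not truncate, so a word longer than `timesteps` would need separate handling.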

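Defining the model

I changed from the Sequential API to the functional API, because I need multiple input layers to make it work with ctc_loss. I also removed the last dropout layer.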

import tensorflow.keras.backend as K

def ctc_loss(inp_lengths, seq_lengths):
    # Wrap K.ctc_batch_cost so it fits the Keras loss(y_true, y_pred) signature
    def loss(y_true, y_pred):
        # y_true is one-hot, so argmax recovers the integer character ids
        l = tf.reduce_mean(K.ctc_batch_cost(tf.argmax(y_true, axis=-1), y_pred, inp_lengths, seq_lengths))
        return l
    return loss

K.clear_session()
inp = tfk.Input(shape=(10,50))   # (timesteps, n_mfcc)
inp_len = tfk.Input(shape=(1,))  # input length of each sample
seq_len = tfk.Input(shape=(1,))  # label length of each sample
out = tfkl.Conv1D(filters= 128, kernel_size= 5, padding='same', activation='relu')(inp)
out = tfkl.BatchNormalization()(out)
out = tfkl.Bidirectional(tfkl.GRU(128, return_sequences=True, implementation=0))(out)
out = tfkl.Dropout(0.2)(out)
out = tfkl.BatchNormalization()(out)
out = tfkl.TimeDistributed(tfkl.Dense(27, activation='softmax'))(out)
cnn_model = tfk.models.Model(inputs=[inp, inp_len, seq_len], outputs=out)
cnn_model.compile(loss=ctc_loss(inp_lengths=inp_len , seq_lengths=seq_len), optimizer='Adam', metrics=['mae'])

Training the model

Then you simply call,

cnn_model.fit([train_data, train_inp_lengths, train_seq_lengths], train_labels, batch_size=64, epochs=20)

This gives,

Train on 900 samples
Epoch 1/20
900/900 [==============================] - 3s 3ms/sample - loss: 11.4955 - mean_absolute_error: 0.0442
Epoch 2/20
900/900 [==============================] - 2s 2ms/sample - loss: 4.1317 - mean_absolute_error: 0.0340
...
Epoch 19/20
900/900 [==============================] - 2s 2ms/sample - loss: 0.1162 - mean_absolute_error: 0.0275
Epoch 20/20
900/900 [==============================] - 2s 2ms/sample - loss: 0.1012 - mean_absolute_error: 0.0277

Predicting with the model

import numpy as np

y = cnn_model.predict([test_data, test_inp_lengths, test_seq_lengths])

n_ids = 5

for pred, true in zip(y[:n_ids,:,:], test_labels[:n_ids,:,:]):
  pred_ids = np.argmax(pred,axis=-1)
  true_ids = np.argmax(true, axis=-1)
  print('pred > ',[rev_a_map[tid] for tid in pred_ids])
  print('true > ',[rev_a_map[tid] for tid in true_ids])

This gives

pred >  ['e', ' ', 'i', 'i', 'i', 'g', 'h', ' ', ' ', 't']
true >  ['e', 'i', 'g', 'h', 't', ' ', ' ', ' ', ' ', ' ']

pred >  ['o', ' ', ' ', 'n', 'e', ' ', ' ', ' ', ' ', ' ']
true >  ['o', 'n', 'e', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

pred >  ['s', 'e', ' ', ' ', ' ', ' ', ' ', ' ', 'v', 'e']
true >  ['s', 'e', 'v', 'e', 'n', ' ', ' ', ' ', ' ', ' ']

pred >  ['z', 'e', ' ', ' ', ' ', ' ', ' ', 'r', 'o', ' ']
true >  ['z', 'e', 'r', 'o', ' ', ' ', ' ', ' ', ' ', ' ']

pred >  ['n', ' ', ' ', 'i', 'i', 'n', 'e', ' ', ' ', ' ']
true >  ['n', 'i', 'n', 'e', ' ', ' ', ' ', ' ', ' ', ' ']

To get rid of the repeated letters and the spaces in between, use the ctc_decode function as follows.

y = cnn_model.predict([test_data, test_inp_lengths, test_seq_lengths])

sess = K.get_session()
pred = sess.run(tf.keras.backend.ctc_decode(y, test_inp_lengths[:,0]))

rev_a_map[-1] = '-'

for pred, true in zip(pred[0][0][:n_ids,:], test_labels[:n_ids,:,:]):
  print(pred.shape)  
  true_ids = np.argmax(true, axis=-1)
  print('pred > ',[rev_a_map[tid] for tid in pred])
  print('true > ',[rev_a_map[tid] for tid in true_ids])

This gives,

pred >  ['e', 'i', 'g', 'h', 't']
true >  ['e', 'i', 'g', 'h', 't', ' ', ' ', ' ', ' ', ' ']

pred >  ['o', 'n', 'e', '-', '-']
true >  ['o', 'n', 'e', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

pred >  ['s', 'e', 'i', 'v', 'n']
true >  ['s', 'e', 'v', 'e', 'n', ' ', ' ', ' ', ' ', ' ']

pred >  ['z', 'e', 'r', 'o', '-']
true >  ['z', 'e', 'r', 'o', ' ', ' ', ' ', ' ', ' ', ' ']

pred >  ['n', 'i', 'n', 'e', '-']
true >  ['n', 'i', 'n', 'e', ' ', ' ', ' ', ' ', ' ', ' ']

Note that I added a new label, -1. This is what the ctc_decode function adds to represent blanks.
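For readers who want to see what greedy CTC decoding does without a TensorFlow session, here is a minimal NumPy sketch. It is not the Keras implementation: it assumes the blank is the last class and pads with -1, which mirrors how the decoded output above looks. The one-hot "probabilities" and the helper name `greedy_ctc_decode` are made up for illustration.

```python
import itertools
import numpy as np

def greedy_ctc_decode(probs, blank, max_len):
    # probs: (timesteps, num_classes). Take the best class per frame,
    # merge consecutive repeats, drop blanks, then pad with -1.
    best = probs.argmax(axis=-1)
    merged = [int(k) for k, _ in itertools.groupby(best)]
    ids = [k for k in merged if k != blank]
    return ids + [-1] * (max_len - len(ids))

alphabet = 'abcdefghijklmnopqrstuvwxyz'
blank = len(alphabet)  # assumption: the blank is the last class

# Frame-level winners: "o", blank, blank, "n", "n", "e", blank
frames = [14, blank, blank, 13, 13, 4, blank]
probs = np.eye(len(alphabet) + 1)[frames]  # one-hot rows as stand-in probabilities

rev = {i: a for i, a in enumerate(alphabet)}
rev[-1] = '-'
decoded = greedy_ctc_decode(probs, blank=blank, max_len=5)
print([rev[i] for i in decoded])  # ['o', 'n', 'e', '-', '-']
```

As far as I can tell, with the 27-class alphabet used in this answer the last class (index 26, the space) is the one the Keras CTC functions treat as the blank, which would explain why the padding spaces disappear after decoding.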
