MFCC Python：与 librosa vs python_speech_features vs tensorflow.signal 完全不同的结果

Question

MFCC Python：与 librosa vs python_speech_features vs tensorflow.signal 完全不同的结果

Yil*_*ang 7 python audio mfcc tensorflow librosa

我正在尝试从音频（.wav 文件）中提取 MFCC 特征，我已经尝试过python_speech_features，librosa但它们给出了完全不同的结果：

audio, sr = librosa.load(file, sr=None)

# librosa
hop_length = int(sr/100)
n_fft = int(sr/40)
features_librosa = librosa.feature.mfcc(audio, sr, n_mfcc=13, hop_length=hop_length, n_fft=n_fft)

# psf
features_psf = mfcc(audio, sr, numcep=13, winlen=0.025, winstep=0.01)

Run Code Online (Sandbox Code Playgroud)

以下是情节：

图书馆：

python_speech_features：

我是否为这两种方法传递了错误的参数？为什么这里有这么大的差别？

更新：我也尝试过 tensorflow.signal 实现，结果如下：

情节本身更接近 librosa 的情节，但规模更接近 python_speech_features。（请注意，这里我计算了 80 个 mel bin 并取了前 13 个；如果我只使用 13 个 bin 进行计算，结果看起来也大不相同）。代码如下：

stfts = tf.signal.stft(audio, frame_length=n_fft, frame_step=hop_length, fft_length=512)
spectrograms = tf.abs(stfts)

num_spectrogram_bins = stfts.shape[-1]
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, 7600.0, 80
linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins, num_spectrogram_bins, sr, lower_edge_hertz, upper_edge_hertz)
mel_spectrograms = tf.tensordot(spectrograms, linear_to_mel_weight_matrix, 1)
mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(linear_to_mel_weight_matrix.shape[-1:]))

log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)
features_tf = tf.signal.mfccs_from_log_mel_spectrograms(log_mel_spectrograms)[..., :13]
features_tf = np.array(features_tf).T

Run Code Online (Sandbox Code Playgroud)

我想我的问题是：哪个输出更接近 MFCC 的实际外观？

Answer 1

Luk*_*ski 15

这里至少有两个因素可以解释为什么会得到不同的结果：

梅尔尺度没有单一的定义。Librosa实现两种方式：Slaney和HTK。其他包可能会使用不同的定义，导致不同的结果。话虽如此，总体情况应该是相似的。这就引出了第二个问题……
python_speech_features默认情况下，看跌期权能源作为第一个（索引零）系数（appendEnergy为True默认值），也就是说，当你问的，例如13 MFCC，你得到有效的12 + 1。

换句话说，您不是在比较 13librosa对 13python_speech_features系数，而是 13 对 12。能量可能具有不同的量级，因此由于不同的色标会产生完全不同的图像。

我现在将演示两个模块如何产生类似的结果：

import librosa
import python_speech_features
import matplotlib.pyplot as plt
from scipy.signal.windows import hann
import seaborn as sns

n_mfcc = 13
n_mels = 40
n_fft = 512 
hop_length = 160
fmin = 0
fmax = None
sr = 16000
y, sr = librosa.load(librosa.util.example_audio_file(), sr=sr, duration=5,offset=30)

mfcc_librosa = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                    n_mfcc=n_mfcc, n_mels=n_mels,
                                    hop_length=hop_length,
                                    fmin=fmin, fmax=fmax, htk=False)

mfcc_speech = python_speech_features.mfcc(signal=y, samplerate=sr, winlen=n_fft / sr, winstep=hop_length / sr,
                                          numcep=n_mfcc, nfilt=n_mels, nfft=n_fft, lowfreq=fmin, highfreq=fmax,
                                          preemph=0.0, ceplifter=0, appendEnergy=False, winfunc=hann)

Run Code Online (Sandbox Code Playgroud)

如您所见，比例不同，但整体画面看起来非常相似。请注意，我必须确保传递给模块的许多参数是相同的。

Answer 2

Nov*_*vak 9

这就是让我彻夜难眠的事情。这个答案是正确的（并且非常有用！）但并不完整，因为它不能解释两种方法之间的巨大差异。我的回答增加了一个重要的额外细节，但仍然没有实现完全匹配。

发生的事情很复杂，最好用下面的冗长代码块来解释，该代码块librosa与python_speech_features另一个包torchaudio.

首先，请注意 torchaudio 的实现有一个参数，log_mels其默认值 (False) 模仿 librosa 实现，但如果设置为 True 将模仿 python_speech_features。在这两种情况下，结果仍然不准确，但相似之处很明显。
其次，如果您深入研究 torchaudio 的实现代码，您会看到默认值不是“教科书实现”（torchaudio 的话，但我相信它们），而是为了与 Librosa 兼容而提供的注释；torchaudio 中从一个切换到另一个的关键操作是：

    mel_specgram = self.MelSpectrogram(waveform)
    if self.log_mels:
        log_offset = 1e-6
        mel_specgram = torch.log(mel_specgram + log_offset)
    else:
        mel_specgram = self.amplitude_to_DB(mel_specgram)

Run Code Online (Sandbox Code Playgroud)

第三，您会很合理地想知道是否可以强制 librosa 正确运行。答案是肯定的（或者至少“看起来像”），方法是直接获取 mel 频谱图，获取它的自然对数，然后使用它而不是原始样本作为 librosa mfcc 函数的输入。有关详细信息，请参阅下面的代码。
最后，请注意，如果您使用此代码，请务必检查查看不同功能时会发生什么。第 0 个特征仍然有严重的无法解释的偏移，更高的特征往往会相互偏离。这可能就像引擎盖下的不同实现或略微不同的数值稳定性常数一样简单，或者它可能是可以通过微调修复的东西，例如填充的选择或某处分贝转换中的参考。我真的不知道。

下面是一些示例代码：

import librosa
import python_speech_features
import matplotlib.pyplot as plt
from scipy.signal.windows import hann
import torchaudio.transforms
import torch

n_mfcc = 13
n_mels = 40
n_fft = 512 
hop_length = 160
fmin = 0
fmax = None
sr = 16000

melkwargs={"n_fft" : n_fft, "n_mels" : n_mels, "hop_length":hop_length, "f_min" : fmin, "f_max" : fmax}

y, sr = librosa.load(librosa.util.example_audio_file(), sr=sr, duration=5,offset=30)

# Default librosa with db mel scale 
mfcc_lib_db = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                    n_mfcc=n_mfcc, n_mels=n_mels,
                                    hop_length=hop_length,
                                    fmin=fmin, fmax=fmax, htk=False)

# Nearly identical to above
# mfcc_lib_db = librosa.feature.mfcc(S=librosa.power_to_db(S), n_mfcc=n_mfcc, htk=False)

# Modified librosa with log mel scale (helper)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, fmin=fmin, 
                                    fmax=fmax, hop_length=hop_length)

# Modified librosa with log mel scale
mfcc_lib_log = librosa.feature.mfcc(S=np.log(S+1e-6), n_mfcc=n_mfcc, htk=False)

# Python_speech_features 
mfcc_speech = python_speech_features.mfcc(signal=y, samplerate=sr, winlen=n_fft / sr, winstep=hop_length / sr,
                                          numcep=n_mfcc, nfilt=n_mels, nfft=n_fft, lowfreq=fmin, highfreq=fmax,
                                          preemph=0.0, ceplifter=0, appendEnergy=False, winfunc=hann)

# Torchaudio 'textbook' log mel scale 
mfcc_torch_log = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=n_mfcc, 
                                            dct_type=2, norm='ortho', log_mels=True, 
                                            melkwargs=melkwargs)(torch.from_numpy(y))

# Torchaudio 'librosa compatible' default dB mel scale 
mfcc_torch_db = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=n_mfcc, 
                                           dct_type=2, norm='ortho', log_mels=False, 
                                           melkwargs=melkwargs)(torch.from_numpy(y))

feature = 1 # <-------- Play with this!!
plt.subplot(2, 1, 1)

plt.plot(mfcc_lib_log.T[:,feature], 'k')
plt.plot(mfcc_lib_db.T[:,feature], 'b')
plt.plot(mfcc_speech[:,feature], 'r')
plt.plot(mfcc_torch_log.T[:,feature], 'c')
plt.plot(mfcc_torch_db.T[:,feature], 'g')
plt.grid()

plt.subplot(2, 2, 3)
plt.plot(mfcc_lib_log.T[:,feature], 'k')
plt.plot(mfcc_torch_log.T[:,feature], 'c')
plt.plot(mfcc_speech[:,feature], 'r')
plt.grid()

plt.subplot(2, 2, 4)
plt.plot(mfcc_lib_db.T[:,feature], 'b')
plt.plot(mfcc_torch_db.T[:,feature], 'g')
plt.grid()

Run Code Online (Sandbox Code Playgroud)

老实说，这些实现都不能令人满意：

Python_speech_features 采用了一种莫名其妙的方法，用能量代替第 0 个特征而不是用它来增强，并且没有常用的 delta 实现
默认情况下，Librosa 是非标准的，没有警告，并且缺乏明显的能量增强方法，但在库的其他地方具有非常强大的 delta 函数。
Torchaudio 可以效仿，也有一个多功能的 delta 功能，但仍然没有干净，明显的方式来获取能量。

归档时间：	5 年，9 月前
查看次数：	7326 次
最近记录：	5 年，8 月前