Recorded audio of a single note produces multiple onset times

Question

pav*_*163 8 python signal-processing pitch-tracking librosa onset-detection

I am using the Librosa library for pitch and onset detection. Specifically, I am using onset_detect and piptrack.

This is my code:

import librosa
from scipy import signal

def detect_pitch(y, sr, onset_offset=5, fmin=75, fmax=1400):
  y = highpass_filter(y, sr)

  onset_frames = librosa.onset.onset_detect(y=y, sr=sr)
  pitches, magnitudes = librosa.piptrack(y=y, sr=sr, fmin=fmin, fmax=fmax)

  notes = []

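  n_frames = pitches.shape[1]
  for frame in onset_frames:
    # Look a few frames past the onset, where the pitch has settled,
    # without running past the last analysis frame
    onset = min(frame + onset_offset, n_frames - 1)
    # Pick the frequency bin with the largest magnitude in that frame
    index = magnitudes[:, onset].argmax()
    pitch = pitches[index, onset]
    if pitch != 0:
      notes.append(librosa.hz_to_note(pitch))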

  return notes

def highpass_filter(y, sr):
  filter_stop_freq = 70  # Hz
  filter_pass_freq = 100  # Hz
  filter_order = 1001

  # High-pass filter
  nyquist_rate = sr / 2.
  desired = (0, 0, 1, 1)
  bands = (0, filter_stop_freq, filter_pass_freq, nyquist_rate)
  # The nyq argument is gone from recent SciPy releases; fs=sr is equivalent
  filter_coefs = signal.firls(filter_order, bands, desired, fs=sr)

  # Apply high-pass filter
  filtered_audio = signal.filtfilt(filter_coefs, [1], y)
  return filtered_audio


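When I run this on guitar audio samples recorded in a studio, and therefore free of noise (like this one), I get very good results from both functions. The onset times are correct and the frequencies are almost always correct (with an occasional octave error).

However, a big problem arises when I try to record my own guitar with a cheap microphone. I get audio files containing noise, such as this one. The onset_detect algorithm gets confused and thinks the noise contains onsets. As a result, I get very bad results: many onset times are detected even when the audio file contains a single note.

Here are two waveforms. The first is a guitar sample of a B3 note recorded in a studio, while the second is my own recording of an E2 note.

The result for the first is a correct B3 (one onset time was detected). The result for the second is an array of 7 elements, meaning that 7 onset times were detected instead of 1! One of those elements is the correct onset time; the others are just random peaks in the noisy part.

Another example is this audio file containing the notes B3, C4, D4, E4:

As you can see, the noise is clearly visible and my high-pass filter did not help (this is the waveform after applying the filter).

I assume this is a noise problem, since that is the difference between those files. If so, what can I do to reduce it? I tried a high-pass filter, but nothing changed.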

Answer 1

sta*_*yra 5

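I have three observations to share.

First, after some experimentation, I concluded that the onset detection algorithm appears to be designed to rescale its own operation automatically so as to account for the local background noise level at any given moment. This is presumably so that it can detect onsets in pianissimo passages with the same likelihood as in fortissimo passages. The unfortunate consequence is that the algorithm tends to trigger on the background noise coming from a cheap microphone: the onset detector honestly believes it is just listening to pianissimo music.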
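To illustrate the first observation, here is a minimal sketch of a locally adaptive peak picker. It is not librosa's actual implementation; the function name `adaptive_peak_pick` and the parameters `half_window` and `ratio` are hypothetical. It only shows why a detector that thresholds relative to the local mean fires on background noise just as readily at low amplitude as at high amplitude.

```python
import numpy as np

def adaptive_peak_pick(env, half_window=5, ratio=1.5):
    """Return indices that are local maxima and exceed `ratio` times the
    mean of the surrounding window (hypothetical helper, not a librosa API)."""
    env = np.asarray(env, dtype=float)
    peaks = []
    for i in range(len(env)):
        lo = max(0, i - half_window)
        hi = min(len(env), i + half_window + 1)
        window = env[lo:hi]
        # The threshold scales with the local level, so it adapts to the noise
        if env[i] == window.max() and env[i] > ratio * window.mean():
            peaks.append(i)
    return np.array(peaks, dtype=int)

# Synthetic stand-in for the envelope of pure background noise (an assumption)
rng = np.random.default_rng(0)
noise_env = np.abs(rng.normal(size=200))

loud_peaks = adaptive_peak_pick(noise_env)
# Same noise, about 1000 times quieter
quiet_peaks = adaptive_peak_pick(noise_env * 2.0 ** -10)

print(len(loud_peaks), len(quiet_peaks))
```

Because every comparison is relative to the local level, scaling the noise down does not change which frames are picked. An absolute energy floor is one way to break that invariance.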

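The second observation is that roughly the first 2200 samples of your recording (about the first 0.1 seconds) are somewhat anomalous, because the noise really is almost zero during that brief initial interval. Try zooming in on the waveform at its start and you will see what I mean. Unfortunately, the guitar pluck begins so soon after the noise starts (at around sample 3000) that the algorithm cannot resolve the two separately; instead it merges them into a single onset event that begins about 0.1 seconds too early. I therefore cut off roughly the first 2240 samples to "normalize" the file (I don't think this is cheating, though; it'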

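My third observation is that frequency-based filtering only works if the noise and the music actually occupy somewhat different frequency bands. That may be true in this case, but I don't think you have demonstrated it yet. So instead of frequency-based filtering, I chose to try a different approach: thresholding. I used the final 3 seconds of your recording (where no guitar is playing) to estimate the typical background noise level over the whole recording, in units of RMS energy, and then used that median to set a minimum energy threshold calculated to lie safely above it. Only onset events returned by the detector at times when the RMS energy is above the threshold are accepted as "valid".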
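Whether the noise and the music really occupy different bands can be checked before filtering. The sketch below is one possible way to do that, assuming a noise-only excerpt is available; `band_energy_fraction` is a hypothetical helper, and the synthetic white noise stands in for the real recording. It measures how much of the noise energy lies above the 100 Hz pass frequency used in the question's high-pass filter.

```python
import numpy as np

def band_energy_fraction(x, sr, f_lo, f_hi):
    """Fraction of the spectral energy of x that lies in [f_lo, f_hi) Hz
    (hypothetical helper for illustration)."""
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    in_band = (freqs >= f_lo) & (freqs < f_hi)
    total = spectrum.sum()
    return float(spectrum[in_band].sum() / total) if total > 0 else 0.0

# Synthetic broadband hiss as a stand-in for a noise-only excerpt (assumption)
sr = 22050
rng = np.random.default_rng(0)
noise_only = 0.05 * rng.normal(size=sr)

# Share of the noise energy that a 100 Hz high-pass filter would let through
noise_above_cutoff = band_energy_fraction(noise_only, sr, 100, sr / 2)
print(round(noise_above_cutoff, 3))
```

If most of the noise energy sits above the cutoff, as it does for broadband hiss, a high-pass filter at that cutoff cannot remove it, which is consistent with the filter making no visible difference.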

An example script is shown below:

import librosa
import numpy as np
import matplotlib.pyplot as plt

# I played around with this but ultimately kept the default value
hoplen=512

y, sr = librosa.load("./Vocaroo_s07Dx8dWGAR0.mp3")
# Note that the first ~2240 samples (0.1 seconds) are anomalously low noise,
# so cut out this section from processing
start = 2240
y = y[start:]
idx = np.arange(len(y))

# Calculate the onset frames in the usual way
onset_frames = librosa.onset.onset_detect(y=y, sr=sr, hop_length=hoplen)
onstm = librosa.frames_to_time(onset_frames, sr=sr, hop_length=hoplen)

# Calculate RMS energy per frame.  I shortened the frame length from the
# default value in order to avoid ending up with too much smoothing.
# (librosa.feature.rmse was renamed to librosa.feature.rms in librosa 0.7)
rmse = librosa.feature.rms(y=y, frame_length=512, hop_length=hoplen)[0]
envtm = librosa.frames_to_time(np.arange(len(rmse)), sr=sr, hop_length=hoplen)
# Use final 3 seconds of recording in order to estimate median noise level
# and typical variation
# Boolean mask; wrapping it in a list breaks indexing on current NumPy
noiseidx = envtm > envtm[-1] - 3.0
noisemedian = np.percentile(rmse[noiseidx], 50)
sigma = np.percentile(rmse[noiseidx], 84.1) - noisemedian
# Set the minimum RMS energy threshold that is needed in order to declare
# an "onset" event to be equal to 5 sigma above the median
threshold = noisemedian + 5*sigma
threshidx = rmse > threshold
# Choose the corrected onset times as only those which meet the RMS energy
# minimum threshold requirement
correctedonstm = onstm[np.isin(onstm, envtm[threshidx])]

# Print both in units of actual time (seconds) and sample ID number
print(correctedonstm+start/sr)
print(correctedonstm*sr+start)

fg = plt.figure(figsize=[12, 8])

# Plot the waveform together with onset times superimposed in red
ax1 = fg.add_subplot(2,1,1)
ax1.plot(idx+start, y)
for ii in correctedonstm*sr+start:
    ax1.axvline(ii, color='r')
ax1.set_ylabel('Amplitude', fontsize=16)

# Plot the RMSE together with onset times superimposed in red
ax2 = fg.add_subplot(2,1,2, sharex=ax1)
ax2.plot(envtm*sr+start, rmse)
for ii in correctedonstm*sr+start:
    ax2.axvline(ii, color='r')
# Plot threshold value superimposed as a black dotted line
ax2.axhline(threshold, linestyle=':', color='k')
ax2.set_ylabel("RMSE", fontsize=16)
ax2.set_xlabel("Sample Number", fontsize=16)

fg.show()


The printed output looks like this:

In [1]: %run rosatest
[ 0.17124717  1.88952381  3.74712018  5.62793651]
[   3776.   41664.   82624.  124096.]


The plot it produces looks like this:

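I would suggest splitting the recording into a large number of frames (at least a few hundred, or even a few thousand) and computing the RMSE of each frame. Then select the 1%, 2% or 3% of frames with the lowest RMSE and assume that many of them are silent (any that are not silent are at least the quietest ones). Use those frames to estimate your threshold. If the assumption is wrong and less than x% of the recording is silence, this may cause up to x% of the quietest onsets to be wrongly filtered out as noise, but at least you get the right result the remaining (100-x)% of the time. (2 upvotes)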
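The suggestion in this comment could be sketched roughly as follows. This is only one reading of it: `noise_threshold_from_quietest`, `quiet_fraction` and `margin` are hypothetical names, and the multiplicative margin is an assumption, chosen because the quietest frames are the low tail of the noise distribution and their spread alone would underestimate the noise level. The test signal is synthetic.

```python
import numpy as np

def noise_threshold_from_quietest(y, frame_length=512, quiet_fraction=0.02, margin=3.0):
    """Estimate an RMS threshold from the quietest frames of a recording.

    Returns (threshold, rms), where rms holds one value per non-overlapping
    frame. Hypothetical helper; `margin` is an assumed safety factor.
    """
    y = np.asarray(y, dtype=float)
    n_frames = len(y) // frame_length
    frames = y[: n_frames * frame_length].reshape(n_frames, frame_length)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # Keep the quietest few percent of frames and assume they are mostly silence
    n_quiet = max(1, int(round(quiet_fraction * n_frames)))
    quietest = np.sort(rms)[:n_quiet]
    threshold = margin * np.median(quietest)
    return threshold, rms

# Synthetic recording: low-level hiss with a loud 110 Hz tone in the middle
sr = 22050
frame_length = 512
rng = np.random.default_rng(0)
y = 0.01 * rng.normal(size=400 * frame_length)
t = np.arange(200 * frame_length) / sr
y[100 * frame_length : 300 * frame_length] += 0.5 * np.sin(2 * np.pi * 110.0 * t)

threshold, rms = noise_threshold_from_quietest(y, frame_length=frame_length)
is_note = rms > threshold  # frames loud enough to accept an onset in
print(int(is_note.sum()))
```

Unlike the script above, this does not require knowing in advance which part of the recording is silent, at the cost of the failure mode the comment describes when less than `quiet_fraction` of the recording is silence.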
