How to get accurate timing using a microphone in Python

Ter*_*r-5 3 python signal-processing timing detection pyaudio

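I'm trying to do beat detection with a PC microphone and then use the beat timestamps to calculate the distance between multiple successive beats. I chose Python because there is plenty of material available and it is quick to develop in. By searching the internet I came up with this simple code (no advanced peak detection or anything yet; that can come later if needed):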

import pyaudio
import struct
import math
import time


SHORT_NORMALIZE = (1.0/32768.0)


def get_rms(block):
    # RMS amplitude is defined as the square root of the
    # mean over time of the square of the amplitude.
    # so we need to convert this string of bytes into
    # a string of 16-bit samples...

    # we will get one short out for each
    # two chars in the string.
    count = len(block) // 2  # integer division: two bytes per 16-bit sample
    fmt = "%dh" % count      # renamed so the built-in format() is not shadowed
    shorts = struct.unpack(fmt, block)

    # iterate over the block.
    sum_squares = 0.0
    for sample in shorts:
        # sample is a signed short in +/- 32768.
        # normalize it to 1.0
        n = sample * SHORT_NORMALIZE
        sum_squares += n*n

    return math.sqrt(sum_squares / count)


CHUNK = 32
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100

p = pyaudio.PyAudio()

stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK)

elapsed_time = 0
prev_detect_time = 0

try:
    while True:
        data = stream.read(CHUNK)
        amplitude = get_rms(data)
        if amplitude > 0.05:  # value set by observing graphed data captured from mic
            elapsed_time = time.perf_counter() - prev_detect_time
            if elapsed_time > 0.1:  # guard against multiple spikes at beat point
                print(elapsed_time)
                prev_detect_time = time.perf_counter()
except KeyboardInterrupt:
    pass  # Ctrl+C stops the capture loop
finally:
    # release the audio device even if the loop exits with an error
    stream.stop_stream()
    stream.close()
    p.terminate()

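The code works pretty well in a quiet room, and I was quite satisfied the first couple of times I ran it, but once I checked its accuracy I was somewhat less satisfied. To test it I used two methods: a phone with a metronome set to 60 bpm (emitting tic-toc sounds into the microphone), and an Arduino hooked up to a beeper that is triggered at a 1 Hz rate by an accurate Chronodot RTC. The beeper beeps into the microphone, triggering a detection. The results from both methods look similar (the numbers are the distance between two beat detections, in seconds):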

0.9956681643835616
1.0056331689497717
0.9956100091324198
1.0058207853881278
0.9953449497716891
1.0052103013698623
1.0049350136986295
0.9859074337899543
1.004996383561644
0.9954095342465745
1.0061518904109583
0.9953025753424658
1.0051235068493156
1.0057199634703196
0.984839305936072
1.00610396347032
0.9951862648401821
1.0053146301369864
0.9960100821917806
1.0053391780821919
0.9947373881278523
1.0058608219178105
1.0056580091324214
0.9852110319634697
1.0054473059360731
0.9950465753424638
1.0058237077625556
0.995704694063928
1.0054566575342463
0.9851026118721435
1.0059882374429243
1.0052523835616398
0.9956161461187207
1.0050863926940607
0.9955758173515932
1.0058052968036577
0.9953960913242028
1.0048014611872205
1.006336876712325
0.9847434520547935
1.0059712876712297

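Now, I'm pretty confident that at least the Arduino is accurate to 1 ms (which is also the target accuracy). The results tend to be off by ±5 ms, but now and then by as much as 15 ms, which is unacceptable. Is there a way to achieve greater accuracy, or is this a limitation of Python, the sound card, or something else? Thank you!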

EDIT: After incorporating tom10's and barny's suggestions, the code looks like this:

import pyaudio
import struct
import math
import psutil
import os


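def set_high_priority():
    # HIGH_PRIORITY_CLASS is a Windows-only psutil constant;
    # on Unix, nice() takes an integer niceness value instead.
    p = psutil.Process(os.getpid())
    p.nice(psutil.HIGH_PRIORITY_CLASS)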


SHORT_NORMALIZE = (1.0/32768.0)


def get_rms(block):
    # RMS amplitude is defined as the square root of the
    # mean over time of the square of the amplitude.
    # so we need to convert this string of bytes into
    # a string of 16-bit samples...

    # we will get one short out for each
    # two chars in the string.
    count = len(block) // 2  # integer division: two bytes per 16-bit sample
    fmt = "%dh" % count      # renamed so the built-in format() is not shadowed
    shorts = struct.unpack(fmt, block)

    # iterate over the block.
    sum_squares = 0.0
    for sample in shorts:
        # sample is a signed short in +/- 32768.
        # normalize it to 1.0
        n = sample * SHORT_NORMALIZE
        sum_squares += n*n

    return math.sqrt(sum_squares / count)


CHUNK = 4096
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100
RUNTIME_SECONDS = 10

set_high_priority()

p = pyaudio.PyAudio()

stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK)

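elapsed_time = 0
prev_detect_time = 0
amplitudes = []  # RMS of every group, kept for later inspection or plotting
TIME_PER_CHUNK = 1000 / RATE * CHUNK  # milliseconds
SAMPLE_GROUP_SIZE = 32  # 1 sample = 2 bytes; 32 samples is the closest fit to 1 ms (about 0.73 ms)
TIME_PER_GROUP = 1000 / RATE * SAMPLE_GROUP_SIZE  # milliseconds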

for i in range(0, int(RATE / CHUNK * RUNTIME_SECONDS)):
    data = stream.read(CHUNK)
    time_in_chunk = 0
    group_index = 0
    for j in range(0, len(data), (SAMPLE_GROUP_SIZE * 2)):
        group = data[j:(j + (SAMPLE_GROUP_SIZE * 2))]
        amplitude = get_rms(group)
        amplitudes.append(amplitude)
        if amplitude > 0.02:
            current_time = (elapsed_time + time_in_chunk)
            time_since_last_beat = current_time - prev_detect_time
            if time_since_last_beat > 500:
                print(time_since_last_beat)
                prev_detect_time = current_time
        time_in_chunk = (group_index+1) * TIME_PER_GROUP
        group_index += 1
    elapsed_time = (i+1) * TIME_PER_CHUNK

stream.stop_stream()
stream.close()
p.terminate()

With this code I got the following results (the units are milliseconds rather than seconds):

999.909297052154
999.9092970521542
999.9092970521542
999.9092970521542
999.9092970521542
1000.6349206349205
999.9092970521551
999.9092970521524
999.9092970521542
999.909297052156
999.9092970521542
999.9092970521542
999.9092970521524
999.9092970521542

Unless I've made a mistake, this looks much better than before and has reached sub-millisecond accuracy. My thanks to tom10 and barny for their help.
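A quick sanity check on these numbers: because the times are derived from sample positions in groups of 32 samples, every reported interval should be a whole number of groups, and the timing resolution is one group (32 / 44100 s, roughly 0.73 ms). The sketch below only redoes that arithmetic on two of the values listed above; the helper name `groups_in` is made up for illustration.

```python
RATE = 44100            # samples per second, as in the code above
SAMPLE_GROUP_SIZE = 32  # samples per RMS group, as in the code above

# Duration of one group in milliseconds: the smallest step the detector can report.
RESOLUTION_MS = 1000 * SAMPLE_GROUP_SIZE / RATE


def groups_in(interval_ms):
    """Number of 32-sample groups that fit in an interval (hypothetical helper)."""
    return interval_ms / RESOLUTION_MS


for interval_ms in (999.9092970521542, 1000.6349206349205):
    n = groups_in(interval_ms)
    # Each interval should land on a whole number of groups.
    print(f"{interval_ms:.4f} ms = {n:.6f} groups")
```

The two distinct values in the output list differ by exactly one group, which suggests the remaining jitter is the quantization step of the grouping rather than noise from the operating system.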

tom*_*m10 5

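The reason you're not getting the right timing for the beats is that you're missing chunks of the audio data. That is, the chunks are being read by the sound card, but you're not collecting the data before it is overwritten by the next chunk.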

First, though, for this problem you need to distinguish between the ideas of timing accuracy and real-time response.

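The timing accuracy of a sound card should be very good, much better than a millisecond, and you should be able to capture all of this accuracy in the data you read from the sound card. The real-time responsiveness of your computer's OS, by contrast, should be expected to be very bad, much worse than a millisecond. That is, you should easily be able to identify audio events (such as beats) to within a millisecond, but not identify them at the moment they happen (instead, 30-200 ms later, depending on your system). This arrangement usually works for computers because general human perception of the timing of events is much coarser than a millisecond (except for rare specialized perceptual systems, such as comparing auditory events between the two ears).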
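To make the distinction concrete, here is a minimal sketch of timing events by their position in the sample stream instead of by the wall clock. It is not the asker's code: the function name `detect_onsets`, its parameters, and the synthetic 8000 Hz test signal are all assumptions chosen for illustration, and the input is taken to be samples already normalized to the range -1.0 to 1.0.

```python
import math


def detect_onsets(samples, rate, group_size=32, threshold=0.02, min_gap_s=0.5):
    """Return onset times in seconds, computed only from sample positions.

    Hypothetical helper: the time of a group is start_index / rate, so it
    does not matter when the operating system actually delivered the data.
    """
    onsets = []
    last_onset = None
    for start in range(0, len(samples) - group_size + 1, group_size):
        group = samples[start:start + group_size]
        rms = math.sqrt(sum(s * s for s in group) / group_size)
        if rms > threshold:
            t = start / rate
            # Ignore further loud groups that belong to the same beat.
            if last_onset is None or t - last_onset > min_gap_s:
                onsets.append(t)
                last_onset = t
    return onsets


# Synthetic signal: 3 seconds of silence with a short burst at the start of each second.
rate = 8000
signal = [0.0] * (3 * rate)
for beat_start in (0, rate, 2 * rate):
    for k in range(80):
        signal[beat_start + k] = 0.5

onsets = detect_onsets(signal, rate)
intervals = [b - a for a, b in zip(onsets, onsets[1:])]
print(intervals)  # -> [1.0, 1.0]
```

The resolution of this approach is one group (`group_size / rate` seconds), and it stays the same however late the program gets around to processing the buffer, provided no samples are dropped.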

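The specific problem with your code is that `CHUNK` is much too small for the OS to keep up with querying the sound card. You have it at 32, so at 44100 Hz the OS needs to get to the sound card every 0.7 ms, which is too short a time for a computer that is tasked with doing many other things. If your OS doesn't get the chunk before the next one comes in, the original chunk is overwritten and lost.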
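The 0.7 ms figure follows directly from the chunk size and the sample rate. A small sketch of that arithmetic, using the two `CHUNK` values from the question plus 1024 for comparison (the helper name `chunk_duration_ms` is made up here):

```python
RATE = 44100  # samples per second, as in the question


def chunk_duration_ms(chunk_frames, rate=RATE):
    """Time the sound card needs to fill one buffer of chunk_frames frames."""
    return 1000.0 * chunk_frames / rate


for chunk in (32, 1024, 4096):
    print(f"CHUNK={chunk:5d} -> {chunk_duration_ms(chunk):6.2f} ms between reads")
```

This works out to roughly 0.73 ms, 23.2 ms and 92.9 ms respectively, which is how long the program has to come back and collect each buffer before the next one arrives.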

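To get this working in a way that is consistent with the constraints above, make `CHUNK` much larger than 32, more like 1024 (as in the PyAudio examples). Depending on your computer and what else it is doing, even that may not be long enough.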

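If this type of approach won't work for you, you will probably need a dedicated real-time system like an Arduino. (Generally, though, this isn't necessary, so think twice before you decide you need the Arduino. Usually, when I've seen people need true real-time behavior, it is when they are trying to do something very quantitative that is interactive with a human: flash a light, have the person tap a button, flash another light, have the person tap another button, and so on, to measure response times.)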