Streaming audio from the microphone to the IBM Watson SpeechToText web service using the Java SDK

Rob*_*zuk 5 java speech-to-text ibm-watson

I'm trying to send a continuous stream of audio from the microphone directly to the IBM Watson SpeechToText web service using the Java SDK. One of the examples shipped with the distribution (RecognizeUsingWebSocketsExample) shows how to stream a file in .WAV format to the service. However, .WAV files require that their length be specified in advance, so the naive approach of just appending one buffer at a time to the file is not feasible.
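To make that constraint concrete (a minimal sketch, not part of the SDK example; outputStream stands for any hypothetical destination): the Java sound API itself refuses to serialize a live, unbounded stream as WAV, because the RIFF header needs the total data size up front.

// A line-backed stream has frame length AudioSystem.NOT_SPECIFIED, and per the
// AudioSystem.write javadoc, file types that carry a length in their header
// cannot be written from a stream of unknown length.
AudioInputStream live = new AudioInputStream(line);  // 'line' is an open TargetDataLine
AudioSystem.write(live, AudioFileFormat.Type.WAVE, outputStream);  // throws IOException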

It appears that SpeechToText.recognizeUsingWebSocket can take a stream, but feeding it an instance of AudioInputStream does not seem to do it: the connection gets established, but no transcripts are returned, even with RecognizeOptions.interimResults(true).

public class RecognizeUsingWebSocketsExample {
  private static CountDownLatch lock = new CountDownLatch(1);

  public static void main(String[] args) throws FileNotFoundException, InterruptedException {
    SpeechToText service = new SpeechToText();
    service.setUsernameAndPassword("<username>", "<password>");

    AudioInputStream audio = null;

    try {
      final AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
      DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
      TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
      line.open(format);
      line.start();
      audio = new AudioInputStream(line);
    } catch (LineUnavailableException e) {
      e.printStackTrace();
    }

    RecognizeOptions options = new RecognizeOptions.Builder()
        .continuous(true)
        .interimResults(true)
        .contentType(HttpMediaType.AUDIO_WAV)
        .build();

    service.recognizeUsingWebSocket(audio, options, new BaseRecognizeCallback() {
      @Override
      public void onTranscription(SpeechResults speechResults) {
        System.out.println(speechResults);
        if (speechResults.isFinal())
          lock.countDown();
      }
    });

    lock.await(1, TimeUnit.MINUTES);
  }
}

Any help would be greatly appreciated.

-rg

Below is an update based on German's comment (thank you for that).

Using javaFlacEncode, I was able to convert the WAV stream arriving from the microphone into a FLAC stream and save it to a temporary file. Unlike a WAV audio file, whose size is fixed at creation, the FLAC file can be appended to easily.

    WAV_audioInputStream = new AudioInputStream(line);
    FileInputStream FLAC_audioInputStream = new FileInputStream(tempFile);

    StreamConfiguration streamConfiguration = new StreamConfiguration();
    streamConfiguration.setSampleRate(16000);
    streamConfiguration.setBitsPerSample(8);
    streamConfiguration.setChannelCount(1);

    flacEncoder = new FLACEncoder();
    flacOutputStream = new FLACFileOutputStream(tempFile);  // write to temp disk file

    flacEncoder.setStreamConfiguration(streamConfiguration);
    flacEncoder.setOutputStream(flacOutputStream);

    flacEncoder.openFLACStream();

    ...
    // convert data
    int frameLength = 16000;
    int[] intBuffer = new int[frameLength];
    byte[] byteBuffer = new byte[frameLength];

    int count;
    while ((count = WAV_audioInputStream.read(byteBuffer, 0, frameLength)) > 0) {
        for (int j1 = 0; j1 < count; j1++)
            intBuffer[j1] = byteBuffer[j1];

        flacEncoder.addSamples(intBuffer, count);
        flacEncoder.encodeSamples(count, false);  // 'false' means non-final frame
    }

    flacEncoder.encodeSamples(flacEncoder.samplesAvailableToEncode(), true);  // final frame
    WAV_audioInputStream.close();
    flacOutputStream.close();
    FLAC_audioInputStream.close();

After appending an arbitrary number of frames, the resulting file can be analyzed (using curl or recognizeUsingWebSocket()) without any problem. However, recognizeUsingWebSocket() returns the final result as soon as it reaches the end of the FLAC file, even though the file's last frame may not be final (i.e., it was written with encodeSamples(count, false)).

I would expect recognizeUsingWebSocket() to block until the final frame has been written to the file. In practice, this means the analysis stops after the first frame, since analyzing the first frame takes less time than collecting the second, so by the time the results come back, the end of the file has been reached.

Is this the right way to implement streaming audio from a microphone in Java? It seems like a common use case.
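One conceivable workaround for this end-of-file race (a sketch with hypothetical names; not part of the SDK or javaFlacEncode) is to wrap the growing temp file in an InputStream that blocks at EOF until the encoder flags completion, instead of letting the recognizer see end-of-file:

import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper: reports EOF only after 'complete' is set, so the
// recognizer keeps waiting while the encoder is still appending frames.
class TailingInputStream extends InputStream {
    private final InputStream in;
    private final AtomicBoolean complete;

    TailingInputStream(InputStream in, AtomicBoolean complete) {
        this.in = in;
        this.complete = complete;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n;
        while ((n = in.read(buf, off, len)) == -1 && !complete.get()) {
            try {
                Thread.sleep(50);  // poll until the writer appends more data
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return n;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        return read(one, 0, 1) == -1 ? -1 : one[0] & 0xFF;
    }
}

FLAC_audioInputStream above would then become new TailingInputStream(new FileInputStream(tempFile), encodingDone), with encodingDone set to true right after the final encodeSamples(..., true) call.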


Below is a modification of RecognizeUsingWebSocketsExample, incorporating some of Daniel's suggestions below. It uses the PCM content type (passed as a String, together with the sample rate), and an attempt to signal the end of the audio stream, albeit not a very successful one.

As before, the connection is made, but the recognize callback is never called. Closing the stream does not seem to be interpreted as an end of audio, either. I must be misunderstanding something here...

public static void main(String[] args) throws IOException, LineUnavailableException, InterruptedException {

  final PipedOutputStream output = new PipedOutputStream();
  final PipedInputStream input = new PipedInputStream(output);

  final AudioFormat format = new AudioFormat(16000, 8, 1, true, false);
  DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
  final TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
  line.open(format);
  line.start();

  Thread thread1 = new Thread(new Runnable() {
    @Override
    public void run() {
      try {
        final int MAX_FRAMES = 2;
        byte buffer[] = new byte[16000];
        for (int j1 = 0; j1 < MAX_FRAMES; j1++) {  // read two frames from microphone
          int count = line.read(buffer, 0, buffer.length);
          System.out.println("Read audio frame from line: " + count);
          output.write(buffer, 0, buffer.length);
          System.out.println("Written audio frame to pipe: " + count);
        }
        /** no need to fake end-of-audio;  StopMessage will be sent
         * automatically by SDK once the pipe is drained (see WebSocketManager)
        // signal end of audio; based on WebSocketUploader.stop() source
        byte[] stopData = new byte[0];
        output.write(stopData);
        **/
      } catch (IOException e) {
      }
    }
  });
  thread1.start();

  final CountDownLatch lock = new CountDownLatch(1);

  SpeechToText service = new SpeechToText();
  service.setUsernameAndPassword("<username>", "<password>");

  RecognizeOptions options = new RecognizeOptions.Builder()
      .continuous(true)
      .interimResults(false)
      .contentType("audio/pcm; rate=16000")
      .build();

  service.recognizeUsingWebSocket(input, options, new BaseRecognizeCallback() {
    @Override
    public void onConnected() {
      System.out.println("Connected.");
    }

    @Override
    public void onTranscription(SpeechResults speechResults) {
      System.out.println("Received results.");
      System.out.println(speechResults);
      if (speechResults.isFinal())
        lock.countDown();
    }
  });

  System.out.println("Waiting for STT callback ... ");

  lock.await(5, TimeUnit.SECONDS);

  line.stop();

  System.out.println("Done waiting for STT callback.");
}

Dani, I instrumented the source of WebSocketManager (which comes with the SDK) and replaced a call to sendMessage() with an explicit StopMessage payload, as follows:

    /**
     * Send input stream.
     *
     * @param inputStream the input stream
     * @throws IOException Signals that an I/O exception has occurred.
     */
    private void sendInputSteam(InputStream inputStream) throws IOException {
      int cumulative = 0;
      byte[] buffer = new byte[FOUR_KB];
      int read;
      while ((read = inputStream.read(buffer)) > 0) {
        cumulative += read;
        if (read == FOUR_KB) {
          socket.sendMessage(RequestBody.create(WebSocket.BINARY, buffer));
        } else {
          System.out.println("completed sending " + cumulative/16000 + " frames over socket");
          socket.sendMessage(RequestBody.create(WebSocket.BINARY, Arrays.copyOfRange(buffer, 0, read)));  // partial buffer write
          System.out.println("signaling end of audio");
          socket.sendMessage(RequestBody.create(WebSocket.TEXT, buildStopMessage().toString()));  // end of audio signal
        }
      }
      inputStream.close();
    }

Neither of the sendMessage() options (sending 0-length binary content, or sending the stop text message) seems to work. The caller code is unchanged from the above. The resulting output is:

Waiting for STT callback ... 
Connected.
Read audio frame from line: 16000
Written audio frame to pipe: 16000
Read audio frame from line: 16000
Written audio frame to pipe: 16000
completed sending 2 frames over socket
onFailure: java.net.SocketException: Software caused connection abort: socket write error

Revised: actually, the end-of-audio call is never reached. An exception is thrown while writing the last (partial) buffer to the socket.

Why is the connection being aborted? That normally happens when the peer closes the connection.

As for point 2): does either of these matter at this stage? It appears the recognition process is never getting started at all... The audio is valid (I wrote the stream to disk and was able to recognize it by streaming it from the file, as I pointed out above).

Also, on further review of the WebSocketManager source, onMessage() already sends the StopMessage immediately upon return from sendInputSteam() (i.e., when the audio stream, or the pipe in the example above, drains), so there is no need to call it explicitly. The problem is definitely happening before the transmission of the audio data completes. The behavior is the same whether a PipedInputStream or an AudioInputStream is passed as input; the exception is thrown while sending binary data in both cases.

Ger*_*sio 6

The Java SDK supports this and has an example of it.

Update your pom.xml with:

 <dependency>
   <groupId>com.ibm.watson.developer_cloud</groupId>
   <artifactId>java-sdk</artifactId>
   <version>3.3.1</version>
 </dependency>

Here is an example of how to listen to the microphone.

SpeechToText service = new SpeechToText();
service.setUsernameAndPassword("<username>", "<password>");

// Signed PCM AudioFormat with 16kHz, 16 bit sample size, mono
int sampleRate = 16000;
AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);
DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);

if (!AudioSystem.isLineSupported(info)) {
  System.out.println("Line not supported");
  System.exit(0);
}

TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
line.open(format);
line.start();

AudioInputStream audio = new AudioInputStream(line);

RecognizeOptions options = new RecognizeOptions.Builder()
  .continuous(true)
  .interimResults(true)
  .timestamps(true)
  .wordConfidence(true)
  //.inactivityTimeout(5) // use this to stop listening when the speaker pauses, i.e. for 5s
  .contentType(HttpMediaType.AUDIO_RAW + "; rate=" + sampleRate)
  .build();

service.recognizeUsingWebSocket(audio, options, new BaseRecognizeCallback() {
  @Override
  public void onTranscription(SpeechResults speechResults) {
    System.out.println(speechResults);
  }
});

System.out.println("Listening to your voice for the next 30s...");
Thread.sleep(30 * 1000);

// Closing the WebSocket's underlying InputStream will close the WebSocket itself.
line.stop();
line.close();

System.out.println("Fin.");


Dan*_*nos 0

What you need to do is feed the STT service not a file, but a headerless stream of audio samples. You simply supply the samples captured from the microphone over the WebSocket. You need to set the content type to "audio/pcm; rate=16000", where 16000 is the sample rate in Hz. If your sample rate is different (depending on how the microphone encodes the audio), replace 16000 with your value, e.g. 44100, 48000, etc.
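For example (a sketch that reuses the TargetDataLine named line from the question's code; option names as used elsewhere in this thread):

// Derive the rate parameter from the line's actual format, so the declared
// content type always matches what the microphone delivers.
int rate = (int) line.getFormat().getSampleRate();
RecognizeOptions options = new RecognizeOptions.Builder()
    .interimResults(true)
    .contentType("audio/pcm; rate=" + rate)
    .build();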

When fed pcm audio, the STT service will not stop recognition until you signal the end of audio by sending an empty binary message over the WebSocket.

Dani


Looking at the new version of your code, I see a few problems:

1) You can signal the end of audio by sending an empty binary message over the WebSocket, but that is not what you are doing. The lines

 // signal end of audio; based on WebSocketUploader.stop() source
 byte[] stopData = new byte[0];
 output.write(stopData);

do nothing, because they do not cause an empty WebSocket message to be sent. Can you call the WebSocketUploader.stop() method instead?
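For reference, in terms of the okhttp WebSocket that the SDK's WebSocketManager wraps (a sketch mirroring the sendMessage() calls quoted earlier in this thread; socket is the manager's internal field, so this only applies when patching the SDK as above), the empty end-of-audio frame would be:

// An empty binary frame marks end-of-audio for an audio/pcm stream.
socket.sendMessage(RequestBody.create(WebSocket.BINARY, new byte[0]));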

2) You are capturing audio at 8 bits per sample; to get decent quality you should capture at 16 bits. Also, you are only feeding a couple of seconds of audio, which is not ideal for testing. Could you write whatever audio you push to STT to a file, and then open it with Audacity (using the import feature)? That way you can make sure the audio you are feeding STT is good.
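A minimal sketch of that sanity check (class and file names are hypothetical; writing to a File lets AudioSystem.write patch the WAV header after the fact, so the unknown live-stream length is not a problem, and Audacity can then open the result directly rather than via raw import):

import javax.sound.sampled.*;
import java.io.File;

public class MicToWav {
    public static void main(String[] args) throws Exception {
        // 16 kHz, 16-bit, mono, signed, little-endian - the format STT expects
        AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
        final TargetDataLine line = (TargetDataLine) AudioSystem.getLine(
                new DataLine.Info(TargetDataLine.class, format));
        line.open(format);
        line.start();

        // Stop capturing after 10 seconds; closing the line ends the stream.
        new Thread(new Runnable() {
            @Override
            public void run() {
                try { Thread.sleep(10000); } catch (InterruptedException ignored) { }
                line.stop();
                line.close();
            }
        }).start();

        // Blocks until the line is closed, then patches the RIFF sizes in the header.
        AudioSystem.write(new AudioInputStream(line), AudioFileFormat.Type.WAVE,
                new File("mic-check.wav"));
        System.out.println("Wrote mic-check.wav");
    }
}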