Ima*_*ess 2 sockets node.js webrtc google-speech-api
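In short, this is what I'm trying to do:

Browser/WebRTC audio ==> server-side (Node.js) socket.io server ==> Google Cloud

I'm using WebRTC in the browser to capture audio from the browser microphone. As the audio comes in, it is sent to a socket.io server as an object containing a base64 string. This part works: I can see the data when I log it on arrival.

Where I'm stuck is sending this stream on to the Google Cloud Speech API to have it transcribed.

The Google Cloud Speech documentation includes a quickstart app that streams microphone data to Google Speech and returns a live transcription. I managed to get it working, but it uses the computer's microphone. The app uses the Node package node-record-lpcm16 and SoX to access the computer's microphone and pipe the stream to the Google Cloud API.

The audio is sent to Google Cloud through the StreamingRecognize method on SpeechClient. A request object is passed to that method. The request object has a field called audio_content, which is where I think the incoming audio stream should go (???).

Below is the server file. It contains the socket.io instance and the code from the Google Cloud quickstart app that is used with the node-record-lpcm16 package.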

    let io = require('socket.io')(3000, {
      cors: {origin: ['http://localhost:8080']},
    })

    const speech = require('@google-cloud/speech');

    // Create a speech client
    const client = new speech.SpeechClient();

    const encoding = 'LINEAR16';
    const sampleRateHertz = 16000;
    const languageCode = 'en-US';

    // Speech client request header
    const request = {
      config: {
        encoding: encoding,
        sampleRateHertz: sampleRateHertz,
        languageCode: languageCode,
        enableAutomaticPunctuation: true,
      },
      interimResults: false, // If you want interim results, set this to true
    };

    // Create a recognize stream; this makes a request and waits for the response (transcription)
    const recognizeStream = client
      .streamingRecognize(request) // send the request passed to the streamingRecognize method
      .on('error', console.error) // log the error if one is returned
      .on('data', data => {
        console.log(data.results[0].alternatives[0].words)
        process.stdout.write(
          data.results[0] && data.results[0].alternatives[0]
            ? `Transcription: ${data.results[0].alternatives[0].transcript}\n`
            : '\n\nReached transcription time limit, press Ctrl+C\n'
        )
      });

    // Create socket and listen for audio stream from webRTC
    io.on('connection', socket => {
      console.log(socket.id)

      // TODO: how to send this stream to google speech?
      socket.on('audioStream', (obj) => {
        // obj is JSON object structured like this: {"audio_data": base64 string....}

        // verified here that stream is being received continuously
        console.log(obj)
      })
    })

    console.log('socket server running')

This is how the audio data is sent from the WebRTC script:

    socket.emit('audioStream',
      { audio_data: base64data.split('base64,')[1] }
    )

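The payload emitted above carries base64 text, whereas the stream returned by `streamingRecognize` in the Node client is written with raw audio bytes, and the client wraps each written chunk as the audio content of the request. A minimal sketch of the decode step follows. It assumes a hypothetical helper name, `forwardAudioChunk`, and uses a small stand-in object in place of `recognizeStream` so that it runs without Google credentials. The decoded bytes still have to match the `encoding` and `sampleRateHertz` declared in the request config, which depends on how the browser recorded them.

```javascript
// Hypothetical helper (not part of any library): decode one socket.io payload
// and write the raw bytes to a writable stream such as recognizeStream.
// `obj` has the shape sent by the browser: { audio_data: "<base64 string>" }.
function forwardAudioChunk(stream, obj) {
  if (!obj || typeof obj.audio_data !== "string") return false;
  const bytes = Buffer.from(obj.audio_data, "base64");
  if (bytes.length === 0) return false; // nothing usable in this chunk
  stream.write(bytes);
  return true;
}

// Stand-in for recognizeStream: it only records what was written to it.
const fakeRecognizeStream = {
  chunks: [],
  write(chunk) {
    this.chunks.push(chunk);
    return true;
  },
};

const sample = { audio_data: Buffer.from([1, 2, 3, 4]).toString("base64") };
const forwarded = forwardAudioChunk(fakeRecognizeStream, sample);
console.log(forwarded, fakeRecognizeStream.chunks[0]);
```

In the server file, the same helper would be called from the `audioStream` handler with the real `recognizeStream` as its first argument.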
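If you are only interested in transcribing the audio from the video, I suggest using the Web Audio API.

Here is how I did it with a Node.js server and a React client app. The code has also been uploaded to GitHub.

The Worklet.addModule() API requires the URL of a JavaScript file containing the module to add. See the documentation on MDN. Placing the file in the public folder lets our web app load it as a static file.

recorderWorkletProcessor.js (saved in public/src/worklets/recorderWorkletProcessor.js):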

    /**
     An in-place replacement for ScriptProcessorNode using AudioWorklet
    */
    class RecorderProcessor extends AudioWorkletProcessor {
      // 0. Determine the buffer size (this is the same as the 1st argument of ScriptProcessor)
      bufferSize = 2048;
      // 1. Track the current buffer fill level (counted in samples)
      _bytesWritten = 0;

      // 2. Create a buffer of fixed size
      _buffer = new Float32Array(this.bufferSize);

      constructor() {
        super();
        this.initBuffer();
      }

      initBuffer() {
        this._bytesWritten = 0;
      }

      isBufferEmpty() {
        return this._bytesWritten === 0;
      }

      isBufferFull() {
        return this._bytesWritten === this.bufferSize;
      }

      /**
       * @param {Float32Array[][]} inputs
       * @returns {boolean}
       */
      process(inputs) {
        // Grabbing the 1st channel similar to ScriptProcessorNode
        this.append(inputs[0][0]);

        return true;
      }

      /**
       *
       * @param {Float32Array} channelData
       */
      append(channelData) {
        if (this.isBufferFull()) {
          this.flush();
        }

        if (!channelData) return;

        for (let i = 0; i < channelData.length; i++) {
          this._buffer[this._bytesWritten++] = channelData[i];
        }
      }

      flush() {
        // trim the buffer if ended prematurely
        const buffer =
          this._bytesWritten < this.bufferSize
            ? this._buffer.slice(0, this._bytesWritten)
            : this._buffer;
        // `sampleRate` is the global exposed by AudioWorkletGlobalScope; it is the
        // actual rate of the AudioContext (the original hard-coded 44100 here,
        // which is wrong on devices that run at 48000 Hz).
        const result = this.downsampleBuffer(buffer, sampleRate, 16000);
        this.port.postMessage(result);
        this.initBuffer();
      }

      downsampleBuffer(buffer, inSampleRate, outSampleRate) {
        if (outSampleRate > inSampleRate) {
          throw new Error("downsampling rate should be smaller than original sample rate");
        }
        // When the rates are equal the ratio is 1, so this only converts to 16-bit PCM.
        var sampleRateRatio = inSampleRate / outSampleRate;
        var newLength = Math.round(buffer.length / sampleRateRatio);
        var result = new Int16Array(newLength);
        var offsetResult = 0;
        var offsetBuffer = 0;
        while (offsetResult < result.length) {
          var nextOffsetBuffer = Math.round((offsetResult + 1) * sampleRateRatio);
          var accum = 0,
            count = 0;
          for (var i = offsetBuffer; i < nextOffsetBuffer && i < buffer.length; i++) {
            accum += buffer[i];
            count++;
          }

          // Average the input samples, clamp to [-1, 1], then scale to 16-bit
          var mean = count > 0 ? accum / count : 0;
          result[offsetResult] = Math.max(-1, Math.min(1, mean)) * 0x7fff;
          offsetResult++;
          offsetBuffer = nextOffsetBuffer;
        }
        return result.buffer;
      }
    }

    registerProcessor("recorder.worklet", RecorderProcessor);

Install socket.io-client on the front end:

    npm i socket.io-client

React component code:

    /* eslint-disable react-hooks/exhaustive-deps */
    import { default as React, useEffect, useState, useRef } from "react";
    import { Button } from "react-bootstrap";
    import Container from "react-bootstrap/Container";
    import * as io from "socket.io-client";

    const sampleRate = 16000;

    const getMediaStream = () =>
      navigator.mediaDevices.getUserMedia({
        audio: {
          deviceId: "default",
          sampleRate: sampleRate,
          sampleSize: 16,
          channelCount: 1,
        },
        video: false,
      });

    interface WordRecognized {
      final: boolean;
      text: string;
    }

    const AudioToText: React.FC = () => {
      const [connection, setConnection] = useState<io.Socket>();
      const [currentRecognition, setCurrentRecognition] = useState<string>();
      const [recognitionHistory, setRecognitionHistory] = useState<string[]>([]);
      const [isRecording, setIsRecording] = useState<boolean>(false);
      const [recorder, setRecorder] = useState<any>();
      const processorRef = useRef<any>();
      const audioContextRef = useRef<any>();
      const audioInputRef = useRef<any>();

      // Final results go into the history; interim results replace the current line
      const speechRecognized = (data: WordRecognized) => {
        if (data.final) {
          setCurrentRecognition("...");
          setRecognitionHistory((old) => [data.text, ...old]);
        } else setCurrentRecognition(data.text + "...");
      };

      const connect = () => {
        connection?.disconnect();
        const socket = io.connect("http://localhost:8081");
        socket.on("connect", () => {
          console.log("connected", socket.id);
          setConnection(socket);
        });

        socket.emit("send_message", "hello world");

        socket.emit("startGoogleCloudStream");

        socket.on("receive_message", (data) => {
          console.log("received message", data);
        });

        socket.on("receive_audio_text", (data) => {
          speechRecognized(data);
          console.log("received audio text", data);
        });

        socket.on("disconnect", () => {
          console.log("disconnected", socket.id);
        });
      };

      const disconnect = () => {
        if (!connection) return;
        connection?.emit("endGoogleCloudStream");
        connection?.disconnect();
        processorRef.current?.disconnect();
        audioInputRef.current?.disconnect();
        audioContextRef.current?.close();
        setConnection(undefined);
        setRecorder(undefined);
        setIsRecording(false);
      };

      // Once the socket is connected, wire microphone -> worklet -> socket
      useEffect(() => {
        (async () => {
          if (connection) {
            if (isRecording) {
              return;
            }

            const stream = await getMediaStream();

            audioContextRef.current = new window.AudioContext();

            await audioContextRef.current.audioWorklet.addModule(
              "/src/worklets/recorderWorkletProcessor.js"
            );

            audioContextRef.current.resume();

            audioInputRef.current =
              audioContextRef.current.createMediaStreamSource(stream);

            processorRef.current = new AudioWorkletNode(
              audioContextRef.current,
              "recorder.worklet"
            );

            processorRef.current.connect(audioContextRef.current.destination);
            audioContextRef.current.resume();

            audioInputRef.current.connect(processorRef.current);

            // Each message from the worklet is a buffer of 16-bit PCM samples
            processorRef.current.port.onmessage = (event: any) => {
              const audioData = event.data;
              connection.emit("send_audio_data", { audio: audioData });
            };
            setIsRecording(true);
          } else {
            console.error("No connection");
          }
        })();
        return () => {
          if (isRecording) {
            processorRef.current?.disconnect();
            audioInputRef.current?.disconnect();
            if (audioContextRef.current?.state !== "closed") {
              audioContextRef.current?.close();
            }
          }
        };
      }, [connection, isRecording, recorder]);

      return (
        <React.Fragment>
          <Container className="py-5 text-center">
            <Container fluid className="py-5 bg-primary text-light text-center ">
              <Container>
                <Button
                  className={isRecording ? "btn-danger" : "btn-outline-light"}
                  onClick={connect}
                  disabled={isRecording}
                >
                  Start
                </Button>
                <Button
                  className="btn-outline-light"
                  onClick={disconnect}
                  disabled={!isRecording}
                >
                  Stop
                </Button>
              </Container>
            </Container>
            <Container className="py-5 text-center">
              {recognitionHistory.map((tx, idx) => (
                <p key={idx}>{tx}</p>
              ))}
              <p>{currentRecognition}</p>
            </Container>
          </Container>
        </React.Fragment>
      );
    };

    export default AudioToText;

server.js:

    const express = require("express");
    const speech = require("@google-cloud/speech");

    // use logger
    const logger = require("morgan");

    // use body parser
    const bodyParser = require("body-parser");

    // use cors
    const cors = require("cors");

    const http = require("http");
    const { Server } = require("socket.io");

    const app = express();

    app.use(cors());
    app.use(logger("dev"));

    app.use(bodyParser.json());

    const server = http.createServer(app);

    const io = new Server(server, {
      cors: {
        origin: "http://localhost:3000",
        methods: ["GET", "POST"],
      },
    });

    // TODO: run in terminal first to set up credentials:
    // export GOOGLE_APPLICATION_CREDENTIALS="./speech-to-text-key.json"

    const speechClient = new speech.SpeechClient();

    io.on("connection", (socket) => {
      let recognizeStream = null;
      console.log("** a user connected - " + socket.id + " **\n");

      socket.on("disconnect", () => {
        console.log("** user disconnected ** \n");
      });

      socket.on("send_message", (message) => {
        console.log("message: " + message);
        setTimeout(() => {
          io.emit("receive_message", "got this message" + message);
        }, 1000);
      });

      socket.on("startGoogleCloudStream", function (data) {
        // Pass the socket explicitly instead of relying on `this`
        startRecognitionStream(socket, data);
      });

      socket.on("endGoogleCloudStream", function () {
        console.log("** ending google cloud stream **\n");
        stopRecognitionStream();
      });

      socket.on("send_audio_data", async (audioData) => {
        io.emit("receive_message", "Got audio data");
        if (recognizeStream !== null) {
          try {
            recognizeStream.write(audioData.audio);
          } catch (err) {
            console.log("Error calling google api " + err);
          }
        } else {
          console.log("RecognizeStream is null");
        }
      });

      function startRecognitionStream(client) {
        console.log("* StartRecognitionStream\n");
        try {
          recognizeStream = speechClient
            .streamingRecognize(request)
            .on("error", console.error)
            .on("data", (data) => {
              const result = data.results[0];
              // A response can arrive without results; skip it instead of throwing
              if (!result) return;
              const isFinal = result.isFinal;

              const transcription = data.results
                .map((result) => result.alternatives[0].transcript)
                .join("\n");

              console.log(`Transcription: `, transcription);

              client.emit("receive_audio_text", {
                text: transcription,
                final: isFinal,
              });
            });
        } catch (err) {
          console.error("Error streaming google api " + err);
        }
      }

      function stopRecognitionStream() {
        if (recognizeStream) {
          console.log("* StopRecognitionStream \n");
          recognizeStream.end();
        }
        recognizeStream = null;
      }
    });

    server.listen(8081, () => {
      console.log("WebSocket server listening on port 8081.");
    });

    // =========================== GOOGLE CLOUD SETTINGS ================================ //

    // The encoding of the audio file, e.g. 'LINEAR16'
    // The sample rate of the audio file in hertz, e.g. 16000
    // The BCP-47 language code to use, e.g. 'en-US'
    const encoding = "LINEAR16";
    const sampleRateHertz = 16000;
    const languageCode = "en-US"; //en-US
    const alternativeLanguageCodes = ["en-US", "ko-KR"];

    const request = {
      config: {
        encoding: encoding,
        sampleRateHertz: sampleRateHertz,
        languageCode: languageCode,
        //alternativeLanguageCodes: alternativeLanguageCodes,
        enableWordTimeOffsets: true,
        enableAutomaticPunctuation: true,
        enableWordConfidence: true,
        enableSpeakerDiarization: true,
        diarizationSpeakerCount: 2,
        model: "video",
        //model: "command_and_search",
        useEnhanced: true,
        speechContexts: [
          {
            phrases: ["hello", "안녕하세요"],
          },
        ],
      },
      interimResults: true,
    };

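The request config above declares LINEAR16 at 16000 Hz, so whatever the worklet posts to the server has to be 16-bit PCM at that rate. The conversion can be exercised outside the browser. The following is only a sketch: `downsampleToInt16` is a hypothetical standalone function that follows the same averaging approach as the worklet, and the 44100 Hz input rate is an assumed example value.

```javascript
// Hypothetical standalone version of the worklet's conversion step:
// averages a window of input samples into each output sample,
// then scales the result to 16-bit signed PCM.
function downsampleToInt16(samples, inRate, outRate) {
  if (outRate > inRate) {
    throw new Error("output sample rate must not exceed input sample rate");
  }
  const ratio = inRate / outRate;
  const out = new Int16Array(Math.round(samples.length / ratio));
  let start = 0;
  for (let i = 0; i < out.length; i++) {
    const end = Math.min(samples.length, Math.round((i + 1) * ratio));
    let sum = 0;
    for (let j = start; j < end; j++) sum += samples[j];
    const mean = end > start ? sum / (end - start) : 0;
    // Clamp to [-1, 1] before scaling so loud input cannot wrap around.
    out[i] = Math.max(-1, Math.min(1, mean)) * 0x7fff;
    start = end;
  }
  return out;
}

// 10 ms of audio at 44.1 kHz is 441 samples; at 16 kHz it is 160 samples.
const tenMs = new Float32Array(441).fill(0.5);
const pcm = downsampleToInt16(tenMs, 44100, 16000);
console.log(pcm.length, pcm[0]);
```

Averaging each window is a crude low-pass filter. It is usually adequate for speech, but it is not a substitute for a proper resampler if audio quality matters.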