How to feed live speaker audio into the Azure Speech SDK for real-time translation?
I am looking to build an application that translates, live, any audio coming out of my speakers. That way I could translate any video conference from any streaming application (YouTube, Teams, Zoom, etc.). I am not far from a solution, but not quite there yet.

The source language is fr-CA or en-US, and the destination language is fr-CA or en-US.
I am able to capture the audio stream from the speakers using a custom build of PyAudio that enables loopback through WASAPI on Windows (https://github.com/intxcc/pyaudio_portaudio). The next step is to pipe that stream in real time into the Azure translation API via speechsdk.
So far the part that captures the stream from the speakers works, but when I feed it to Azure I get no errors and no results either. In fact, roughly every 30 seconds I receive a result with reason=ResultReason.NoMatch, or a nonsensical transcription.

My first guess is that the byte stream coming from the speakers is 48 kHz with 2 channels, which the Azure stream does not support. (I think I read somewhere online that it only supports 16 kHz with 1 channel, but I am not sure.) If that is the case, I have found a way to merge the two channels into one, but I do not know how to downsample a chunk of bytes from 48 kHz to 16 kHz in real time.
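For reference, collapsing interleaved 16-bit stereo PCM bytes into a single channel can be sketched with numpy (a hypothetical helper, not part of the code below; `frame` is one chunk of interleaved stereo samples):

```python
import numpy as np

def stereo_to_mono(frame: bytes) -> bytes:
    """Average the two channels of interleaved 16-bit stereo PCM into mono."""
    samples = np.frombuffer(frame, dtype=np.int16)
    stereo = samples.reshape(-1, 2)              # one row per frame: [left, right]
    mono = stereo.mean(axis=1).astype(np.int16)  # average the two channels
    return mono.tobytes()
```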
Any help would be greatly appreciated! Thanks. Here is my code:
import time
import azure.cognitiveservices.speech as speechsdk
import pyaudio
import numpy as np

speech_key, service_region = "", "westus"
finalResultSRC = ""
finalResultDst = ""
RATE = 48000
KHz_RATE = int(RATE / 1000)
CHUNK = int(RATE)

def translation_continuous():
    """performs continuous speech translation from an audio stream"""
    # <TranslationContinuous>
    # set up translation parameters: source language and target languages
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region,
        speech_recognition_language='fr-CA')
    # setup the audio stream
    audioFormat = speechsdk.audio.AudioStreamFormat(
        samples_per_second=KHz_RATE, bits_per_sample=16, channels=2)
    stream = speechsdk.audio.PushAudioInputStream(audioFormat)
    translation_config.add_target_language("en-US")
    stream = speechsdk.audio.PushAudioInputStream()
    audio_config = speechsdk.audio.AudioConfig(stream=stream)
    # Creates a translation recognizer using an audio stream as input.
    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config, audio_config=audio_config)

    def result_callback(event_type, evt):
        """callback to display a translation result"""
        # print("{}: {}\n\tTranslations: {}\n\tResult Json: {}".format(
        #     event_type, evt, evt.result.translations.items(), evt.result.json))
        print(evt)
        if event_type == "RECOGNIZING":
            # Translation
            print(evt.result.translations.items()[0][1])
            # Original
            # print(type(evt.result.json))

    done = False

    def stop_cb(evt):
        """callback that signals to stop continuous recognition upon receiving an event `evt`"""
        print('CLOSING on {}'.format(evt))
        nonlocal done
        done = True

    # connect callback functions to the events fired by the recognizer
    recognizer.session_started.connect(
        lambda evt: print('SESSION STARTED: {}'.format(evt)))
    recognizer.session_stopped.connect(
        lambda evt: print('SESSION STOPPED {}'.format(evt)))
    # event for intermediate results
    recognizer.recognizing.connect(
        lambda evt: result_callback('RECOGNIZING', evt))
    # event for final result
    recognizer.recognized.connect(
        lambda evt: result_callback('RECOGNIZED', evt))
    # cancellation event
    recognizer.canceled.connect(lambda evt: print(
        'CANCELED: {} ({})'.format(evt, evt.reason)))
    # stop continuous recognition on either session stopped or canceled events
    recognizer.session_stopped.connect(stop_cb)
    recognizer.canceled.connect(stop_cb)

    def synthesis_callback(evt):
        """callback for the synthesis event"""
        print('SYNTHESIZING {}\n\treceived {} bytes of audio. Reason: {}'.format(
            evt, len(evt.result.audio), evt.result.reason))

    # connect callback to the synthesis event
    recognizer.synthesizing.connect(synthesis_callback)
    # start translation
    recognizer.start_continuous_recognition()
    # start pushing data from the speakers into the stream
    try:
        p = pyaudio.PyAudio()
        pstream = p.open(
            format=pyaudio.paInt16, channels=2, rate=RATE, input=True,
            frames_per_buffer=CHUNK, input_device_index=5, as_loopback=True)
        while True:
            frame = pstream.read(CHUNK)
            # frames = wav_fh.readframes(n_bytes)
            # print('read {} bytes'.format(len(frames)))
            # if not frames:
            #     print('break')
            #     break
            if frame:
                # ch1 = cutChannelFromStream(frame, 1, 2)
                print('got frame from speakers')
                stream.write(frame)
            time.sleep(1)
    finally:
        # stop recognition and clean up
        stream.close()
        recognizer.stop_continuous_recognition()
        print(finalResultSRC)
        # recognizer.stop_continuous_recognition()
    # </TranslationContinuous>

translation_continuous()
Solution
I found a working solution. I did indeed have to downsample to 16,000 Hz and use a single channel. I based my code on this solution, but fed it stream chunks instead of reading from a file.

My function is:
import audioop

def downsampleFrames(data, inrate=48000, outrate=16000, inchannels=2, outchannels=1):
    try:
        # ratecv resamples interleaved PCM; width=2 bytes per sample (16-bit)
        converted = audioop.ratecv(data, 2, inchannels, inrate, outrate, None)
        if outchannels == 1:
            # tomono(fragment, width, lfactor, rfactor): keep the left channel
            converted = audioop.tomono(converted[0], 2, 1, 0)
    except Exception:
        print('Failed to downsample')
        return False
    return converted
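Note that the `audioop` module is deprecated (and removed in Python 3.13). A rough equivalent can be sketched with numpy alone, using plain decimation of the left channel; this is a hypothetical replacement, cruder than `ratecv`'s interpolation (no anti-aliasing filter), but workable for speech:

```python
import numpy as np

def downsample_48k_stereo_to_16k_mono(data: bytes) -> bytes:
    """Convert interleaved 16-bit stereo PCM at 48 kHz to 16 kHz mono by decimation."""
    samples = np.frombuffer(data, dtype=np.int16).reshape(-1, 2)
    left = samples[:, 0]        # keep the left channel, mirroring tomono(..., 1, 0)
    return left[::3].tobytes()  # every 3rd sample: 48 kHz / 3 = 16 kHz
```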
From PyAudio, I send the chunks of data like this:
p = pyaudio.PyAudio()
pstream = p.open(
    format=pyaudio.paInt16, channels=2, rate=RATE, input=True,
    frames_per_buffer=CHUNK, input_device_index=5, as_loopback=True)
while True:
    frame = pstream.read(CHUNK)
    if frame:
        downFrame = downsampleFrames(frame)
        stream.write(downFrame)