Integration
Objective
In this article you will learn how to integrate real-time transcription using native Python code. The objective is to show how the different components of real-time transcription work and the concepts behind them, so that you can draw parallels when implementing it in your programming language of choice.
Requirements
To follow along with this guide, install the following Python packages: requests, pyaudio, and websocket-client.
Before installing PyAudio you might need portaudio installed.
- For Debian/Ubuntu Linux
apt install portaudio19-dev
- For macOS
brew install portaudio
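Once portaudio is available, the Python packages used in this guide (those imported in the snippets below) can be installed with pip:
pip install requests pyaudio websocket-client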
Concepts
Get List of Supported Languages
The first step is to verify that the language you want to transcribe in real time is supported by the VoiceAI platform. To check, you can fetch the supported languages using the following code:
import requests
import os
BASE = "https://voice.neuralspace.ai/api/v2/languages?type=stream"
API_KEY = os.environ.get("NS_API_KEY")
assert API_KEY is not None, "set the NS_API_KEY environment variable"
headers = {
"Authorization": API_KEY
}
response = requests.get(BASE, headers=headers)
assert response.status_code == 200
print(f"Languages supported: {response.json()['data']['languages']}")
For the purpose of this guide, all the examples will be based on the English (en) language.
Short-lived Token
The first step to start streaming with the NeuralSpace APIs is to get a short-lived token, i.e., a token that can be configured to expire after a certain amount of time.
DURATION = 600
TOKEN_URL = f"https://voice.neuralspace.ai/api/v2/token?duration={DURATION}"
response = requests.get(TOKEN_URL, headers=headers)
assert response.status_code == 200
TOKEN = response.json()['data']['token']
WebSockets
NeuralSpace uses WebSockets for real-time transcription. Using websockets involves three main steps:
- create a connection to the websocket
- send chunks of audio to the websocket
- receive the results in a different thread
In the given examples the Python client package used is websocket-client, which can be installed by running the following in a shell:
pip install websocket-client
- It is not necessary to receive data for every chunk of audio that is sent. In the backend your data is buffered, and only once certain conditions are met, e.g. at least 1 second of new audio has been collected, does the backend process the data and return a result.
- Hence, it is better to have separate threads for sending and receiving data over the websocket; otherwise your main thread can get stuck on a blocking call.
Setting the Options and Creating the WebSocket
You can configure various parameters, such as the language, as well as advanced parameters like chunk size, silence thresholds, etc.
The different parameters that can be set are listed below:
Parameter | Required | Default | Type | All options | Description |
---|---|---|---|---|---|
language | Yes | None | string | en , en-in , ar , ar-ms , ar-eg , ar-sa , hi , tl | Set audio language for real-time transcription. |
max_chunk_size | No | 5 | int | Any integer value >= min_chunk_size | Specifies the maximum size, in seconds, of the audio chunks that are processed for full results even if the speaker has not stopped. Larger chunks might provide better transcription accuracy but make you wait longer for full results. |
vad_threshold | No | 0.5 | float | Any positive floating point value | This threshold determines the sensitivity of the Voice Activity Detection (VAD) mechanism. VAD is used to detect the presence of speech in the audio stream. A lower value will make the detection more sensitive to noise, while a higher value might skip quieter speech parts. |
disable_partial | No | False | bool | True or False | When set to True, the system will not return partial transcriptions (i.e., intermediate transcriptions before the speech segment is completed). When set to False, you might receive multiple transcriptions for the same speech segment, with each one being more complete than the previous. |
format | No | pcm_16k | string | pcm_8k or pcm_16k | If the audio being streamed is 8 kHz, set the format to pcm_8k; if it is 16 kHz, set it to pcm_16k. If omitted, it defaults to pcm_16k. |
Below is the code snippet required to open a websocket with all of the above-mentioned parameters.
import uuid
import websocket

# streaming options
language = "en"
max_chunk_size = 3
vad_threshold = 0.5
disable_partial = "false"
audio_format = "pcm_16k"

# unique session id for this stream
session_id = uuid.uuid4()
ws = websocket.create_connection(f"wss://voice.neuralspace.ai/voice/stream/live/transcribe/{language}/{TOKEN}/{session_id}?max_chunk_size={max_chunk_size}&vad_threshold={vad_threshold}&disable_partial={disable_partial}&format={audio_format}")
Getting Audio from the Mic
Before sending audio data to the websocket, make sure the data is in the expected format. The API expects audio with a 16 kHz sampling rate, in 16-bit little-endian, linear PCM format. If your audio is not in the expected format, it needs to be resampled.
There are many libraries available for resampling audio. Some are listed below (see the librosa sketch after the list):
- librosa for Python (FFmpeg is a prerequisite)
- OfflineAudioContext for Node.js
- JAVE for Java
- AudioFlinger for resampling audio on Android
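As a minimal sketch, this is how a file could be resampled to the expected format with librosa in Python; the file name audio.wav and the float-to-int16 conversion are illustrative assumptions, not part of the VoiceAI API:
import librosa
import numpy as np

# load and resample to 16 kHz mono; librosa returns float32 samples in [-1, 1]
samples, _ = librosa.load("audio.wav", sr=16000, mono=True)
# convert to 16-bit little-endian linear PCM bytes, as the API expects
pcm_bytes = (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2").tobytes()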
Running the Python snippet below starts capturing audio from the microphone of the device.
from queue import Queue
import pyaudio
import threading
q = Queue()
pa = pyaudio.PyAudio()
def listen(in_data, frame_count, time_info, status):
    # push each captured chunk onto the queue; the sender thread will pick it up
    q.put(in_data)
    print(len(in_data))  # size of each captured chunk, useful for debugging
    return (None, pyaudio.paContinue)

stream = pa.open(
    rate=16000,              # 16 kHz sampling rate, as expected by the API
    channels=1,              # mono
    format=pyaudio.paInt16,  # 16-bit linear PCM
    frames_per_buffer=4096,
    input=True,
    output=False,
    stream_callback=listen
)
Send Audio Asynchronously
To send the audio from the microphone asynchronously through websocket, run the Python snippet below:
def send_audio(q, ws):
    # forward each audio chunk from the queue to the websocket as binary data
    while True:
        data = q.get()
        ws.send_binary(data)

# run the sender in a daemon thread so the process can exit cleanly on Ctrl+C
t = threading.Thread(target=send_audio, args=(q, ws), daemon=True)
t.start()
print("Listening and sending audio data through websocket.")
Receiving Results
To start receiving results from the API, run the snippet below:
import json
try:
while True:
resp = ws.recv()
resp = json.loads(resp)
text = resp['text']
print(text)
except KeyboardInterrupt as e:
print("Closing stream and websocket connection.")
stream.close()
ws.close()
The above prints the text received from the VoiceAI API in real time. To stop, press ctrl+c on the keyboard.
Parsing Results
You will get a result as json from the connected websocket whenever audio is processed in the backend. It will look something like the following:
{'full': True, 'stream_id': '...', 'chunk_id': 1, 'start_time': 0.1, 'end_time': 1.9, 'send_time': 1693821879.270821, 'text': '...'}
AND
{'full': False, 'stream_id': '...', 'chunk_id': 1, 'start_time': 0.1, 'end_time': 1.9, 'send_time': 1693821879.270821, 'text': '...'}
Results are either partial or full, which can be identified by the full key in the json. Partial results are transcriptions of the audio produced after every 1-second chunk. As soon as silence is detected, the full transcript of the segment is returned; this is denoted by the full key being True in the result.
In general, partial results will not be as accurate as full results. For a full result the model processes a whole segment of audio, which gives it greater context. This, in turn, comes with the trade-off of higher latency for full results compared to partial results.
- If full is True, it is a full result; if it is False, it is a partial result.
- chunk_id can be used to determine the order of results as well as to align the audio chunk with the text response.
- start_time and end_time can also be used to determine the audio frames for the returned text.
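Putting this together, a small handler like the sketch below could be used inside the receive loop to label partial and full results; the function name handle_result and the print format are illustrative choices, not part of the API:
import json

def handle_result(raw_message):
    # parse one JSON message from the websocket
    result = json.loads(raw_message)
    kind = "full" if result["full"] else "partial"
    # chunk_id, start_time and end_time tie the text back to the audio
    print(f"[{kind}] chunk {result['chunk_id']} "
          f"({result['start_time']}s - {result['end_time']}s): {result['text']}")

# example usage inside the receive loop:
# handle_result(ws.recv())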
Full Code Snippet
Copy the following code snippet into a file called streaming.py
and execute it:
import requests
import os
import websocket
# list languages
BASE = "https://voice.neuralspace.ai/api/v2/languages?type=stream"
API_KEY = os.environ.get("NS_API_KEY")
assert API_KEY is not None, "set the NS_API_KEY environment variable"
headers = {
"Authorization": API_KEY
}
response = requests.get(BASE, headers=headers)
assert response.status_code == 200
print(f"languages available are : {response.json()['data']['languages']}")
# short-lived token
DURATION = 600
TOKEN_URL = f"https://voice.neuralspace.ai/api/v2/token?duration={DURATION}"
response = requests.get(TOKEN_URL, headers=headers)
assert response.status_code == 200
TOKEN = response.json()['data']['token']
# create websocket connection
language = "en"
max_chunk_size = 3
vad_threshold = 0.5
disable_partial = "false"
audio_format = "pcm_16k"
import uuid
session_id = uuid.uuid4()
ws = websocket.create_connection(f"wss://voice.neuralspace.ai/voice/stream/live/transcribe/{language}/{TOKEN}/{session_id}?max_chunk_size={max_chunk_size}&vad_threshold={vad_threshold}&disable_partial={disable_partial}&format={audio_format}")
# get audio from microphone
from queue import Queue
import pyaudio
import threading
q = Queue()
pa = pyaudio.PyAudio()
def listen(in_data, frame_count, time_info, status):
q.put(in_data)
return (None, pyaudio.paContinue)
stream = pa.open(
rate=16000,
channels=1,
format=pyaudio.paInt16,
frames_per_buffer=4096,
input=True,
output=False,
stream_callback=listen
)
# send audio asynchronously using a daemon thread
def send_audio(q, ws):
    # forward each audio chunk from the queue to the websocket as binary data
    while True:
        data = q.get()
        ws.send_binary(data)

t = threading.Thread(target=send_audio, args=(q, ws), daemon=True)
t.start()
print("Listening and sending audio data through websocket.")
# receive results
import json
try:
while True:
resp = ws.recv()
resp = json.loads(resp)
print(resp)
except KeyboardInterrupt as e:
print("Closing stream and websocket connection.")
stream.close()
ws.close()
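Assuming your API key is exported in the NS_API_KEY environment variable (as the script expects), you can run it as follows; replace the placeholder with your actual key:
export NS_API_KEY="your-api-key"
python streaming.py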
Troubleshooting and FAQ
"Am I audible"? Check out our FAQ page. If you still need help, feel free to reach out to us directly at support@neuralspace.ai or join our Slack community.