Integration

Objective

In this article you will learn how to integrate real-time transcription using native Python code. The objective is to show how the different components of real-time transcription work and the concepts behind them, so that you can draw parallels when implementing it in your programming language of choice.

Requirements

To follow along with this guide, install the following Python packages: requests, pyaudio, and websocket-client.
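
All three packages are available on PyPI and can be installed with pip:

pip install requests pyaudio websocket-client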

info

Before installing PyAudio, you might need PortAudio installed.

  • For Debian/Ubuntu Linux
apt install portaudio19-dev
  • For macOS
brew install portaudio

Concepts

Get List of Supported Languages

The first step is to verify that the language you want to transcribe in real time is supported by the VoiceAI platform. You can fetch the supported languages using the following code:

import requests
import os

BASE = "https://voice.neuralspace.ai/api/v2/languages?type=stream"
API_KEY = os.environ["NS_API_KEY"]
assert API_KEY is not None

# The API key is passed in the Authorization header
headers = {
    "Authorization": API_KEY
}
response = requests.get(BASE, headers=headers)

assert response.status_code == 200

print(f"Languages supported: {response.json()['data']['languages']}")

For the purpose of this guide, all the examples will be based on the English (en) language.
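
As a quick sanity check (a minimal sketch, reusing the response from the request above and assuming the languages field is a plain list of language codes), you can verify that en is actually available before proceeding:

supported = response.json()['data']['languages']
assert "en" in supported, "en is not listed as a supported streaming language"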

Short-lived Token

The first step to start streaming with the NeuralSpace APIs is to get a short-lived token, i.e., a token that can be configured to expire after a certain amount of time.

DURATION = 600
TOKEN_URL = f"https://voice.neuralspace.ai/api/v2/token?duration={DURATION}"

response = requests.get(TOKEN_URL, headers=headers)

assert response.status_code == 200

TOKEN = response.json()['data']['token']

WebSockets

NeuralSpace uses WebSockets for real-time transcription. Using them involves three main steps:

  1. create a connection to the websocket
  2. send chunks of audio to the websocket
  3. receive the results in a different thread

The examples in this guide use the websocket-client package, which can be installed by running the following in a shell:

pip install websocket-client
info
  • The API does not return a result for every chunk of audio that is sent. The backend buffers your data, and once certain conditions are met, e.g. at least 1 second of new audio has been collected, it processes the buffered data and returns the result.
  • Hence, it is better to use separate threads for sending and receiving data over the websocket (as shown in the sections below); otherwise your main thread can get stuck on a blocking call.

Setting the Options and Creating the WebSocket

You can configure various parameters like the language, as well as advanced parameters like chunk size, silence thresholds, etc.

The different parameters that can be set are listed below:

  • language (string, required, no default). Options: en, en-in, ar, ar-ms, ar-eg, ar-sa, hi, tl. Sets the audio language for real-time transcription.
  • max_chunk_size (int, optional, default 5). Options: any integer value >= min_chunk_size. Specifies the maximum size in seconds of the audio chunks that are processed for full results, even if the speaker has not stopped. Larger chunks might provide better transcription accuracy but take longer before a full result is returned.
  • vad_threshold (float, optional, default 0.5). Options: any positive floating point value. Determines the sensitivity of the Voice Activity Detection (VAD) mechanism, which detects the presence of speech in the audio stream. A lower value makes the detection more sensitive to noise, while a higher value might skip quieter speech parts.
  • disable_partial (bool, optional, default False). Options: True or False. When set to True, the system will not return partial transcriptions (i.e., intermediate transcriptions before the speech segment is completed). When set to False, you might receive multiple transcriptions for the same speech segment, each more complete than the previous.
  • format (string, optional, default pcm_16k). Options: pcm_8k or pcm_16k. If the audio being streamed is 8 kHz, the format needs to be set to pcm_8k; if it is 16 kHz, it needs to be set to pcm_16k. It can be omitted, in which case the default is used.

Below is the code snippet required to open a websocket with all of the above-mentioned parameters.

import uuid

import websocket

language = "en"

max_chunk_size = 3
vad_threshold = 0.5
disable_partial = "false"
audio_format = "pcm_16k"

# A unique session identifier for this stream
session_id = uuid.uuid4()
ws = websocket.create_connection(
    f"wss://voice.neuralspace.ai/voice/stream/live/transcribe/{language}/{TOKEN}/{session_id}"
    f"?max_chunk_size={max_chunk_size}&vad_threshold={vad_threshold}"
    f"&disable_partial={disable_partial}&format={audio_format}"
)
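
If you prefer not to assemble the query string by hand, here is a minimal alternative sketch that builds the same URL using the standard library's urllib.parse.urlencode (an illustration, not required by the API):

from urllib.parse import urlencode

# Build the query string from the parameters defined above
params = urlencode({
    "max_chunk_size": max_chunk_size,
    "vad_threshold": vad_threshold,
    "disable_partial": disable_partial,
    "format": audio_format,
})
url = f"wss://voice.neuralspace.ai/voice/stream/live/transcribe/{language}/{TOKEN}/{session_id}?{params}"
ws = websocket.create_connection(url)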

Getting Audio from the Mic

Before sending audio data to the websocket, make sure it is in the expected format. The API expects audio with a 16 kHz sampling rate, in 16-bit little-endian, linear PCM format. If your audio is not in this format, it needs to be resampled.

For resampling audio, many libraries are available, for example librosa, torchaudio, or scipy.signal. A minimal resampling sketch is shown below.
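
The sketch below is illustrative only: it assumes the soundfile, numpy, and scipy packages and a placeholder file name input.wav, none of which are required elsewhere in this guide. It converts a WAV file to 16 kHz, 16-bit little-endian PCM bytes:

import numpy as np
import soundfile as sf
from scipy.signal import resample_poly

TARGET_RATE = 16000

# Read the file as floating point samples in [-1.0, 1.0]
data, source_rate = sf.read("input.wav", dtype="float32")

# Mix down to mono if the file has multiple channels
if data.ndim > 1:
    data = data.mean(axis=1)

# Resample to 16 kHz if needed
if source_rate != TARGET_RATE:
    data = resample_poly(data, TARGET_RATE, source_rate)

# Convert to 16-bit little-endian linear PCM bytes
pcm_16k = (np.clip(data, -1.0, 1.0) * 32767).astype("<i2").tobytes()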

Running the Python snippet below starts capturing audio from the device's microphone.

from queue import Queue
import pyaudio
import threading

q = Queue()
pa = pyaudio.PyAudio()

def listen(in_data, frame_count, time_info, status):
    # Callback invoked by PyAudio for every captured buffer: push it onto the queue
    q.put(in_data)
    print(len(in_data))  # debug output: size of each captured chunk in bytes
    return (None, pyaudio.paContinue)

stream = pa.open(
    rate=16000,                 # 16 kHz sampling rate, matching pcm_16k
    channels=1,                 # mono audio
    format=pyaudio.paInt16,     # 16-bit linear PCM
    frames_per_buffer=4096,
    input=True,
    output=False,
    stream_callback=listen,
)

Send Audio Asynchronously

To send the audio from the microphone asynchronously through the websocket, run the Python snippet below:

def send_audio(q, ws):
    # Forward every captured audio chunk to the websocket as binary data
    while True:
        data = q.get()
        ws.send_binary(data)

# Daemon thread so it does not keep the process alive after the main thread exits
t = threading.Thread(target=send_audio, args=(q, ws), daemon=True)
t.start()
print("Listening and sending audio data through websocket.")

Receiving Results

To start receiving results from the API, run the snippet below:

import json

try:
    while True:
        resp = ws.recv()
        resp = json.loads(resp)
        text = resp['text']
        print(text)
except KeyboardInterrupt:
    print("Closing stream and websocket connection.")
    stream.close()
    ws.close()

The above prints the text received from the VoiceAI API in real time. To stop, press Ctrl+C on the keyboard.

Parsing Results

You will receive a JSON result from the connected websocket whenever audio is processed in the backend. It looks something like the following:

{'full': True, 'stream_id': '...', 'chunk_id': 1, 'start_time': 0.1, 'end_time': 1.9, 'send_time': 1693821879.270821, 'text': '...'}
or
{'full': False, 'stream_id': '...', 'chunk_id': 1, 'start_time': 0.1, 'end_time': 1.9, 'send_time': 1693821879.270821, 'text': '...'}

Results are either partial or full, which can be identified by the full key in the JSON. Partial results are transcriptions of the audio after every 1-second chunk. As soon as silence is detected, the full transcript of the segment is returned, denoted by the full key being True.

In practice, partial results will not be as accurate as full results, because for full results the model processes a whole segment of audio, which gives it more context. The trade-off is higher latency for full results compared to partial ones.

info
  • If full is True, it is a full result; if it is False, it is a partial result (see the sketch after this list).
  • chunk_id can be used to determine the order of results as well as to align each audio chunk with its text response.
  • start_time and end_time can also be used to determine the audio frames corresponding to the returned text.
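
As a minimal sketch (reusing the ws connection from above and only the fields documented in this section; how you display or store results is up to you), the receive loop could distinguish partial and full results like this:

import json

while True:
    resp = json.loads(ws.recv())
    if resp['full']:
        # Final transcript for the segment: keep it
        print(f"[full #{resp['chunk_id']}] {resp['start_time']:.1f}-{resp['end_time']:.1f}s: {resp['text']}")
    else:
        # Intermediate transcript: will be superseded by a later, more complete one
        print(f"[partial #{resp['chunk_id']}] {resp['text']}")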

Full Code Snippet

Copy the following code snippet into a file called streaming.py and execute it:


import os
import uuid
import json
import threading
from queue import Queue

import requests
import websocket
import pyaudio

# list languages
BASE = "https://voice.neuralspace.ai/api/v2/languages?type=stream"
API_KEY = os.environ["NS_API_KEY"]
assert API_KEY is not None

headers = {
    "Authorization": API_KEY
}
response = requests.get(BASE, headers=headers)

assert response.status_code == 200

print(f"Languages available: {response.json()['data']['languages']}")

# short-lived token
DURATION = 600
TOKEN_URL = f"https://voice.neuralspace.ai/api/v2/token?duration={DURATION}"

response = requests.get(TOKEN_URL, headers=headers)

assert response.status_code == 200

TOKEN = response.json()['data']['token']

# create websocket connection
language = "en"

max_chunk_size = 3
vad_threshold = 0.5
disable_partial = "false"
audio_format = "pcm_16k"

session_id = uuid.uuid4()
ws = websocket.create_connection(
    f"wss://voice.neuralspace.ai/voice/stream/live/transcribe/{language}/{TOKEN}/{session_id}"
    f"?max_chunk_size={max_chunk_size}&vad_threshold={vad_threshold}"
    f"&disable_partial={disable_partial}&format={audio_format}"
)

# get audio from the microphone
q = Queue()
pa = pyaudio.PyAudio()

def listen(in_data, frame_count, time_info, status):
    # PyAudio callback: push every captured buffer onto the queue
    q.put(in_data)
    return (None, pyaudio.paContinue)

stream = pa.open(
    rate=16000,
    channels=1,
    format=pyaudio.paInt16,
    frames_per_buffer=4096,
    input=True,
    output=False,
    stream_callback=listen,
)

# send audio asynchronously using threading
def send_audio(q, ws):
    # Forward every captured audio chunk to the websocket as binary data
    while True:
        data = q.get()
        ws.send_binary(data)

# Daemon thread so it stops when the main thread exits
t = threading.Thread(target=send_audio, args=(q, ws), daemon=True)
t.start()
print("Listening and sending audio data through websocket.")

# receive results
try:
    while True:
        resp = ws.recv()
        resp = json.loads(resp)
        print(resp)
except KeyboardInterrupt:
    print("Closing stream and websocket connection.")
    stream.close()
    ws.close()
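
Assuming your NeuralSpace API key is exported in the NS_API_KEY environment variable (as the script expects), you can run it with:

export NS_API_KEY="your-api-key"
python streaming.py

Stop the script with Ctrl+C; this closes both the audio stream and the websocket connection.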

Troubleshooting and FAQ

"Am I audible"? Check out our FAQ page. If you still need help, feel free to reach out to us directly at support@neuralspace.ai or join our Slack community.