Integration
Objective
In this article you will learn how to integrate real-time transcription using native Python code. The objective is to show how the different components of real-time transcription work and the concepts behind them, so that you can draw parallels when implementing it in your programming language of choice.
Requirements
To follow along with this guide, install the following Python packages: requests, pyaudio, and websocket-client.
Before installing PyAudio you might need portaudio installed.
- For Debian/Ubuntu Linux
apt install portaudio19-dev
- For macOS
brew install portaudio
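Once portaudio is available, the Python packages used in this guide (those imported in the snippets below) can be installed with pip:
pip install requests pyaudio websocket-client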
Concepts
Get List of Supported Languages
The first step is to verify that the language you want to transcribe in real time is supported by the VoiceAI platform. To check, you can fetch the supported languages using the following code:
import requests
import os
BASE = "https://voice.neuralspace.ai/api/v2/languages?type=stream"
API_KEY = os.environ.get("NS_API_KEY")
assert API_KEY is not None, "set the NS_API_KEY environment variable"
headers = {
"Authorization": API_KEY
}
response = requests.get(BASE, headers=headers)
assert response.status_code == 200
print(f"Languages supported: {response.json()['data']['languages']}")
For the purpose of this guide, all the examples will be based on the English (en) language.
Short-lived Token
The first step to start streaming with the NeuralSpace APIs is to get a short-lived token, i.e., a token that can be configured to expire after a certain amount of time.
DURATION = 600
TOKEN_URL = f"https://voice.neuralspace.ai/api/v2/token?duration={DURATION}"
response = requests.get(TOKEN_URL, headers=headers)
assert response.status_code == 200
TOKEN = response.json()['data']['token']
WebSockets
NeuralSpace uses WebSockets for real-time transcription. Using websockets involves three main steps:
- create a connection to the websocket
- send chunks of audio to the websocket
- receive the results in a different thread
In the given examples the Python client package used is websocket-client, which can be installed by running the following in a shell:
pip install websocket-client
- It is not necessary to receive data for every chunk of audio that is sent. In the backend your data is buffered, and only once certain conditions are met, e.g. at least 1 second of new audio has been collected, does the backend process the data and return a result.
- Hence, it is better to have separate threads for sending and receiving data over the websocket; otherwise your main thread can get stuck on a blocking call.
Setting the Options and Creating the WebSocket
You can configure various parameters, such as the language, as well as advanced parameters like chunk size, silence thresholds, etc.
The different parameters that can be set are listed below:
Parameter | Required | Default | Type | All options | Description |
---|---|---|---|---|---|
language | Yes | None | string | en , en-in , ar , ar-ms , ar-eg , ar-sa , hi , tl | Set audio language for real-time transcription. |
max_chunk_size | No | 5 | int | Any integer value >= min_chunk_size | Specifies the maximum size, in seconds, of the audio chunks that are processed for full results even if the speaker has not stopped. Larger chunks might provide better transcription accuracy but make you wait longer for full results. |
vad_threshold | No | 0.5 | float | Any positive floating point value | This threshold determines the sensitivity of the Voice Activity Detection (VAD) mechanism. VAD is used to detect the presence of speech in the audio stream. A lower value will make the detection more sensitive to noise, while a higher value might skip quieter speech parts. |
disable_partial | No | False | bool | True or False | When set to True, the system will not return partial transcriptions (i.e., intermediate transcriptions before the speech segment is completed). When set to False, you might receive multiple transcriptions for the same speech segment, with each one being more complete than the previous. |
format | No | pcm_16k | string | pcm_8k or pcm_16k | If the audio being streamed is 8 kHz, set the format to pcm_8k; if it is 16 kHz, set it to pcm_16k. If omitted, it defaults to pcm_16k. |
Below is the code snippet required to open a websocket with all of the above-mentioned parameters.
import uuid
import websocket

# streaming options
language = "en"
max_chunk_size = 3
vad_threshold = 0.5
disable_partial = "false"
audio_format = "pcm_16k"

# unique session id for this stream
session_id = uuid.uuid4()
ws = websocket.create_connection(f"wss://voice.neuralspace.ai/voice/stream/live/transcribe/{language}/{TOKEN}/{session_id}?max_chunk_size={max_chunk_size}&vad_threshold={vad_threshold}&disable_partial={disable_partial}&format={audio_format}")
Getting Audio from the Mic
Before sending audio data to the websocket, make sure the data is in the expected format. The API expects audio with a 16 kHz sampling rate, in 16-bit little-endian, linear PCM format. If your audio is not in the expected format, it needs to be resampled.
There are many libraries available for resampling audio. Some are listed below (see the librosa sketch after the list):
- librosa for Python (FFmpeg is a prerequisite)
- OfflineAudioContext for Node.js
- JAVE for Java
- AudioFlinger for resampling audio on Android
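As a minimal sketch, this is how a file could be resampled to the expected format with librosa in Python; the file name audio.wav and the float-to-int16 conversion are illustrative assumptions, not part of the VoiceAI API:
import librosa
import numpy as np

# load and resample to 16 kHz mono; librosa returns float32 samples in [-1, 1]
samples, _ = librosa.load("audio.wav", sr=16000, mono=True)
# convert to 16-bit little-endian linear PCM bytes, as the API expects
pcm_bytes = (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2").tobytes()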
Running the Python snippet below starts capturing audio from the microphone of the device.
from queue import Queue
import pyaudio
import threading
q = Queue()
pa = pyaudio.PyAudio()
def listen(in_data, frame_count, time_info, status):
    # push each captured chunk onto the queue; the sender thread will pick it up
    q.put(in_data)
    print(len(in_data))  # size of each captured chunk, useful for debugging
    return (None, pyaudio.paContinue)

stream = pa.open(
    rate=16000,              # 16 kHz sampling rate, as expected by the API
    channels=1,              # mono
    format=pyaudio.paInt16,  # 16-bit linear PCM
    frames_per_buffer=4096,
    input=True,
    output=False,
    stream_callback=listen
)
Send Audio Asynchronously
To send the audio from the microphone asynchronously through websocket, run the Python snippet below:
def send_audio(q, ws):
    # forward each audio chunk from the queue to the websocket as binary data
    while True:
        data = q.get()
        ws.send_binary(data)

# run the sender in a daemon thread so the process can exit cleanly on Ctrl+C
t = threading.Thread(target=send_audio, args=(q, ws), daemon=True)
t.start()
print("Listening and sending audio data through websocket.")
Receiving Results
To start receiving results from the API, run the snippet below:
import json
try:
while True:
resp = ws.recv()
resp = json.loads(resp)
text = resp['text']
print(text)
except KeyboardInterrupt as e:
print("Closing stream and websocket connection.")
stream.close()
ws.close()
The above prints the text received from the VoiceAI API in real time. To stop, press ctrl+c on the keyboard.
Parsing Results
You will get a result as json from the connected websocket whenever audio is processed in the backend. It will look something like the following:
{'full': True, 'stream_id': '...', 'chunk_id': 1, 'start_time': 0.1, 'end_time': 1.9, 'send_time': 1693821879.270821, 'text': '...'}
AND
{'full': False, 'stream_id': '...', 'chunk_id': 1, 'start_time': 0.1, 'end_time': 1.9, 'send_time': 1693821879.270821, 'text': '...'}
Results are either partial or full, which can be identified by the full key in the json. Partial results are transcriptions of the audio produced after every 1-second chunk. As soon as silence is detected, the full transcript of the segment is returned; this is denoted by the full key being True in the result.
In general, partial results will not be as accurate as full results. For a full result the model processes a whole segment of audio, which gives it greater context. This, in turn, comes with the trade-off of higher latency for full results compared to partial results.
- If full is True, it is a full result; if it is False, it is a partial result.
- chunk_id can be used to determine the order of results as well as to align the audio chunk with the text response.
- start_time and end_time can also be used to determine the audio frames for the returned text.
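Putting this together, a small handler like the sketch below could be used inside the receive loop to label partial and full results; the function name handle_result and the print format are illustrative choices, not part of the API:
import json

def handle_result(raw_message):
    # parse one JSON message from the websocket
    result = json.loads(raw_message)
    kind = "full" if result["full"] else "partial"
    # chunk_id, start_time and end_time tie the text back to the audio
    print(f"[{kind}] chunk {result['chunk_id']} "
          f"({result['start_time']}s - {result['end_time']}s): {result['text']}")

# example usage inside the receive loop:
# handle_result(ws.recv())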
Full Code Snippet
Copy the following code snippet into a file called streaming.py
and execute it:
import requests
import os
import websocket
# list languages
BASE = "https://voice.neuralspace.ai/api/v2/languages?type=stream"
API_KEY = os.environ.get("NS_API_KEY")
assert API_KEY is not None, "set the NS_API_KEY environment variable"
headers = {
"Authorization": API_KEY
}
response = requests.get(BASE, headers=headers)
assert response.status_code == 200
print(f"languages available are : {response.json()['data']['languages']}")
# short-lived token
DURATION = 600
TOKEN_URL = f"https://voice.neuralspace.ai/api/v2/token?duration={DURATION}"
response = requests.get(TOKEN_URL, headers=headers)
assert response.status_code == 200
TOKEN = response.json()['data']['token']
# create websocket connection
language = "en"
max_chunk_size = 3
vad_threshold = 0.5
disable_partial = "false"
audio_format = "pcm_16k"
import uuid
session_id = uuid.uuid4()
ws = websocket.create_connection(f"wss://voice.neuralspace.ai/voice/stream/live/transcribe/{language}/{TOKEN}/{session_id}?max_chunk_size={max_chunk_size}&vad_threshold={vad_threshold}&disable_partial={disable_partial}&format={audio_format}")
# get audio from microphone
from queue import Queue
import pyaudio
import threading
q = Queue()
pa = pyaudio.PyAudio()
def listen(in_data, frame_count, time_info, status):
q.put(in_data)
return (None, pyaudio.paContinue)
stream = pa.open(
rate=16000,
channels=1,
format=pyaudio.paInt16,
frames_per_buffer=4096,
input=True,
output=False,
stream_callback=listen
)
# send audio asynchronously using a daemon thread
def send_audio(q, ws):
    # forward each audio chunk from the queue to the websocket as binary data
    while True:
        data = q.get()
        ws.send_binary(data)

t = threading.Thread(target=send_audio, args=(q, ws), daemon=True)
t.start()
print("Listening and sending audio data through websocket.")
# receive results
import json
try:
while True:
resp = ws.recv()
resp = json.loads(resp)
print(resp)
except KeyboardInterrupt as e:
print("Closing stream and websocket connection.")
stream.close()
ws.close()
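Assuming your API key is exported in the NS_API_KEY environment variable (as the script expects), you can run it as follows; replace the placeholder with your actual key:
export NS_API_KEY="your-api-key"
python streaming.py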
Troubleshooting and FAQ
"Am I audible"? Check out our FAQ page. If you still need help, feel free to reach out to us directly at support@neuralspace.ai or join our Slack community.