
Speaker Diarization

Speaker diarization is the process of distinguishing and segmenting audio according to the different speakers present. In other words, it aims to answer the question: "Who spoke when?" in an audio recording. This task is important in various applications such as call analytics, meeting summarization, and audio indexing.

For instance, consider a meeting recording with multiple participants. Speaker diarization would segment the audio stream, indicating the points in time when each participant starts and stops speaking. This segmented information can then be used to produce a transcription that attributes each segment of speech to the appropriate speaker.

File Transcription Job with Speaker Diarization

Copy and paste the curl request below into your terminal to start a transcription using the API. Fill in the variables with the appropriate values, as described in the overview.

curl --location 'https://voice.neuralspace.ai/api/v2/jobs' \
--header 'Authorization: {{API_KEY}}' \
--form 'files=@"{{LOCAL_AUDIO_FILE_PATH}}"' \
--form 'config="{\"file_transcription\":{\"language_id\":\"{{LANG}}\", \"mode\":\"{{MODE}}\"},
\"speaker_diarization\":{\"mode\": {{MODE}}, \"num_speakers\": {{NUM_SPEAKERS}}, \"overrides\": {\"clustering\": {\"threshold\": {{CLUSTERING_THRESHOLD}}}}}}"'

The mode, num_speakers, and overrides parameters inside speaker_diarization can be set while sending the transcription request. Note that all of these parameters are optional.
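
For reference, the same request can be sent from Python using the requests library. This is a minimal, unofficial sketch: the endpoint and form fields mirror the curl example above, and the placeholder values must be filled in just as before.

import json
import requests

API_KEY = "{{API_KEY}}"

# Job config mirroring the curl example above; every field under
# speaker_diarization is optional.
config = {
    "file_transcription": {"language_id": "{{LANG}}", "mode": "{{MODE}}"},
    "speaker_diarization": {
        "mode": "speakers",                             # or "channels"
        "num_speakers": 2,                              # omit if unknown
        "overrides": {"clustering": {"threshold": 5}},  # ignored when num_speakers is set
    },
}

with open("{{LOCAL_AUDIO_FILE_PATH}}", "rb") as audio:
    response = requests.post(
        "https://voice.neuralspace.ai/api/v2/jobs",
        headers={"Authorization": API_KEY},
        files={"files": audio},
        data={"config": json.dumps(config)},
    )

job_id = response.json()["data"]["jobId"]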

A successful request returns a jobId that you will use to fetch the results:

{
  "success": true,
  "message": "Job created successfully",
  "data": {
    "jobId": "281f8662-cdc3-4c76-82d0-e7d14af52c46"
  }
}
| Speaker Diarization Parameter | Required | Default | All Options | Description |
| --- | --- | --- | --- | --- |
| mode | No | speakers | speakers or channels | Setting the mode to speakers identifies speakers automatically, while setting it to channels identifies speakers on the basis of channels when multi-channel audio is provided and there is a different speaker in each channel. |
| num_speakers | No | None | Any integer value > 0 | Can be provided if you already know the number of speakers in the audio, to increase accuracy. threshold is ignored if this parameter is provided. |
| threshold | No | 0 | Any integer value between -20 and 20 | Set this to tweak the sensitivity of speaker diarization; it is passed as overrides.clustering.threshold, as in the request above. If there are a lot of speakers, a higher value is recommended. |
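
To make the trade-off concrete, here are two illustrative speaker_diarization blocks (values are examples, not recommendations): one for when the speaker count is known, and one that tunes the clustering threshold instead.

# Known speaker count: pass num_speakers; the clustering threshold is ignored.
diarization_known = {"mode": "speakers", "num_speakers": 3}

# Unknown speaker count: tune sensitivity via the clustering threshold.
# Higher values are recommended for recordings with many speakers.
diarization_tuned = {
    "mode": "speakers",
    "overrides": {"clustering": {"threshold": 10}},
}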
Multichannel support

VoiceAI has built-in support for audio with more than one channel, where each channel carries a different speaker. This is typically useful for call centre recordings, where the left channel holds the conversation from the agent side and the right channel holds the conversation from the customer side. If your audio is structured this way, simply set the mode in the speaker_diarization config to channels. VoiceAI will automatically map each channel to its speaker and return transcripts accordingly.
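
In the Python sketch above, this amounts to replacing the speaker_diarization block of the config:

# Channel-based diarization: one speaker per channel (e.g. agent vs. customer).
config["speaker_diarization"] = {"mode": "channels"}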

Fetch Transcription and Speaker Diarization Results

Pass the jobId (returned in the response to the transcription request) to the API below to fetch the status and results of the job.

curl --location 'https://voice.neuralspace.ai/api/v2/jobs/{{jobId}}' \
--header 'Authorization: {{API_KEY}}'
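
From code, a simple approach is to poll this endpoint until the result appears. The sketch below assumes that data contains a result key only once the job has finished, which matches the responses shown below; adapt the check if your integration tracks job status differently.

import time
import requests

def wait_for_result(job_id: str, api_key: str, interval: float = 5.0) -> dict:
    """Poll the job endpoint until the transcription result is available."""
    url = f"https://voice.neuralspace.ai/api/v2/jobs/{job_id}"
    while True:
        data = requests.get(url, headers={"Authorization": api_key}).json()["data"]
        if "result" in data:  # assumption: present only once the job finishes
            return data["result"]
        time.sleep(interval)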

Once the job has completed, the response looks as follows:

{
  ...
  "data": {
    ...
    "result": {
      "transcription": {
        "segments": [
          {
            "startTime": 6.6909375,
            "endTime": 10.302187500000002,
            "text": "We've been at this for hours now. Have you found anything useful in any of those books?",
            "speaker": "SPEAKER_02",
            "channel": 0
          },
          {
            "startTime": 10.690312500000001,
            "endTime": 14.588437500000001,
            "text": "Not a single thing, Lewis. I'm sure that there must be something in this library.",
            "speaker": "SPEAKER_01",
            "channel": 0
          },
          ...
        ]
      }
      ...
    }
  }
}
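
To turn these segments into a readable, speaker-attributed transcript, iterate over them directly. A minimal sketch, reusing wait_for_result from the polling example above:

result = wait_for_result(job_id, API_KEY)

# Print one line per diarized segment: speaker, time range, and text.
for seg in result["transcription"]["segments"]:
    print(f'{seg["speaker"]} [{seg["startTime"]:.2f}-{seg["endTime"]:.2f}s]: {seg["text"]}')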

When channels mode is enabled, the diarization result looks the same as above, but the response additionally contains a channel-wise transcript with an individual transcript per channel, as follows:

{
  "success": true,
  "message": "Data fetched successfully",
  "data": {
    ...
    "result": {
      "transcription": {
        "channels": {
          "0": {
            "transcript": "We've been at this for hours now. Have you found anything useful in any of those books? ...",
            "timestamps": [
              {
                "word": "We've",
                "start": 6.65,
                "end": 6.99,
                "conf": 0.8
              },
              {
                "word": "been",
                "start": 6.99,
                "end": 7.09,
                "conf": 0.99
              },
              ...
            ]
          },
          "1": {
            "transcript": "Not a single thing, Lewis. I'm sure that there must be something in this library. ...",
            "timestamps": [
              {
                "word": "Not",
                "start": 10.77,
                "end": 10.89,
                "conf": 1
              },
              {
                "word": "a",
                "start": 10.89,
                "end": 11.09,
                "conf": 1
              },
              ...
            ]
          }
        }
      }
    }
  }
}
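
The channel-wise output can be processed in the same way. A short sketch that prints each channel's transcript followed by its word-level timings (again assuming result holds the parsed data.result object):

for channel_id, channel in result["transcription"]["channels"].items():
    print(f'Channel {channel_id}: {channel["transcript"]}')
    for ts in channel["timestamps"]:
        print(f'  {ts["word"]}  {ts["start"]:.2f}-{ts["end"]:.2f}s  (conf {ts["conf"]:.2f})')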

info

When speaker diarization is enabled along with translation or sentiment analysis, you also get per-segment translations or sentiments. These segments are generated by speaker diarization, and the list of segments is the same across all of these features.

In the responses above, each segment is tagged with a speaker and the text that speaker said, along with the start and end times of their speech.

Troubleshooting and FAQ

Having an issue with speaker diarization? Check out our FAQ page. If you still need help, feel free to reach out to us directly at support@neuralspace.ai or join our Slack community.