
Speaker Diarization

Speaker diarization is the process of distinguishing and segmenting audio according to the different speakers present. In other words, it aims to answer the question: "Who spoke when?" in an audio recording. This task is important in various applications such as call analytics, meeting summarization, and audio indexing.

For instance, consider a meeting recording with multiple participants. Speaker diarization would segment the audio stream, indicating the points in time when each participant starts and stops speaking. This segmented information can then be used to produce a transcription that attributes each segment of speech to the appropriate speaker.

File Transcription Job with Speaker Diarization

Copy and paste the curl request below into your terminal to start a transcription using the API. Fill in the variables with the appropriate values, as described in the overview.

curl --location 'https://voice.neuralspace.ai/api/v2/jobs' \
--header 'Authorization: {{API_KEY}}' \
--form 'files=@"{{LOCAL_AUDIO_FILE_PATH}}"' \
--form 'config="{\"file_transcription\":{\"language_id\":\"{{LANG}}\", \"mode\":\"{{MODE}}\"},
\"speaker_diarization\":{\"mode\": {{MODE}}, \"num_speakers\": {{NUM_SPEAKERS}}, \"overrides\": {\"clustering\": {\"threshold\": {{CLUSTERING_THRESHOLD}}}}}}"'

The mode, num_speakers, and overrides parameters inside speaker_diarization can be set while sending the transcription request. Note that all of these parameters are optional.
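
For reference, the same request can be sent from Python using the requests library. This is a minimal, unofficial sketch: the endpoint and form fields mirror the curl example above, and the placeholder values must be filled in just as before.

import json
import requests

API_KEY = "{{API_KEY}}"

# Job config mirroring the curl example above; every field under
# speaker_diarization is optional.
config = {
    "file_transcription": {"language_id": "{{LANG}}", "mode": "{{MODE}}"},
    "speaker_diarization": {
        "mode": "speakers",                             # or "channels"
        "num_speakers": 2,                              # omit if unknown
        "overrides": {"clustering": {"threshold": 5}},  # ignored when num_speakers is set
    },
}

with open("{{LOCAL_AUDIO_FILE_PATH}}", "rb") as audio:
    response = requests.post(
        "https://voice.neuralspace.ai/api/v2/jobs",
        headers={"Authorization": API_KEY},
        files={"files": audio},
        data={"config": json.dumps(config)},
    )

job_id = response.json()["data"]["jobId"]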

A successful request returns a jobId that you will use to fetch the results:

{
  "success": true,
  "message": "Job created successfully",
  "data": {
    "jobId": "281f8662-cdc3-4c76-82d0-e7d14af52c46"
  }
}
| Speaker Diarization Parameter | Required | Default | All Options | Description |
| --- | --- | --- | --- | --- |
| mode | No | speakers | speakers or channels | Setting the mode to speakers identifies speakers automatically, while setting it to channels identifies speakers on the basis of channels when multi-channel audio is provided and there is a different speaker in each channel. |
| num_speakers | No | None | Any integer value > 0 | Can be provided if you already know the number of speakers in the audio, to increase accuracy. threshold is ignored if this parameter is provided. |
| threshold | No | 0 | Any integer value between -20 and 20 | Set this to tweak the sensitivity of speaker diarization; it is passed as overrides.clustering.threshold, as in the request above. If there are a lot of speakers, a higher value is recommended. |
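
To make the trade-off concrete, here are two illustrative speaker_diarization blocks (values are examples, not recommendations): one for when the speaker count is known, and one that tunes the clustering threshold instead.

# Known speaker count: pass num_speakers; the clustering threshold is ignored.
diarization_known = {"mode": "speakers", "num_speakers": 3}

# Unknown speaker count: tune sensitivity via the clustering threshold.
# Higher values are recommended for recordings with many speakers.
diarization_tuned = {
    "mode": "speakers",
    "overrides": {"clustering": {"threshold": 10}},
}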
Multichannel support

VoiceAI has built-in support for audio with more than one channel, where each channel carries a different speaker. This is typically useful for call centre recordings, where the left channel holds the conversation from the agent side and the right channel holds the conversation from the customer side. If your audio is structured this way, simply set the mode in the speaker_diarization config to channels. VoiceAI will automatically map each channel to its speaker and return transcripts accordingly.
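
In the Python sketch above, this amounts to replacing the speaker_diarization block of the config:

# Channel-based diarization: one speaker per channel (e.g. agent vs. customer).
config["speaker_diarization"] = {"mode": "channels"}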

Fetch Transcription and Speaker Diarization Results

Pass the jobId (returned in the response to the transcription request) to the API below to fetch the status and results of the job.

curl --location 'https://voice.neuralspace.ai/api/v2/jobs/{{jobId}}' \
--header 'Authorization: {{API_KEY}}'
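
From code, a simple approach is to poll this endpoint until the result appears. The sketch below assumes that data contains a result key only once the job has finished, which matches the responses shown below; adapt the check if your integration tracks job status differently.

import time
import requests

def wait_for_result(job_id: str, api_key: str, interval: float = 5.0) -> dict:
    """Poll the job endpoint until the transcription result is available."""
    url = f"https://voice.neuralspace.ai/api/v2/jobs/{job_id}"
    while True:
        data = requests.get(url, headers={"Authorization": api_key}).json()["data"]
        if "result" in data:  # assumption: present only once the job finishes
            return data["result"]
        time.sleep(interval)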

Once the job has completed, the response looks as follows:

{
  ...
  "data": {
    ...
    "result": {
      "transcription": {
        "segments": [
          {
            "startTime": 6.6909375,
            "endTime": 10.302187500000002,
            "text": "We've been at this for hours now. Have you found anything useful in any of those books?",
            "speaker": "SPEAKER_02",
            "channel": 0
          },
          {
            "startTime": 10.690312500000001,
            "endTime": 14.588437500000001,
            "text": "Not a single thing, Lewis. I'm sure that there must be something in this library.",
            "speaker": "SPEAKER_01",
            "channel": 0
          },
          ...
        ]
      }
      ...
    }
  }
}
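
To turn these segments into a readable, speaker-attributed transcript, iterate over them directly. A minimal sketch, reusing wait_for_result from the polling example above:

result = wait_for_result(job_id, API_KEY)

# Print one line per diarized segment: speaker, time range, and text.
for seg in result["transcription"]["segments"]:
    print(f'{seg["speaker"]} [{seg["startTime"]:.2f}-{seg["endTime"]:.2f}s]: {seg["text"]}')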

When channels mode is enabled, the diarization result looks the same as above, but the response additionally contains a channel-wise transcript with an individual transcript per channel, as follows:

{
  "success": true,
  "message": "Data fetched successfully",
  "data": {
    ...
    "result": {
      "transcription": {
        "channels": {
          "0": {
            "transcript": "We've been at this for hours now. Have you found anything useful in any of those books? ...",
            "timestamps": [
              {
                "word": "We've",
                "start": 6.65,
                "end": 6.99,
                "conf": 0.8
              },
              {
                "word": "been",
                "start": 6.99,
                "end": 7.09,
                "conf": 0.99
              },
              ...
            ]
          },
          "1": {
            "transcript": "Not a single thing, Lewis. I'm sure that there must be something in this library. ...",
            "timestamps": [
              {
                "word": "Not",
                "start": 10.77,
                "end": 10.89,
                "conf": 1
              },
              {
                "word": "a",
                "start": 10.89,
                "end": 11.09,
                "conf": 1
              },
              ...
            ]
          }
        }
      }
    }
  }
}
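
The channel-wise output can be processed in the same way. A short sketch that prints each channel's transcript followed by its word-level timings (again assuming result holds the parsed data.result object):

for channel_id, channel in result["transcription"]["channels"].items():
    print(f'Channel {channel_id}: {channel["transcript"]}')
    for ts in channel["timestamps"]:
        print(f'  {ts["word"]}  {ts["start"]:.2f}-{ts["end"]:.2f}s  (conf {ts["conf"]:.2f})')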

info

When speaker diarization is enabled along with translation or sentiment analysis, you also get per-segment translations or sentiments. These segments are generated by speaker diarization, and the list of segments is the same across all of these features.

In the responses above, each segment is tagged with a speaker and the text that speaker said, along with the start and end times of their speech.

Troubleshooting and FAQ

Having an issue with speaker diarization? Check out our FAQ page. If you still need help, feel free to reach out to us directly at support@neuralspace.ai or join our Slack community.