Speaker Diarization

What is speaker diarization?

Speaker diarization is the process of distinguishing and segmenting audio according to the different speakers present. In other words, it aims to answer the question: "Who spoke when?" in an audio recording. This task is important in various applications such as call analytics, meeting summarization, and audio indexing.

For instance, consider a meeting recording with multiple participants. Speaker diarization would segment the audio stream, indicating the points in time when each participant starts and stops speaking. This segmented information can then be used to produce a transcription that attributes each segment of speech to the appropriate speaker.

File Transcription Job with Speaker Diarization

Copy and paste the curl request below into your terminal to start a transcription via the API. Fill in the variables with the appropriate values, as described in the overview.

curl --location 'https://voice.neuralspace.ai/api/v1/jobs' \
--header 'Authorization: {{API_KEY}}' \
--form 'files=@"{{LOCAL_AUDIO_FILE_PATH}}"' \
--form 'config="{\"file_transcription\":{\"language_id\":\"{{LANG}}\",\"mode\":\"{{MODE}}\"},
\"speaker_diarization\":{\"num_speakers\":{{NUM_SPEAKERS}},\"overrides\":{\"clustering\":{\"threshold\":{{CLUSTERING_THRESHOLD}}}}}}"'
{
  "success": true,
  "message": "Job created successfully",
  "data": {
    "jobId": "281f8662-cdc3-4c76-82d0-e7d14af52c46"
  }
}
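
If you are scripting against the API, you can capture the jobId from this response for the follow-up request. A minimal sketch, assuming jq is available on your system:

# Same request as above, with the JSON response piped into jq to extract the jobId
JOB_ID=$(curl --silent --location 'https://voice.neuralspace.ai/api/v1/jobs' \
  --header 'Authorization: {{API_KEY}}' \
  --form 'files=@"{{LOCAL_AUDIO_FILE_PATH}}"' \
  --form 'config="{\"file_transcription\":{\"language_id\":\"{{LANG}}\",\"mode\":\"{{MODE}}\"},
\"speaker_diarization\":{\"num_speakers\":{{NUM_SPEAKERS}}}}"' \
  | jq -r '.data.jobId')
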
Speaker Diarization Configuration

In the request above, speaker_diarization is an additional configuration passed alongside file_transcription.

Each audio segment is assigned to exactly one speaker. NUM_SPEAKERS sets the number of candidate speakers. If you don't know the number of speakers beforehand, you can instead experiment with the CLUSTERING_THRESHOLD parameter, which controls how aggressively segments are clustered together and therefore how many distinct speakers are detected.

Note: If NUM_SPEAKERS is passed, the CLUSTERING_THRESHOLD parameter will be ignored.
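
For readability, this is the unescaped config payload sent by the request above, with the placeholders left in. In practice you would set either num_speakers or the clustering threshold override, since num_speakers takes precedence:

{
  "file_transcription": {
    "language_id": "{{LANG}}",
    "mode": "{{MODE}}"
  },
  "speaker_diarization": {
    "num_speakers": {{NUM_SPEAKERS}},
    "overrides": {
      "clustering": {
        "threshold": {{CLUSTERING_THRESHOLD}}
      }
    }
  }
}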

Fetch Transcription and Speaker Diarization Results

Pass the jobId (returned in the response of the transcription API) to the API below to fetch the status and results of the job.

curl --location 'https://voice.neuralspace.ai/api/v1/jobs/{{jobId}}' \
--header 'Authorization: {{API_KEY}}'
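
Jobs are processed asynchronously, so the results may not be ready on the first request. Below is a minimal polling sketch; the data.status field and the "Completed" value are assumptions here, so verify them against an actual response:

JOB_ID="281f8662-cdc3-4c76-82d0-e7d14af52c46"   # jobId from the job creation response
while true; do
  RESPONSE=$(curl --silent --location "https://voice.neuralspace.ai/api/v1/jobs/${JOB_ID}" \
    --header 'Authorization: {{API_KEY}}')
  STATUS=$(echo "${RESPONSE}" | jq -r '.data.status')
  [ "${STATUS}" = "Completed" ] && break        # assumed terminal status; check your responses
  sleep 5                                       # wait a few seconds between polls
done
echo "${RESPONSE}" | jq '.data.result.transcription.segments'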

Once the job has completed, the response to the fetch request appears as follows:

{
  ...
  "data": {
    ...
    "result": {
      "transcription": {
        "segments": [
          {
            "startTime": 6.6909375,
            "endTime": 10.302187500000002,
            "text": "We've been at this for hours now. Have you found anything useful in any of those books?",
            "speaker": "SPEAKER_02"
          },
          {
            "startTime": 10.690312500000001,
            "endTime": 14.588437500000001,
            "text": "Not a single thing, Lewis. I'm sure that there must be something in this library.",
            "speaker": "SPEAKER_01"
          },
          {
            "startTime": 14.740312500000002,
            "endTime": 16.545937499999997,
            "text": "It's not like there's nothing left to be discovered.",
            "speaker": "SPEAKER_01"
          }
        ]
      }
      ...
    }
  }
}
info

When speaker diarization is enabled together with translation or sentiment analysis, you also get per-segment translations or sentiments. These segments are generated by speaker diarization, and the same list of segments is shared across all of these features.

In the response above, each segment is tagged with a speaker label and includes the transcribed text, along with the start and end times of the speech.
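
To turn these segments into a readable, speaker-attributed transcript, you can format them with jq. A sketch, assuming RESPONSE holds the fetched job response as in the polling example above:

# Print one line per segment: speaker, time range, and text
echo "${RESPONSE}" | jq -r '.data.result.transcription.segments[]
  | "\(.speaker) (\(.startTime)s - \(.endTime)s): \(.text)"'

# Example output line:
# SPEAKER_02 (6.6909375s - 10.302187500000002s): We've been at this for hours now. ...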

Troubleshooting and FAQ

Having trouble with speaker diarization? Check out our FAQ page. If you still need help, feel free to reach out to us directly at support@neuralspace.ai or join our Slack community.