Speaker Diarization
Speaker diarization is the process of distinguishing and segmenting audio according to the different speakers present. In other words, it aims to answer the question: "Who spoke when?" in an audio recording. This task is important in various applications such as call analytics, meeting summarization, and audio indexing.
For instance, consider a meeting recording with multiple participants. Speaker diarization would segment the audio stream, indicating the points in time when each participant starts and stops speaking. This segmented information can then be used to produce a transcription that attributes each segment of speech to the appropriate speaker.
File Transcription Job with Speaker Diarization
- API
- Python SDK
Copy and paste the curl request below into your terminal to start a transcription using the API. Fill in the variables with the appropriate values, as described in the overview.
curl --location 'https://voice.neuralspace.ai/api/v2/jobs' \
--header 'Authorization: {{API_KEY}}' \
--form 'files=@"{{LOCAL_AUDIO_FILE_PATH}}"' \
--form 'config="{\"file_transcription\":{\"language_id\":\"{{LANG}}\",\"mode\":\"{{MODE}}\"},\"speaker_diarization\":{\"mode\":\"{{MODE}}\",\"num_speakers\":{{NUM_SPEAKERS}},\"overrides\":{\"clustering\":{\"threshold\":{{CLUSTERING_THRESHOLD}}}}}}"'
The `mode`, `num_speakers`, and `overrides` parameters inside `speaker_diarization` can be set while sending the transcription request. Note that all of these parameters are optional.
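For reference, the escaped `config` form field above corresponds to a JSON object like the following; the concrete values here are hypothetical examples:

```json
{
    "file_transcription": {
        "language_id": "en",
        "mode": "advanced"
    },
    "speaker_diarization": {
        "mode": "speakers",
        "num_speakers": 2,
        "overrides": {
            "clustering": {
                "threshold": 5
            }
        }
    }
}
```

A successful request returns a response like the following: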
{
    "success": true,
    "message": "Job created successfully",
    "data": {
        "jobId": "281f8662-cdc3-4c76-82d0-e7d14af52c46"
    }
}
Once the package installation steps are complete, execute the Python code snippet below:
import neuralspace as ns

vai = ns.VoiceAI()
# or,
# vai = ns.VoiceAI(api_key='YOUR_API_KEY')

# Example diarization settings; tune these for your audio
MODE = "speakers"    # "speakers" (default) or "channels"
NUM_SPEAKERS = 2     # set if the number of speakers is known in advance
THRESHOLD = 5        # clustering sensitivity, between -20 and 20

# Setup job configuration
config = {
    "file_transcription": {
        "language_id": "en",
        "mode": "advanced",
        "number_formatting": "words",
    },
    "speaker_diarization": {
        "mode": MODE,
        "num_speakers": NUM_SPEAKERS,
        "overrides": {
            "clustering": {
                "threshold": THRESHOLD
            }
        }
    }
}

# Create a new file transcription job
job_id = vai.transcribe(file='path/to/audio.wav', config=config)
print(job_id)
The `mode`, `num_speakers`, and `overrides` parameters can be set while sending the transcription request. Note that all of these parameters are optional.
The response is the same as described in the overview and looks like the following:
6abe4f35-8220-4981-95c7-3b040d9b86d1
| Speaker Diarization Parameter | Required | Default | All Options | Description |
|---|---|---|---|---|
| `mode` | No | `speakers` | `speakers` or `channels` | Setting the mode to `speakers` identifies speakers automatically, while setting it to `channels` identifies speakers on the basis of channels when multi-channel audio is provided and there is a different speaker in each channel. |
| `num_speakers` | No | `None` | Any integer value > 0 | Can be provided if you already know the number of speakers in the audio, to increase accuracy. `threshold` is ignored if this parameter is provided. |
| `threshold` | No | `0` | Any integer value between -20 and 20 | Set this to tweak the sensitivity of the speaker diarization. If there are a lot of speakers, a higher value is recommended. |
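For example, if you already know that a recording contains exactly two speakers, a minimal `speaker_diarization` block like the one below (values illustrative) pins the speaker count; any clustering `threshold` would then be ignored:

```json
{
    "speaker_diarization": {
        "mode": "speakers",
        "num_speakers": 2
    }
}
```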
VoiceAI has built-in support for audio files that have more than one channel, with a different speaker in each channel. This is usually useful for call centre recordings, where the left channel could carry the agent's side of the conversation while the right channel carries the customer's side. If your audio is similar, you can simply set the `mode` in the `speaker_diarization` config to `channels`. VoiceAI will automatically handle the speakers and their corresponding speech, and return transcripts accordingly.
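For instance, a stereo call centre recording could be submitted with a config like the following; this is a sketch reusing the SDK call shown above, and the file path is illustrative:

```python
# Channel-based diarization: each audio channel is treated as one speaker
config = {
    "file_transcription": {
        "language_id": "en",
        "mode": "advanced",
    },
    "speaker_diarization": {
        "mode": "channels",
    },
}

# Channel 0 (e.g. the agent) and channel 1 (e.g. the customer) are
# transcribed and attributed separately
job_id = vai.transcribe(file='path/to/call_recording.wav', config=config)
```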
Fetch Transcription and Speaker Diarization Results
- API
- Python SDK
When you pass the `jobId` (received in response to the transcription API) to the API below, it fetches the status and results of the job.
curl --location 'https://voice.neuralspace.ai/api/v2/jobs/{{jobId}}' \
--header 'Authorization: {{API_KEY}}'
Using the `job_id` (returned by `vai.transcribe` above), the snippet below can be executed to fetch the status and results of the job.
result = vai.get_job_status(job_id)
print(result)
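Jobs are processed asynchronously, so the result may not be ready on the first call. Below is a minimal polling sketch; the `status` field name and the `"Completed"`/`"Failed"` values are assumptions not confirmed on this page, so verify them against the overview:

```python
import time

# Poll the job until it finishes.
# ASSUMPTION: the payload exposes a "status" field under "data" that
# eventually reads "Completed" or "Failed"; verify against the overview.
result = vai.get_job_status(job_id)
while result["data"].get("status") not in ("Completed", "Failed"):
    time.sleep(5)  # back off between polls
    result = vai.get_job_status(job_id)
print(result)
```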
The response of the request above appears as follows:
{
    ...
    "data": {
        ...
        "result": {
            "transcription": {
                "segments": [
                    {
                        "startTime": 6.6909375,
                        "endTime": 10.302187500000002,
                        "text": "We've been at this for hours now. Have you found anything useful in any of those books?",
                        "speaker": "SPEAKER_02",
                        "channel": 0
                    },
                    {
                        "startTime": 10.690312500000001,
                        "endTime": 14.588437500000001,
                        "text": "Not a single thing, Lewis. I'm sure that there must be something in this library.",
                        "speaker": "SPEAKER_01",
                        "channel": 0
                    },
                    ...
                ]
            }
            ...
        }
    }
}
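Since each segment carries a speaker label with start and end times, a speaker-attributed transcript can be printed in a few lines; this sketch assumes `result` is the parsed response shown above:

```python
# Print a speaker-attributed transcript from the diarized segments
segments = result["data"]["result"]["transcription"]["segments"]
for seg in segments:
    print(f'[{seg["startTime"]:.2f}s - {seg["endTime"]:.2f}s] '
          f'{seg["speaker"]}: {seg["text"]}')
```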
When `channels` mode is enabled, the diarization result looks the same as above, but the channel-wise transcript contains transcripts for both channels individually, as follows:
{
    "success": true,
    "message": "Data fetched successfully",
    "data": {
        ...
        "result": {
            "transcription": {
                "channels": {
                    "0": {
                        "transcript": "We've been at this for hours now. Have you found anything useful in any of those books? ...",
                        "timestamps": [
                            {
                                "word": "We've",
                                "start": 6.65,
                                "end": 6.99,
                                "conf": 0.8
                            },
                            {
                                "word": "been",
                                "start": 6.99,
                                "end": 7.09,
                                "conf": 0.99
                            },
                            ...
                        ]
                    },
                    "1": {
                        "transcript": "Not a single thing, Lewis. I'm sure that there must be something in this library. ...",
                        "timestamps": [
                            {
                                "word": "Not",
                                "start": 10.77,
                                "end": 10.89,
                                "conf": 1
                            },
                            {
                                "word": "a",
                                "start": 10.89,
                                "end": 11.09,
                                "conf": 1
                            },
                            ...
                        ]
                    }
                }
            }
        }
    }
}
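The per-channel transcripts and word-level timings can be walked the same way; again, a sketch assuming `result` is the parsed response:

```python
# Print each channel's transcript plus word-level timings
channels = result["data"]["result"]["transcription"]["channels"]
for channel_id, channel in channels.items():
    print(f'Channel {channel_id}: {channel["transcript"]}')
    for ts in channel["timestamps"]:
        print(f'  {ts["word"]}: {ts["start"]}s-{ts["end"]}s '
              f'(confidence {ts["conf"]})')
```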
When speaker diarization is enabled along with translation or sentiment analysis, you also get per-segment translations or sentiments. These segments are generated by speaker diarization, and the list of segments is the same across all of these features.
In the response above, each speaker is tagged and their speech is returned, along with the start and end times of each segment.
Troubleshooting and FAQ
Having an issue with speaker diarization? Check out our FAQ page. If you still need help, feel free to reach out to us directly at support@neuralspace.ai or join our Slack community.