Word-Level Timestamps

What are timestamps?

Timestamps mark the specific time in an audio or video clip at which a particular word or phrase was spoken. Below is an example of what they can look like:

00:00:01,000 --> 00:00:04,000
This is the text for the audio between the first and fourth second.

In the example above, the timestamp 00:00:01,000 --> 00:00:04,000 denotes the interval during which the phrase "This is the text for the audio between the first and fourth second." was spoken.
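To work with a timing line like this in code, it usually helps to convert it to seconds first. The following is a minimal sketch, assuming the HH:MM:SS,mmm layout shown in the example; the function names are illustrative and not part of VoiceAI.

```python
import re

# Matches one timestamp in the HH:MM:SS,mmm layout, e.g. "00:00:01,000".
SRT_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def srt_time_to_seconds(stamp: str) -> float:
    """Convert one HH:MM:SS,mmm timestamp to seconds."""
    match = SRT_TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not a valid timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_timing_line(line: str) -> tuple[float, float]:
    """Split a "start --> end" line into (start_seconds, end_seconds)."""
    start, end = line.split("-->")
    return srt_time_to_seconds(start), srt_time_to_seconds(end)


start, end = parse_timing_line("00:00:01,000 --> 00:00:04,000")
print(start, end)  # 1.0 4.0
```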

Word-Level Timestamps

Timestamps are more precise when they record exactly when each word was spoken. Instead of one timestamp for an entire sentence or paragraph, each word gets its own start and end time in the audio or video timeline. Below is an example:

"timestamps": [
    {
        "word": "Towards",
        "start": 1.92,
        "end": 2.2
    },
    {
        "word": "the",
        "start": 2.2,
        "end": 2.42
    },
    {
        "word": "night",
        "start": 2.42,
        "end": 2.66
    }
]

In the example above, the word night was spoken between 2.42 and 2.66 seconds. Knowing this gives more control over how the transcript can be used.
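As one illustration of that control, the sketch below looks up which word is being spoken at a given playback position. It uses only the three entries from the example; the helper name word_at is hypothetical, and treating each interval as half-open is an assumption made so that a boundary time maps to exactly one word.

```python
timestamps = [
    {"word": "Towards", "start": 1.92, "end": 2.2},
    {"word": "the", "start": 2.2, "end": 2.42},
    {"word": "night", "start": 2.42, "end": 2.66},
]


def word_at(words, t):
    """Return the word being spoken at time t (in seconds), or None."""
    for entry in words:
        # Half-open interval: start is included, end is excluded.
        if entry["start"] <= t < entry["end"]:
            return entry["word"]
    return None


print(word_at(timestamps, 2.5))  # night
print(word_at(timestamps, 2.2))  # the
print(word_at(timestamps, 5.0))  # None
```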

Why are timestamps useful?

Timestamps in a transcript serve several practical purposes, enhancing the utility and accessibility of the transcribed content. Here are some of the ways in which timestamps are useful:

Navigation and Searching: Timestamps allow users to quickly and accurately navigate to or search for specific sections of an audio or video file. For example, if someone wants to listen to a specific quote or segment, they can use the timestamp to jump directly to that point in the recording.
Clarifying and Editing: Timestamps help clarify ambiguous sections by pointing to the exact time to check in the audio. They also make it possible to edit or modify segments without going through the whole audio again.
Synchronization: If someone is creating subtitles or captions for a video, timestamps ensure that the text appears in sync with the spoken content.
Research and Analysis: Researchers or journalists might use timestamps to reference specific parts of an interview or conversation. Timestamps allow for a methodical approach, making it easier to cite or revisit parts of the conversation.
Accessibility: For people with hearing impairments, synchronized transcripts (like captions) can be essential when consuming video content. Timestamps ensure the alignment of text with the corresponding audio.
Segmentation: Timestamps can aid in segmenting a long recording into smaller, topical chunks. For podcasts or lectures, this can be helpful in creating a table of contents or chapter breaks.
Verification: In legal or official contexts, timestamps help verify when specific statements were made, ensuring the chronological accuracy of events or conversations.
Enhanced Engagement: In some online platforms or learning management systems, clickable timestamps can be integrated, allowing listeners or viewers to jump to topics of interest, enhancing user engagement and experience.

In essence, timestamps in transcripts bridge the gap between text and the rich context provided by audio or video, making the content more accessible, navigable, and useful.
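The navigation and searching use case can be sketched with word-level timestamps as follows. This is an illustrative example rather than a VoiceAI feature: find_phrase is a hypothetical helper, and it assumes a case-insensitive exact match on words, so punctuation attached to words in a real transcript would need extra handling.

```python
def find_phrase(words, phrase):
    """Return a (start, end) pair in seconds for each occurrence of phrase."""
    target = [w.lower() for w in phrase.split()]
    spoken = [entry["word"].lower() for entry in words]
    hits = []
    if not target:
        return hits
    for i in range(len(spoken) - len(target) + 1):
        if spoken[i:i + len(target)] == target:
            # Start of the first matched word, end of the last one.
            hits.append((words[i]["start"], words[i + len(target) - 1]["end"]))
    return hits


words = [
    {"word": "Towards", "start": 1.92, "end": 2.2},
    {"word": "the", "start": 2.2, "end": 2.42},
    {"word": "night", "start": 2.42, "end": 2.66},
]
print(find_phrase(words, "the night"))  # [(2.2, 2.66)]
```

A player could then seek to the returned start time to jump straight to the phrase.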

How to get timestamps from VoiceAI?

API and SDK
UI

VoiceAI returns timestamps by default for audio files when using the API or the SDK. Below is an example of what the response looks like:

{
    ...
    "data": {
        ...
        "result": {
            "transcription": {
                "channels": {
                    "0": {
                        ...
                        "timestamps": [
                            {
                                "word": "Towards",
                                "start": 1.92,
                                "end": 2.2,
                                "conf": 1
                            },
                            {
                                "word": "the",
                                "start": 2.2,
                                "end": 2.42,
                                "conf": 1
                            },
                            {
                                "word": "night",
                                "start": 2.42,
                                "end": 2.66,
                                "conf": 1
                            },
                            {
                                "word": "before",
                                "start": 2.66,
                                "end": 3.02,
                                "conf": 0.69
                            },
                            ...
                        ]
                    }
                }
            }
        }
    }
}

In the above example, word is the word that was spoken in the audio, and start and end are the start and end times of that word in seconds. conf is the model's confidence in that prediction.
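Once the response is parsed, the timestamps can be read by following the nesting shown above. The sketch below flags words the model was less sure about. It reproduces only the keys visible in the example (the real response contains further fields, elided as "..."), and the 0.8 threshold and the function name are assumptions chosen for illustration.

```python
import json

# Only the keys shown in the example response; elided fields are left out.
response_text = """
{
  "data": {
    "result": {
      "transcription": {
        "channels": {
          "0": {
            "timestamps": [
              {"word": "Towards", "start": 1.92, "end": 2.2, "conf": 1},
              {"word": "the", "start": 2.2, "end": 2.42, "conf": 1},
              {"word": "night", "start": 2.42, "end": 2.66, "conf": 1},
              {"word": "before", "start": 2.66, "end": 3.02, "conf": 0.69}
            ]
          }
        }
      }
    }
  }
}
"""


def low_confidence_words(response, channel="0", threshold=0.8):
    """Return the timestamp entries whose conf is below threshold."""
    channels = response["data"]["result"]["transcription"]["channels"]
    return [w for w in channels[channel]["timestamps"] if w["conf"] < threshold]


response = json.loads(response_text)
for w in low_confidence_words(response):
    print(f'{w["word"]}: {w["start"]}-{w["end"]}s (conf {w["conf"]})')
# before: 2.66-3.02s (conf 0.69)
```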

To get timestamps for transcripts when using the UI, use the download option present in the job result page.

If you download the transcript as a JSON file, you get the word-level timestamps, as in the example below:

{
    ...
    "data": {
        ...
        "result": {
            "transcription": {
                "channels": {
                    "0": {
                        ...
                        "timestamps": [
                            {
                                "word": "Towards",
                                "start": 1.92,
                                "end": 2.2,
                                "conf": 1
                            },
                            {
                                "word": "the",
                                "start": 2.2,
                                "end": 2.42,
                                "conf": 1
                            },
                            {
                                "word": "night",
                                "start": 2.42,
                                "end": 2.66,
                                "conf": 1
                            },
                            {
                                "word": "before",
                                "start": 2.66,
                                "end": 3.02,
                                "conf": 0.69
                            },
                            ...
                        ]
                    }
                }
            }
        }
    }
}

Download your transcription as an SRT file to retrieve subtitles that can be synced with your video content. Our subtitles are compatible with major platforms including YouTube. Below is an example of an SRT file output:

1
00:00:01,000 --> 00:00:03,500
Hello, welcome to our video!

2
00:00:04,000 --> 00:00:06,000
Today, we'll discuss the benefits of timestamps.

3
00:00:06,500 --> 00:00:10,000
Timestamps are crucial for synchronizing subtitles with video.

4
00:00:10,500 --> 00:00:12,500
Stay tuned for more tips!

Tip

You can also configure guidelines such as the number of lines, duration, and number of characters so that transcripts are convenient to use as subtitles. Check out subtitle guidelines for more.
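To illustrate how a character limit turns word-level timestamps into subtitle cues, here is a rough sketch. It is not VoiceAI's subtitle logic: the function names and the max_chars parameter are hypothetical, and only a per-cue character limit is modelled (no line count or duration rules).

```python
def seconds_to_srt_time(seconds):
    """Format a time in seconds as HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_chars=40):
    """Group words into numbered cues of at most max_chars characters."""
    cues, current = [], []
    for entry in words:
        candidate = " ".join(w["word"] for w in current + [entry])
        # Start a new cue when adding this word would exceed the limit.
        if current and len(candidate) > max_chars:
            cues.append(current)
            current = []
        current.append(entry)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = seconds_to_srt_time(cue[0]["start"])
        end = seconds_to_srt_time(cue[-1]["end"])
        text = " ".join(w["word"] for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{text}")
    return "\n\n".join(blocks) + "\n"


words = [
    {"word": "Towards", "start": 1.92, "end": 2.2},
    {"word": "the", "start": 2.2, "end": 2.42},
    {"word": "night", "start": 2.42, "end": 2.66},
    {"word": "before", "start": 2.66, "end": 3.02},
]
print(words_to_srt(words, max_chars=17))
```

With max_chars=17, the first three words fit in one cue and "before" starts a second cue, each timed from its first word's start to its last word's end.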

Troubleshooting and FAQ

Wrong timing? Check out our FAQ page. If you still need help, feel free to reach out to us directly at support@neuralspace.ai or join our Slack community.
