Transcript

The Transcript class is designed to handle speech-to-text outputs generated by machine learning models, such as OpenAI’s Whisper. It supports outputs that include word-level timestamps.

Constructing a Transcript

You typically create a Transcript instance from JSON data. The JSON should adhere to the following structure:


type Captions = {
	token: string; 	// The spoken word
	start: number; 	// The start in milliseconds
	stop: number;	// The stop in milliseconds
}[][];

The JSON structure is a two-dimensional array: the outer array holds sentences, and each sentence is an array of word (token) objects. This structure preserves the semantic grouping of words.
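For illustration, a small hand-written value matching this structure might look like the following. The words and timings are made up, and `sampleCaptions` and `wordCount` are hypothetical names used only for this sketch.

```typescript
// Shape described above: an array of sentences, each an array of timed words.
type Captions = {
  token: string; // The spoken word
  start: number; // Start time in milliseconds
  stop: number;  // Stop time in milliseconds
}[][];

// Hypothetical sample data: two sentences with invented timings.
const sampleCaptions: Captions = [
  [
    { token: 'Hello', start: 0, stop: 300 },
    { token: 'World', start: 320, stop: 600 },
  ],
  [
    { token: 'Nice', start: 900, stop: 1150 },
    { token: 'to', start: 1160, stop: 1250 },
    { token: 'meet', start: 1260, stop: 1500 },
    { token: 'you', start: 1510, stop: 1800 },
  ],
];

// Total number of words across all sentences.
const wordCount = sampleCaptions.reduce(
  (sum, sentence) => sum + sentence.length,
  0,
);
```

A value shaped like `sampleCaptions` is what the document refers to as `captions` when creating a Transcript from JSON.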

To create a Transcript from this JSON, use the following:


import * as core from '@diffusionstudio/core';
 
// From data already in memory (`captions` is of type Captions)
const transcript = core.Transcript.fromJSON(captions);

// or, to load from a remote JSON file
const remoteTranscript = await core.Transcript.from('https://.../captions.json');

Manual Construction

You can also manually create a Transcript instance:


import * as core from '@diffusionstudio/core';
 
const transcript = new core.Transcript([
  new core.WordGroup([
    new core.Word('Hello', 0, 300),
    new core.Word('World', 320, 600),
  ])
]);

Utility Methods

The Transcript class provides several utility methods:


transcript.optimize();
transcript.toSRT();
transcript.slice(20);

optimize(): Adjusts the timestamps of words to improve readability when aligned on a timeline.
toSRT(): Converts the transcript to an SRT format blob, which can be downloaded and used with most video editing applications.
slice(wordCount: number): Creates a new Transcript containing only the specified number of words. This is useful for generating preview captions.
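To make the SRT output more concrete, here is a minimal sketch of how millisecond timestamps map onto the SRT cue format (`HH:MM:SS,mmm --> HH:MM:SS,mmm`). The helpers `formatSrtTimestamp` and `formatSrtCue` are hypothetical and written only for illustration; they are not part of the library, and toSRT() may build its output differently.

```typescript
// Hypothetical helper (not part of @diffusionstudio/core):
// left-pads a non-negative integer with zeros to the given width.
function zeroPad(value: number, width: number): string {
  let text = String(value);
  while (text.length < width) {
    text = '0' + text;
  }
  return text;
}

// Hypothetical helper: converts milliseconds to an SRT timestamp (HH:MM:SS,mmm).
function formatSrtTimestamp(ms: number): string {
  const total = Math.max(0, Math.round(ms));
  const hours = Math.floor(total / 3600000);
  const minutes = Math.floor((total % 3600000) / 60000);
  const seconds = Math.floor((total % 60000) / 1000);
  const millis = total % 1000;
  return (
    zeroPad(hours, 2) + ':' + zeroPad(minutes, 2) + ':' +
    zeroPad(seconds, 2) + ',' + zeroPad(millis, 3)
  );
}

// Hypothetical helper: builds one SRT cue (index, time range, caption text).
function formatSrtCue(
  index: number,
  startMs: number,
  stopMs: number,
  text: string,
): string {
  const range = formatSrtTimestamp(startMs) + ' --> ' + formatSrtTimestamp(stopMs);
  return index + '\n' + range + '\n' + text + '\n';
}
```

In an SRT file, cues like the one returned by `formatSrtCue` are separated by blank lines and numbered starting at 1.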

Iterating Over Words

The Transcript class provides an iter method for iterating over the transcript in groups of words:


for (const group of transcript.iter({ count: [2] })) {
  // Each group will contain up to two words
}

The iter method lets you iterate over words in groups, sized according to the option you pass. If two values are provided for an option, a random number between them is chosen, which adds some variation to the caption groups.

The following options are available for iteration:

count: iterate by word count
duration: iterate by group duration
length: iterate by the number of characters
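The count option can be pictured with a small standalone sketch. This is not the library's implementation: `TimedWord` and `groupByCount` are hypothetical names, the sketch returns plain arrays rather than an iterator, and it assumes the random size is drawn once per group with both bounds inclusive.

```typescript
// Hypothetical word shape for this sketch only.
type TimedWord = { text: string; start: number; stop: number };

// Hypothetical stand-in for count-based grouping.
// `count` holds one value (fixed group size) or two values (random size
// between them, assumed inclusive). `random` is injectable for testing.
function groupByCount(
  words: TimedWord[],
  count: number[],
  random: () => number = Math.random,
): TimedWord[][] {
  const [min, max = min] = count;
  const groups: TimedWord[][] = [];
  let index = 0;
  while (index < words.length) {
    // Pick a group size in [min, max]; never smaller than one word.
    const size = Math.max(1, min + Math.floor(random() * (max - min + 1)));
    groups.push(words.slice(index, index + size));
    index += size;
  }
  return groups;
}

// Invented sample words, 100 ms each.
const demoWords: TimedWord[] = ['one', 'two', 'three', 'four', 'five'].map(
  (text, i) => ({ text, start: i * 100, stop: i * 100 + 100 }),
);

// Fixed size: every group has up to two words.
const pairs = groupByCount(demoWords, [2]);

// Random size: each group has between one and three words.
const varied = groupByCount(demoWords, [1, 3]);
```

With five words and a fixed count of 2, the last group holds the single leftover word. Grouping by duration or by character length would follow the same pattern, with the loop accumulating words until the chosen limit is reached.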