Transcript

The Transcript class is designed to handle speech-to-text outputs generated by machine learning models, such as OpenAI's Whisper. It supports outputs that include word-level timestamps.

Constructing a Transcript

You typically create a Transcript instance from JSON data. The JSON should adhere to the following structure:

    type Captions = {
      token: string; // The spoken word
      start: number; // The start in milliseconds
      stop: number; // The stop in milliseconds
    }[][];

The JSON structure is a two-dimensional array: the outer array represents sentences, and each sentence is an array of words (tokens). This structure preserves the semantic grouping of words.
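To make the shape concrete, here is a small sketch of caption data that matches the Captions type, together with a type guard for checking parsed JSON before it is passed on. The sample values and the `isCaptions` helper are hypothetical and are not part of the library.

```typescript
// Shape of the caption data, as described in the type definition above.
type Captions = {
  token: string; // The spoken word
  start: number; // The start in milliseconds
  stop: number; // The stop in milliseconds
}[][];

// Hypothetical sample data: two sentences, each a list of timed words.
const captions: Captions = [
  [
    { token: 'Hello', start: 0, stop: 300 },
    { token: 'World', start: 320, stop: 600 },
  ],
  [
    { token: 'Second', start: 900, stop: 1300 },
    { token: 'sentence', start: 1320, stop: 1900 },
  ],
];

// Hypothetical helper (not a library function): checks that an unknown value,
// for example the result of JSON.parse, has the Captions shape.
function isCaptions(value: unknown): value is Captions {
  return (
    Array.isArray(value) &&
    value.every(
      (sentence) =>
        Array.isArray(sentence) &&
        sentence.every(
          (word) =>
            typeof word === 'object' &&
            word !== null &&
            typeof word.token === 'string' &&
            typeof word.start === 'number' &&
            typeof word.stop === 'number',
        ),
    )
  );
}
```

Running a check like this on untrusted JSON is a design choice for this sketch; the source does not say whether Transcript.fromJSON validates its input.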

To create a Transcript from this JSON, use the following:

    import { Transcript } from '@diffusionstudio/core';

    const transcript = Transcript.fromJSON(captions); // `captions` is of type Captions

Manual Construction

You can also manually create a Transcript instance:

    import { Transcript, WordGroup, Word } from '@diffusionstudio/core';

    const transcript = new Transcript([
      new WordGroup([
        new Word('Hello', 0, 300),
        new Word('World', 320, 600),
      ]),
    ]);

Utility Methods

The Transcript class provides several utility methods:

    transcript.optimize();
    transcript.toSRT();
    transcript.slice(20);
  • optimize(): Adjusts the timestamps of words to improve readability when aligned on a timeline.
  • toSRT(): Converts the transcript to an SRT format blob, which can be downloaded and used with most video editing applications.
  • slice(wordCount: number): Creates a new Transcript containing only the specified number of words. This is useful for generating preview captions.
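For readers unfamiliar with SRT, the sketch below shows how millisecond timestamps map onto SRT cues (`HH:MM:SS,mmm --> HH:MM:SS,mmm`). It is a standalone illustration of the file format and not the library's toSRT implementation. The `TimedWord` type and the `formatSrtTime` and `toSrtText` helpers are hypothetical names introduced here.

```typescript
// Hypothetical word shape with millisecond timestamps, mirroring the caption JSON.
type TimedWord = { token: string; start: number; stop: number };

// Left-pads a non-negative integer with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = '0' + s;
  return s;
}

// Converts milliseconds to the SRT timestamp format HH:MM:SS,mmm.
function formatSrtTime(ms: number): string {
  const total = Math.max(0, Math.round(ms));
  const hours = Math.floor(total / 3600000);
  const minutes = Math.floor((total % 3600000) / 60000);
  const seconds = Math.floor((total % 60000) / 1000);
  const millis = total % 1000;
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)},${pad(millis, 3)}`;
}

// Builds SRT text with one cue per non-empty group of words.
// Each cue spans from the first word's start to the last word's stop.
function toSrtText(groups: TimedWord[][]): string {
  return groups
    .filter((group) => group.length > 0)
    .map((group, index) => {
      const start = formatSrtTime(group[0].start);
      const stop = formatSrtTime(group[group.length - 1].stop);
      const text = group.map((word) => word.token).join(' ');
      return `${index + 1}\n${start} --> ${stop}\n${text}\n`;
    })
    .join('\n');
}
```

In this sketch, cues are numbered from 1 and separated by a blank line, as SRT players expect.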

Iterating Over Words

The Transcript class provides an iter method for iterating over words in groups:

    for (const group of transcript.iter({ count: [2] })) {
      // Each group will contain up to two words
    }

The iter method lets you iterate over words using one of several grouping options. Each option takes an array of values. If two values are provided, a random number between them is chosen, which introduces a degree of randomness intended to improve captioning quality.

The following options are available for iteration:

  • count: iterate by word count
  • duration: iterate by group duration
  • length: iterate by the number of characters
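The idea behind the count option can be sketched as follows. This is a standalone illustration, not the library's iter implementation. The `groupByCount` helper, its `random` parameter, and the assumption that a new size is drawn for every group are all hypothetical.

```typescript
// Hypothetical word shape with millisecond timestamps, mirroring the caption JSON.
type TimedWord = { token: string; start: number; stop: number };

// Hypothetical helper (not the library's iter): splits words into consecutive groups.
// `count` holds one value (fixed group size) or two values (inclusive size range).
// `random` is injectable so that the behavior can be made deterministic in tests.
function groupByCount(
  words: TimedWord[],
  count: number[],
  random: () => number = Math.random,
): TimedWord[][] {
  const min = Math.max(1, Math.floor(count[0]));
  const upper = count.length > 1 ? count[1] : count[0];
  const max = Math.max(min, Math.floor(upper));
  const groups: TimedWord[][] = [];
  let index = 0;
  while (index < words.length) {
    // Assumption: a size in [min, max] is drawn separately for each group.
    const size = min + Math.floor(random() * (max - min + 1));
    groups.push(words.slice(index, index + size));
    index += size;
  }
  return groups;
}

// Sample data: five words of 200 ms each.
const sampleWords: TimedWord[] = ['one', 'two', 'three', 'four', 'five'].map(
  (token, i) => ({ token, start: i * 250, stop: i * 250 + 200 }),
);

const fixedGroups = groupByCount(sampleWords, [2]); // group sizes: 2, 2, 1
const variedGroups = groupByCount(sampleWords, [1, 3]); // each group size is 1, 2, or 3
```

The last group can be shorter than the requested size, which matches the "up to two words" wording in the iteration example above.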