Transcribe Speech | Substrate

Learn how to transcribe speech from audio and video files

The TranscribeSpeech node transcribes speech from audio or video files. Supported input types include:

Base-64 encoded data strings (if your media is small enough to fit in a request payload). Be sure to include the data: prefix with a MIME type.
Hosted media URLs (with a wide range of supported formats)
YouTube URLs
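
As a minimal sketch of the first input type, the snippet below builds a base-64 data string with a data: prefix and a MIME type, using only the Python standard library. The helper name to_data_uri and the placeholder bytes are illustrative assumptions, not part of the Substrate SDK; in practice you would read the bytes from your own media file and choose the MIME type that matches it.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw media bytes as a base-64 data URI with a MIME type prefix.

    Hypothetical helper for illustration; not part of the Substrate SDK.
    """
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes stand in for real audio; in practice you might use
# open("clip.wav", "rb").read() (hypothetical path).
uri = to_data_uri(b"RIFF", "audio/wav")
print(uri)  # data:audio/wav;base64,UklGRg==
```

The resulting string could then be passed wherever a media URI is accepted, provided the encoded payload stays small enough to fit in a request.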

TranscribeSpeech also includes these built-in capabilities:

segmentation by sentence
diarization (speaker identification)
alignment to word-level timestamps
automatic chapter detection

To simply transcribe input without further processing, provide an audio_uri. This can be a publicly-hosted audio or video file, base-64-encoded audio or video data, or a privately-hosted external file. For best results, you may also provide a prompt that describes the content of the audio or video.

Python


from substrate import Substrate, TranscribeSpeech
# ...
transcript = TranscribeSpeech(
    audio_uri="https://media.substrate.run/dfw-clip.m4a",
    prompt="David Foster Wallace interviewed about US culture",
)
res = substrate.run(transcript)

Output


{
  "text": "language like that, the wounded inner child, the inner pain, is part of a kind of pop psychological movement in the United States that is a sort of popular Freudianism that ..."
}

To enable additional capabilities, set:

segment: True to return a list of sentence segments with start and end timestamps.
align: True to return a list of aligned words within sentence segments.
diarize: True to include speaker IDs within segments and words.
suggest_chapters: True to return a list of suggested chapters with titles and start timestamps.

Python


transcript = TranscribeSpeech(
    audio_uri="https://media.substrate.run/dfw-clip.m4a",
    prompt="David Foster Wallace interviewed about US culture",
    segment=True,
    align=True,
    diarize=True,
    suggest_chapters=True,
)

Output


{
  "text": "language like that, the wounded inner child, the inner pain, is part of a kind of pop psychological movement in the United States that is a sort of popular Freudianism that ...",
  "segments": [
    {
      "start": 0.874,
      "end": 15.353,
      "speaker": "SPEAKER_00",
      "text": "language like that, the wounded inner child, the inner pain, is part of a kind of pop psychological movement in the United States that is a sort of popular Freudianism that",
      "words": [
        {
          "word": "language",
          "start": 0.874,
          "end": 1.275,
          "speaker": "SPEAKER_00"
        },
        {
          "word": "like",
          "start": 1.295,
          "end": 1.455,
          "speaker": "SPEAKER_00"
        }
      ]
    }
  ],
  "chapters": [
    {
      "title": "Introduction to the Wounded Inner Child and Popular Psychology in US",
      "start": 0.794
    },
    {
      "title": "The Paradox of Popular Psychology and Anger in America",
      "start": 16.186
    }
  ]
}
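
The output above can be consumed like any nested JSON structure. The sketch below assumes the result has already been loaded into a plain Python dict with the shape shown (how you obtain that dict from the SDK response is not covered here), and the helper names speaker_lines and chapter_at are hypothetical. It prints one speaker-labelled line per segment and looks up which suggested chapter covers a given timestamp.

```python
def speaker_lines(result: dict) -> list[str]:
    """Format each segment as '[start-end] SPEAKER: text'."""
    lines = []
    for seg in result.get("segments", []):
        # "speaker" is only present when diarize=True was requested.
        speaker = seg.get("speaker", "UNKNOWN")
        lines.append(f"[{seg['start']:.2f}-{seg['end']:.2f}] {speaker}: {seg['text']}")
    return lines


def chapter_at(result: dict, t: float):
    """Return the title of the last chapter starting at or before time t."""
    current = None
    for chapter in sorted(result.get("chapters", []), key=lambda c: c["start"]):
        if chapter["start"] <= t:
            current = chapter["title"]
        else:
            break
    return current


# Abbreviated sample shaped like the output above.
sample = {
    "segments": [
        {"start": 0.874, "end": 15.353, "speaker": "SPEAKER_00",
         "text": "language like that"},
    ],
    "chapters": [
        {"title": "Introduction to the Wounded Inner Child and Popular Psychology in US",
         "start": 0.794},
        {"title": "The Paradox of Popular Psychology and Anger in America",
         "start": 16.186},
    ],
}

for line in speaker_lines(sample):
    print(line)
print(chapter_at(sample, 20.0))
```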

You can customize the chapter summarization feature by implementing your own pipeline. To learn how to do this, and to see an example of how to use text segments to create an animated captions experience, check out our runnable example on val.town. You can also find this example in the examples/descript directory of the substrate-python and substrate-typescript SDK repositories.
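
As a rough illustration of the captions idea (not the val.town example itself), the sketch below turns word-level timestamps into WebVTT cues. It assumes segments shaped like the "segments" list in the output above, with align=True so that each segment carries a "words" list; the function names and the words_per_cue grouping are assumptions made for this sketch.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (hh:mm:ss.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def words_to_vtt(segments: list, words_per_cue: int = 3) -> str:
    """Group aligned words into short cues and render them as WebVTT text."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        words = seg.get("words", [])
        for i in range(0, len(words), words_per_cue):
            group = words[i:i + words_per_cue]
            start = vtt_timestamp(group[0]["start"])
            end = vtt_timestamp(group[-1]["end"])
            text = " ".join(w["word"] for w in group)
            lines += [f"{start} --> {end}", text, ""]
    return "\n".join(lines)


# Abbreviated sample using the two aligned words from the output above.
segments = [
    {
        "words": [
            {"word": "language", "start": 0.874, "end": 1.275},
            {"word": "like", "start": 1.295, "end": 1.455},
        ]
    }
]
print(words_to_vtt(segments, words_per_cue=2))
```

Shorter cues follow the speech more closely, while longer cues are easier to read; words_per_cue is one simple knob for that trade-off.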
