Transcribe speech
Learn how to transcribe speech from audio and video files

The TranscribeSpeech node transcribes speech from audio or video files. Supported input types include:

  • Base64-encoded data strings (if your media is small enough to fit in a request payload). Be sure to include the data: prefix with a MIME type.
  • Hosted media URLs (with a wide range of supported formats)
  • YouTube URLs

TranscribeSpeech also includes these built-in capabilities:

  • segmentation by sentence
  • diarization (speaker identification)
  • alignment to word-level timestamps
  • automatic chapter detection

To simply transcribe input without further processing, provide an audio_uri. This can be a publicly hosted audio or video file, base64-encoded audio or video data, or a privately hosted external file. For best results, you may also provide a prompt that describes the content of the audio or video.

Python
TypeScript

from substrate import Substrate, TranscribeSpeech
# ...
transcript = TranscribeSpeech(
    audio_uri="https://media.substrate.run/dfw-clip.m4a",
    prompt="David Foster Wallace interviewed about US culture",
)
res = substrate.run(transcript)

Output

{
  "text": "language like that, the wounded inner child, the inner pain, is part of a kind of pop psychological movement in the United States that is a sort of popular Freudianism that ..."
}

To enable additional capabilities, set:

  • segment: True to return a list of sentence segments with start and end timestamps.
  • align: True to return a list of aligned words within sentence segments.
  • diarize: True to include speaker IDs within segments and words.
  • suggest_chapters: True to return a list of suggested chapters with titles and start timestamps.

Python
TypeScript

transcript = TranscribeSpeech(
    audio_uri="https://media.substrate.run/dfw-clip.m4a",
    prompt="David Foster Wallace interviewed about US culture",
    segment=True,
    align=True,
    diarize=True,
    suggest_chapters=True,
)
res = substrate.run(transcript)

Output

{
  "text": "language like that, the wounded inner child, the inner pain, is part of a kind of pop psychological movement in the United States that is a sort of popular Freudianism that ...",
  "segments": [
    {
      "start": 0.874,
      "end": 15.353,
      "speaker": "SPEAKER_00",
      "text": "language like that, the wounded inner child, the inner pain, is part of a kind of pop psychological movement in the United States that is a sort of popular Freudianism that",
      "words": [
        {
          "word": "language",
          "start": 0.874,
          "end": 1.275,
          "speaker": "SPEAKER_00"
        },
        {
          "word": "like",
          "start": 1.295,
          "end": 1.455,
          "speaker": "SPEAKER_00"
        }
      ]
    }
  ],
  "chapters": [
    {
      "title": "Introduction to the Wounded Inner Child and Popular Psychology in US",
      "start": 0.794
    },
    {
      "title": "The Paradox of Popular Psychology and Anger in America",
      "start": 16.186
    }
  ]
}

You can customize the chapter summarization feature by implementing your own pipeline. To learn how to do this, and to see an example of using text segments to create an animated captions experience, check out our runnable example on val.town. You can also find this example in the examples/descript directory of the substrate-python and substrate-typescript SDK repositories.
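As one illustration of post-processing the segments, sentence segments with start and end timestamps map naturally onto SubRip (SRT) captions. The sketch below assumes the output shape shown above; the helper names are illustrative and not part of the SDK:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def segments_to_srt(segments: list[dict]) -> str:
    """Build SRT caption text from a list of {start, end, text} segments."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    return "\n".join(blocks)

# Using the first segment from the example output above:
print(segments_to_srt([
    {"start": 0.874, "end": 15.353, "text": "language like that, ..."}
]))
```

Feeding `res.json["segments"]` (or the equivalent accessor in your SDK version) into a helper like this yields a caption file you can load into most video players.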
