WhisperX audio transcription and diarization functions.

udf transcribe()

transcribe(
    audio: Audio,
    *,
    model: String,
    diarize: Bool = False,
    compute_type: String | None = None,
    language: String | None = None,
    task: String | None = None,
    chunk_size: Int | None = None,
    alignment_model_name: String | None = None,
    interpolate_method: String | None = None,
    return_char_alignments: Bool | None = None,
    diarization_model_name: String | None = None,
    num_speakers: Int | None = None,
    min_speakers: Int | None = None,
    max_speakers: Int | None = None
) -> Json
Transcribe an audio file using WhisperX. This UDF runs a transcription model locally using the WhisperX library, equivalent to the WhisperX transcribe function, as described in the WhisperX library documentation. If diarize=True, then speaker diarization will also be performed. Several of the UDF parameters are only valid if diarize=True, as documented in the parameters list below. Requirements:
  • pip install whisperx
Parameters:
  • audio (Audio): The audio file to transcribe.
  • model (String): The name of the model to use for transcription.
  • diarize (Bool): Whether to perform speaker diarization.
  • compute_type (String | None): The compute type to use for the model (e.g., 'int8', 'float16'). If None, defaults to 'float16' on CUDA devices and 'int8' otherwise.
  • language (String | None): The language code for the transcription (e.g., 'en' for English).
  • task (String | None): The task to perform (e.g., 'transcribe' or 'translate'). Defaults to 'transcribe'.
  • chunk_size (Int | None): The size of the audio chunks to process, in seconds. Defaults to 30.
  • alignment_model_name (String | None): The name of the alignment model to use. If None, uses the default model for the given language. Only valid if diarize=True.
  • interpolate_method (String | None): The method to use for interpolation of the alignment results. If not specified, uses the WhisperX default ('nearest'). Only valid if diarize=True.
  • return_char_alignments (Bool | None): Whether to return character-level alignments. Defaults to False. Only valid if diarize=True.
  • diarization_model_name (String | None): The name of the diarization model to use. Defaults to pyannote/speaker-diarization-3.1. Only valid if diarize=True.
  • num_speakers (Int | None): The number of speakers to expect in the audio. By default, the model will try to detect the number of speakers. Only valid if diarize=True.
  • min_speakers (Int | None): If specified, the minimum number of speakers to expect in the audio. Only valid if diarize=True.
  • max_speakers (Int | None): If specified, the maximum number of speakers to expect in the audio. Only valid if diarize=True.
Returns:
  • Json: A dictionary containing the audio transcription, diarization (if enabled), and various other metadata.
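A common follow-up step is to flatten the returned JSON into a single transcript string. The sketch below assumes the WhisperX output shape (a 'segments' list whose entries carry a 'text' field); the sample dict is illustrative, not real model output.

```python
def full_transcript(result: dict) -> str:
    """Join the text of all transcription segments into one string."""
    return " ".join(seg["text"].strip() for seg in result.get("segments", []))

# Illustrative sample mimicking the WhisperX segment structure.
sample = {
    "segments": [
        {"start": 0.0, "end": 2.1, "text": " Hello there."},
        {"start": 2.1, "end": 4.0, "text": " How are you?"},
    ],
    "language": "en",
}
print(full_transcript(sample))  # Hello there. How are you?
```

In Pixeltable, a helper like this could itself be wrapped as a computed column over the transcription result.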
Example: Add a computed column that applies the model tiny.en to an existing Pixeltable column tbl.audio of the table tbl:
tbl.add_computed_column(result=transcribe(tbl.audio, model='tiny.en'))
Add a computed column that applies the model tiny.en to an existing Pixeltable column tbl.audio of the table tbl, with speaker diarization enabled, expecting at least 2 speakers:
tbl.add_computed_column(result=transcribe(tbl.audio, model='tiny.en', diarize=True, min_speakers=2))
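When diarization is enabled, each segment in the result carries a speaker label, which makes per-speaker post-processing straightforward. A minimal sketch, assuming the WhisperX convention of a 'speaker' key on each segment (the sample data is made up for illustration):

```python
from collections import defaultdict

def lines_by_speaker(result: dict) -> dict:
    """Group segment texts by their speaker label."""
    grouped = defaultdict(list)
    for seg in result.get("segments", []):
        grouped[seg.get("speaker", "UNKNOWN")].append(seg["text"].strip())
    return dict(grouped)

# Illustrative diarized output with two speakers.
sample = {
    "segments": [
        {"speaker": "SPEAKER_00", "text": " Hi, welcome to the show."},
        {"speaker": "SPEAKER_01", "text": " Thanks for having me."},
        {"speaker": "SPEAKER_00", "text": " Let's get started."},
    ]
}
for speaker, lines in lines_by_speaker(sample).items():
    print(speaker, lines)
```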