WhisperX audio transcription and diarization functions.

UDFs


transcribe() udf

Transcribe an audio file using WhisperX. This UDF runs a transcription model locally via the WhisperX library; it is equivalent to WhisperX's transcribe function, as described in the WhisperX documentation. If diarize=True, speaker diarization is also performed. Several of the UDF parameters are only valid if diarize=True, as noted in the parameters list below. Requirements:
  • pip install whisperx
Signature:
transcribe(
    audio: Audio,
    model: String,
    diarize: Bool,
    compute_type: Optional[String],
    language: Optional[String],
    task: Optional[String],
    chunk_size: Optional[Int],
    alignment_model_name: Optional[String],
    interpolate_method: Optional[String],
    return_char_alignments: Optional[Bool],
    diarization_model_name: Optional[String],
    num_speakers: Optional[Int],
    min_speakers: Optional[Int],
    max_speakers: Optional[Int]
) -> Json
Parameters:
  • audio (Audio): The audio file to transcribe.
  • model (String): The name of the model to use for transcription.
  • diarize (Bool): Whether to perform speaker diarization.
  • compute_type (Optional[String]): The compute type to use for the model (e.g., 'int8', 'float16'). If None, defaults to 'float16' on CUDA devices and 'int8' otherwise.
  • language (Optional[String]): The language code for the transcription (e.g., 'en' for English).
  • task (Optional[String]): The task to perform (e.g., 'transcribe' or 'translate'). Defaults to 'transcribe'.
  • chunk_size (Optional[Int]): The size of the audio chunks to process, in seconds. Defaults to 30.
  • alignment_model_name (Optional[String]): The name of the alignment model to use. If None, uses the default model for the given language. Only valid if diarize=True.
  • interpolate_method (Optional[String]): The method to use for interpolation of the alignment results. If not specified, uses the WhisperX default ('nearest'). Only valid if diarize=True.
  • return_char_alignments (Optional[Bool]): Whether to return character-level alignments. Defaults to False. Only valid if diarize=True.
  • diarization_model_name (Optional[String]): The name of the diarization model to use. Defaults to pyannote/speaker-diarization-3.1. Only valid if diarize=True.
  • num_speakers (Optional[Int]): The number of speakers to expect in the audio. By default, the model will try to detect the number of speakers. Only valid if diarize=True.
  • min_speakers (Optional[Int]): If specified, the minimum number of speakers to expect in the audio. Only valid if diarize=True.
  • max_speakers (Optional[Int]): If specified, the maximum number of speakers to expect in the audio. Only valid if diarize=True.
Returns:
  • Json: A dictionary containing the audio transcription, diarization (if enabled), and various other metadata.
Example: Add a computed column that applies the model tiny.en to an existing Pixeltable column tbl.audio of the table tbl:
tbl.add_computed_column(result=transcribe(tbl.audio, model='tiny.en'))
Add a computed column that applies the model tiny.en to an existing Pixeltable column tbl.audio of the table tbl, with speaker diarization enabled, expecting at least 2 speakers:
tbl.add_computed_column(result=transcribe(tbl.audio, model='tiny.en', diarize=True, min_speakers=2))
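Once the computed column is populated, the stored Json can be post-processed in ordinary Python. The sketch below assumes the result follows WhisperX's standard output shape (a 'segments' list with per-segment text and timestamps); the sample dictionary is hypothetical, standing in for a value retrieved from the table:

```python
# Hypothetical result value, mirroring WhisperX's output structure:
# a 'segments' list of dicts with 'start'/'end' timestamps and 'text'.
result = {
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hello there."},
        {"start": 2.5, "end": 5.0, "text": " How are you?"},
    ],
}

# Join the per-segment texts into a single transcript string.
transcript = " ".join(seg["text"].strip() for seg in result["segments"])
print(transcript)  # Hello there. How are you?
```

With diarize=True, each segment additionally carries a speaker label, so the same loop can group text by speaker.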