WhisperX audio transcription and diarization functions.

UDFs


transcribe() udf

Transcribe an audio file using WhisperX. This UDF runs a transcription model locally via the WhisperX library; it is equivalent to WhisperX's transcribe function, as described in the WhisperX documentation. If diarize=True, speaker diarization is also performed. Several of the UDF parameters are only valid if diarize=True, as noted in the parameters list below. Requirements:
  • pip install whisperx
Signature:
transcribe(
    audio: Audio,
    model: String,
    diarize: Bool,
    compute_type: Optional[String],
    language: Optional[String],
    task: Optional[String],
    chunk_size: Optional[Int],
    alignment_model_name: Optional[String],
    interpolate_method: Optional[String],
    return_char_alignments: Optional[Bool],
    diarization_model_name: Optional[String],
    num_speakers: Optional[Int],
    min_speakers: Optional[Int],
    max_speakers: Optional[Int]
) -> Json
Parameters:
  • audio (Audio): The audio file to transcribe.
  • model (String): The name of the model to use for transcription.
  • diarize (Bool): Whether to perform speaker diarization.
  • compute_type (Optional[String]): The compute type to use for the model (e.g., 'int8', 'float16'). If None, defaults to 'float16' on CUDA devices and 'int8' otherwise.
  • language (Optional[String]): The language code for the transcription (e.g., 'en' for English).
  • task (Optional[String]): The task to perform (e.g., 'transcribe' or 'translate'). Defaults to 'transcribe'.
  • chunk_size (Optional[Int]): The size of the audio chunks to process, in seconds. Defaults to 30.
  • alignment_model_name (Optional[String]): The name of the alignment model to use. If None, uses the default model for the given language. Only valid if diarize=True.
  • interpolate_method (Optional[String]): The method to use for interpolation of the alignment results. If not specified, uses the WhisperX default ('nearest'). Only valid if diarize=True.
  • return_char_alignments (Optional[Bool]): Whether to return character-level alignments. Defaults to False. Only valid if diarize=True.
  • diarization_model_name (Optional[String]): The name of the diarization model to use. Defaults to pyannote/speaker-diarization-3.1. Only valid if diarize=True.
  • num_speakers (Optional[Int]): The number of speakers to expect in the audio. By default, the model will try to detect the number of speakers. Only valid if diarize=True.
  • min_speakers (Optional[Int]): If specified, the minimum number of speakers to expect in the audio. Only valid if diarize=True.
  • max_speakers (Optional[Int]): If specified, the maximum number of speakers to expect in the audio. Only valid if diarize=True.
Returns:
  • Json: A dictionary containing the audio transcription, diarization (if enabled), and various other metadata.
Example: Add a computed column that applies the model tiny.en to an existing Pixeltable column tbl.audio of the table tbl:
tbl.add_computed_column(result=transcribe(tbl.audio, model='tiny.en'))
Add a computed column that applies the model tiny.en to an existing Pixeltable column tbl.audio of the table tbl, with speaker diarization enabled, expecting at least 2 speakers:
tbl.add_computed_column(result=transcribe(tbl.audio, model='tiny.en', diarize=True, min_speakers=2))
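Once the computed column is populated, the stored Json can be post-processed in ordinary Python. The sketch below assumes the result follows WhisperX's standard output shape (a 'segments' list with per-segment text and timestamps); the sample dictionary is hypothetical, standing in for a value retrieved from the table:

```python
# Hypothetical result value, mirroring WhisperX's output structure:
# a 'segments' list of dicts with 'start'/'end' timestamps and 'text'.
result = {
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hello there."},
        {"start": 2.5, "end": 5.0, "text": " How are you?"},
    ],
}

# Join the per-segment texts into a single transcript string.
transcript = " ".join(seg["text"].strip() for seg in result["segments"])
print(transcript)  # Hello there. How are you?
```

With diarize=True, each segment additionally carries a speaker label, so the same loop can group text by speaker.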