module pixeltable.functions.whisperx

WhisperX audio transcription and diarization functions.

udf transcribe()

Signature

@pxt.udf
transcribe(
    audio: pxt.Audio,
    *,
    model: pxt.String,
    diarize: pxt.Bool = False,
    compute_type: pxt.String | None = None,
    language: pxt.String | None = None,
    task: pxt.String | None = None,
    chunk_size: pxt.Int | None = None,
    alignment_model_name: pxt.String | None = None,
    interpolate_method: pxt.String | None = None,
    return_char_alignments: pxt.Bool | None = None,
    diarization_model_name: pxt.String | None = None,
    num_speakers: pxt.Int | None = None,
    min_speakers: pxt.Int | None = None,
    max_speakers: pxt.Int | None = None
) -> pxt.Json

Transcribe an audio file using WhisperX.

This UDF runs a transcription model locally using the WhisperX library, equivalent to the WhisperX transcribe function, as described in the WhisperX library documentation.

If diarize=True, then speaker diarization will also be performed. Several of the UDF parameters are only valid if diarize=True, as documented in the parameters list below.

Requirements:

pip install whisperx

Parameters:

audio (pxt.Audio): The audio file to transcribe.
model (pxt.String): The name of the model to use for transcription.
diarize (pxt.Bool): Whether to perform speaker diarization.
compute_type (pxt.String | None): The compute type to use for the model (e.g., 'int8', 'float16'). If None, defaults to 'float16' on CUDA devices and 'int8' otherwise.
language (pxt.String | None): The language code for the transcription (e.g., 'en' for English).
task (pxt.String | None): The task to perform (e.g., 'transcribe' or 'translate'). Defaults to 'transcribe'.
chunk_size (pxt.Int | None): The size of the audio chunks to process, in seconds. Defaults to 30.
alignment_model_name (pxt.String | None): The name of the alignment model to use. If None, uses the default model for the given language. Only valid if diarize=True.
interpolate_method (pxt.String | None): The method to use for interpolation of the alignment results. If not specified, uses the WhisperX default ('nearest'). Only valid if diarize=True.
return_char_alignments (pxt.Bool | None): Whether to return character-level alignments. Defaults to False. Only valid if diarize=True.
diarization_model_name (pxt.String | None): The name of the diarization model to use. Defaults to pyannote/speaker-diarization-3.1. Only valid if diarize=True.
num_speakers (pxt.Int | None): The number of speakers to expect in the audio. By default, the model will try to detect the number of speakers. Only valid if diarize=True.
min_speakers (pxt.Int | None): If specified, the minimum number of speakers to expect in the audio. Only valid if diarize=True.
max_speakers (pxt.Int | None): If specified, the maximum number of speakers to expect in the audio. Only valid if diarize=True.
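
The parameter rules above (a device-dependent default for compute_type, and several parameters that are only valid if diarize=True) can be sketched as a small pre-flight check. This is an illustrative sketch only: the helper names default_compute_type and check_transcribe_kwargs, and the device argument, are hypothetical and not part of Pixeltable or WhisperX. How the UDF itself reports invalid combinations is not specified here.

```python
from typing import Any

# Parameters documented above as "Only valid if diarize=True".
DIARIZE_ONLY_PARAMS = (
    'alignment_model_name',
    'interpolate_method',
    'return_char_alignments',
    'diarization_model_name',
    'num_speakers',
    'min_speakers',
    'max_speakers',
)


def default_compute_type(device: str) -> str:
    """Mirror the documented default: 'float16' on CUDA devices, 'int8' otherwise."""
    return 'float16' if device == 'cuda' else 'int8'


def check_transcribe_kwargs(device: str, diarize: bool = False, **kwargs: Any) -> dict[str, Any]:
    """Hypothetical helper (an assumption, not a Pixeltable API).

    Rejects diarization-only parameters when diarize is False and fills in
    the documented compute_type default when none is given.
    """
    if not diarize:
        invalid = [name for name in DIARIZE_ONLY_PARAMS if kwargs.get(name) is not None]
        if invalid:
            raise ValueError(f'only valid if diarize=True: {", ".join(invalid)}')
    resolved = dict(kwargs, diarize=diarize)
    if resolved.get('compute_type') is None:
        resolved['compute_type'] = default_compute_type(device)
    return resolved
```

For example, check_transcribe_kwargs('cpu', model='tiny.en', min_speakers=2) raises a ValueError in this sketch, because min_speakers is given without diarize=True.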

Returns:

pxt.Json: A dictionary containing the audio transcription, diarization (if enabled), and various other metadata.
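
The exact layout of the returned dictionary is not spelled out above. As a rough sketch, assuming the result follows the usual WhisperX shape (a 'segments' list whose entries carry 'start', 'end', and 'text', plus a 'speaker' label when diarization is enabled), a transcript could be rendered like this. The helper format_segments and the sample data are hypothetical; check the keys against the actual output before relying on them.

```python
from typing import Any


def format_segments(result: dict[str, Any]) -> list[str]:
    """Render one line per segment; assumes a WhisperX-style 'segments' list."""
    lines = []
    for seg in result.get('segments', []):
        # 'speaker' is assumed to be present only when diarization ran.
        speaker = seg.get('speaker', 'UNKNOWN')
        text = seg.get('text', '').strip()
        lines.append(f"[{seg['start']:.2f}-{seg['end']:.2f}] {speaker}: {text}")
    return lines


# Hand-written sample in the assumed shape, not real model output.
sample = {
    'segments': [
        {'start': 0.0, 'end': 1.5, 'text': ' Hello there.', 'speaker': 'SPEAKER_00'},
        {'start': 1.5, 'end': 3.25, 'text': ' Hi.'},
    ]
}
transcript_lines = format_segments(sample)
```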

Examples:

Add a computed column that applies the model tiny.en to an existing Pixeltable column tbl.audio of the table tbl:

tbl.add_computed_column(result=transcribe(tbl.audio, model='tiny.en'))

Add a computed column that applies the model tiny.en to an existing Pixeltable column tbl.audio of the table tbl, with speaker diarization enabled, expecting at least 2 speakers:

tbl.add_computed_column(
    result=transcribe(
        tbl.audio, model='tiny.en', diarize=True, min_speakers=2
    )
)