WhisperX audio transcription and diarization functions.

udf transcribe()

transcribe(
    audio: Audio,
    *,
    model: String,
    diarize: Bool = False,
    compute_type: String | None = None,
    language: String | None = None,
    task: String | None = None,
    chunk_size: Int | None = None,
    alignment_model_name: String | None = None,
    interpolate_method: String | None = None,
    return_char_alignments: Bool | None = None,
    diarization_model_name: String | None = None,
    num_speakers: Int | None = None,
    min_speakers: Int | None = None,
    max_speakers: Int | None = None
) -> Json
Transcribe an audio file using WhisperX. This UDF runs a transcription model locally using the WhisperX library, equivalent to the WhisperX transcribe function, as described in the WhisperX library documentation. If diarize=True, then speaker diarization will also be performed. Several of the UDF parameters are only valid if diarize=True, as documented in the parameters list below. Requirements:
  • pip install whisperx
Parameters:
  • audio (Audio): The audio file to transcribe.
  • model (String): The name of the model to use for transcription.
  • diarize (Bool): Whether to perform speaker diarization.
  • compute_type (String | None): The compute type to use for the model (e.g., 'int8', 'float16'). If None, defaults to 'float16' on CUDA devices and 'int8' otherwise.
  • language (String | None): The language code for the transcription (e.g., 'en' for English).
  • task (String | None): The task to perform (e.g., 'transcribe' or 'translate'). Defaults to 'transcribe'.
  • chunk_size (Int | None): The size of the audio chunks to process, in seconds. Defaults to 30.
  • alignment_model_name (String | None): The name of the alignment model to use. If None, uses the default model for the given language. Only valid if diarize=True.
  • interpolate_method (String | None): The method to use for interpolation of the alignment results. If not specified, uses the WhisperX default ('nearest'). Only valid if diarize=True.
  • return_char_alignments (Bool | None): Whether to return character-level alignments. Defaults to False. Only valid if diarize=True.
  • diarization_model_name (String | None): The name of the diarization model to use. Defaults to pyannote/speaker-diarization-3.1. Only valid if diarize=True.
  • num_speakers (Int | None): The number of speakers to expect in the audio. By default, the model will try to detect the number of speakers. Only valid if diarize=True.
  • min_speakers (Int | None): If specified, the minimum number of speakers to expect in the audio. Only valid if diarize=True.
  • max_speakers (Int | None): If specified, the maximum number of speakers to expect in the audio. Only valid if diarize=True.
Returns:
  • Json: A dictionary containing the audio transcription, diarization (if enabled), and various other metadata.
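A common follow-up step is to flatten the returned JSON into a single transcript string. The sketch below assumes the WhisperX output shape (a 'segments' list whose entries carry a 'text' field); the sample dict is illustrative, not real model output.

```python
def full_transcript(result: dict) -> str:
    """Join the text of all transcription segments into one string."""
    return " ".join(seg["text"].strip() for seg in result.get("segments", []))

# Illustrative sample mimicking the WhisperX segment structure.
sample = {
    "segments": [
        {"start": 0.0, "end": 2.1, "text": " Hello there."},
        {"start": 2.1, "end": 4.0, "text": " How are you?"},
    ],
    "language": "en",
}
print(full_transcript(sample))  # Hello there. How are you?
```

In Pixeltable, a helper like this could itself be wrapped as a computed column over the transcription result.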
Example: Add a computed column that applies the model tiny.en to an existing Pixeltable column tbl.audio of the table tbl:
tbl.add_computed_column(result=transcribe(tbl.audio, model='tiny.en'))
Add a computed column that applies the model tiny.en to an existing Pixeltable column tbl.audio of the table tbl, with speaker diarization enabled, expecting at least 2 speakers:
tbl.add_computed_column(result=transcribe(tbl.audio, model='tiny.en', diarize=True, min_speakers=2))
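When diarization is enabled, each segment in the result carries a speaker label, which makes per-speaker post-processing straightforward. A minimal sketch, assuming the WhisperX convention of a 'speaker' key on each segment (the sample data is made up for illustration):

```python
from collections import defaultdict

def lines_by_speaker(result: dict) -> dict:
    """Group segment texts by their speaker label."""
    grouped = defaultdict(list)
    for seg in result.get("segments", []):
        grouped[seg.get("speaker", "UNKNOWN")].append(seg["text"].strip())
    return dict(grouped)

# Illustrative diarized output with two speakers.
sample = {
    "segments": [
        {"speaker": "SPEAKER_00", "text": " Hi, welcome to the show."},
        {"speaker": "SPEAKER_01", "text": " Thanks for having me."},
        {"speaker": "SPEAKER_00", "text": " Let's get started."},
    ]
}
for speaker, lines in lines_by_speaker(sample).items():
    print(speaker, lines)
```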