Whisper / WhisperX

Audio Transcription with Whisper in Pixeltable

Pixeltable provides native integration with OpenAI's Whisper speech recognition models for audio transcription. This integration allows you to process audio files or extract and transcribe audio from videos, with automatic handling of model inference and result storage.

Prerequisites

Before using Whisper in Pixeltable, install the required package:

pip install openai-whisper

Quick Start

Here's a simple example to get started with audio transcription:

import pixeltable as pxt
from pixeltable.functions import whisper

# Create a directory and a table for audio files
pxt.create_dir('transcription_demo')
audio_table = pxt.create_table('transcription_demo.audio', {
    'audio': pxt.AudioType()
})

# Add transcription as a computed column
audio_table['transcription'] = whisper.transcribe(
    audio=audio_table.audio,
    model='base.en'  # English-optimized base model
)

# Insert some audio files
audio_table.insert([
    {'audio': 'path/to/audio1.mp3'},
    {'audio': 'path/to/audio2.wav'}
])
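
Once rows are inserted, Pixeltable populates the computed column automatically. A quick way to inspect the results, assuming the table defined above (JSON fields support dot notation):

# Retrieve the full transcription text for each file
results = audio_table.select(
    audio_table.audio,
    audio_table.transcription.text
).collect()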

Available Models

Whisper offers several model sizes with different trade-offs between speed and accuracy:

| Model     | Parameters | English-only Size | Memory Required | Speed     | Accuracy   |
|-----------|------------|-------------------|-----------------|-----------|------------|
| tiny      | 39M        | ~140MB            | ~1GB            | Fastest   | Base level |
| tiny.en   | 39M        | ~140MB            | ~1GB            | Fastest   | Base level |
| base      | 74M        | ~290MB            | ~1.5GB          | Very Fast | Good       |
| base.en   | 74M        | ~290MB            | ~1.5GB          | Very Fast | Good       |
| small     | 244M       | ~960MB            | ~2.5GB          | Fast      | Better     |
| small.en  | 244M       | ~960MB            | ~2.5GB          | Fast      | Better     |
| medium    | 769M       | ~3GB              | ~5GB            | Moderate  | High       |
| medium.en | 769M       | ~3GB              | ~5GB            | Moderate  | High       |
| large     | 1550M      | N/A               | ~10GB           | Slowest   | Highest    |
| large-v2  | 1550M      | N/A               | ~10GB           | Slowest   | Highest    |

Notes:

  • Memory requirements are approximate and may vary based on system configuration
  • '.en' models are optimized for English and typically faster than multilingual versions
  • Processing speed depends heavily on available hardware (CPU/GPU)
  • All models support 16-bit quantization to reduce memory usage
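
If you are unsure which size fits your workload, one option is to compute transcriptions with two models side by side and compare the outputs. A minimal sketch against the Quick Start table; the second column name is illustrative:

# Add a second computed column that uses a larger model
audio_table['transcription_small'] = whisper.transcribe(
    audio=audio_table.audio,
    model='small.en'
)

# Compare the two outputs row by row
audio_table.select(
    audio_table.transcription.text,
    audio_table.transcription_small.text
).collect()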

Video Audio Extraction and Transcription

A common use case is transcribing audio from videos:

from pixeltable.functions.video import extract_audio

# Create a table for videos
videos = pxt.create_table('transcription_demo.videos', {
    'video': pxt.VideoType()
})

# Add computed column for audio extraction
videos['audio'] = extract_audio(videos.video, format='mp3')

# Add transcription
videos['transcription'] = whisper.transcribe(
    audio=videos.audio,
    model='base.en'
)
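
As with the audio table, inserting a video triggers both computed columns in sequence: the audio is extracted first, then transcribed. A short usage sketch (the file path is a placeholder):

# Insert a video; audio extraction and transcription run automatically
videos.insert([{'video': 'path/to/video1.mp4'}])

# Inspect the transcribed text
videos.select(videos.video, videos.transcription.text).collect()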

Understanding Transcription Results

The Whisper transcribe function returns a JSON structure containing:

{
    "text": str,           # The full transcription
    "segments": [          # List of transcribed segments
        {
            "id": int,     # Segment ID
            "start": float,# Start time in seconds
            "end": float,  # End time in seconds
            "text": str,   # Segment text
        },
        ...
    ],
    "language": str       # Detected language
}
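
Because the result is stored as JSON, its fields can be queried directly with path expressions. A hedged sketch against the Quick Start table:

# Select the full text and the detected language
audio_table.select(
    audio_table.transcription.text,
    audio_table.transcription.language
).collect()

# Segment lists can be selected whole and post-processed in Python
audio_table.select(audio_table.transcription.segments).collect()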

Advanced Usage

Customizing Transcription Parameters

# Add transcription with custom parameters
audio_table['custom_transcription'] = whisper.transcribe(
    audio=audio_table.audio,
    model='medium',
    temperature=[0.0, 0.2, 0.4],  # Multiple temperatures for sampling
    no_speech_threshold=0.6,      # Threshold for filtering out non-speech
    word_timestamps=True,         # Get word-level timing
    initial_prompt="Meeting transcript:",  # Context for better transcription
    condition_on_previous_text=True  # Use previous text for context
)
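
With word_timestamps=True, openai-whisper attaches a words list (word text plus start/end times) to each segment. A sketch of pulling those timings back out in Python; the segs output name is arbitrary, and indexing the result set by name is assumed to return a column of values:

# Fetch the segment lists and flatten the word-level timings
rows = audio_table.select(
    segs=audio_table.custom_transcription.segments
).collect()
for seg_list in rows['segs']:
    for seg in seg_list:
        for word in seg.get('words', []):
            print(f"{word['start']:.2f}-{word['end']:.2f} {word['word']}")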

Best Practices

  1. Model Selection (see the sketch after this list)
    • Use 'tiny' or 'base' models for quick prototyping
    • Use 'small' or 'medium' for production use
    • Reserve 'large' for cases requiring maximum accuracy
  2. Performance Optimization
    • Process audio in parallel when possible
    • Use an appropriate audio format and quality
    • Consider chunking long audio files
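
One lightweight way to follow the model-selection advice is to treat the model name as configuration rather than a literal. A minimal sketch; the WHISPER_MODEL environment variable is an assumption, not a Pixeltable convention:

import os

# 'base.en' for prototyping; override with 'small.en' or 'medium.en' in production
model_name = os.environ.get('WHISPER_MODEL', 'base.en')

audio_table['transcription'] = whisper.transcribe(
    audio=audio_table.audio,
    model=model_name
)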

WhisperX Integration in Pixeltable

WhisperX is an enhanced version of Whisper that provides faster transcription with word-level timestamps and speaker diarization. Pixeltable provides integration with WhisperX through its extended functions package (pixeltable.ext).

Prerequisites

Install WhisperX before using it in Pixeltable:

pip install whisperx 

Quick Start

import pixeltable as pxt
from pixeltable.ext.functions import whisperx

# Create a table for audio files
audio_table = pxt.create_table('transcription_demo.audio', {
    'audio': pxt.AudioType()
})

# Add WhisperX transcription as a computed column
audio_table['transcription'] = whisperx.transcribe(
    audio=audio_table.audio,
    model='base.en',
    compute_type='float16'  # Optimize for GPU memory
)
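
Usage then mirrors the standard Whisper integration: inserting audio files populates the computed column, and the JSON output can be queried the same way (paths are placeholders):

# Insert audio; WhisperX transcription runs automatically
audio_table.insert([{'audio': 'path/to/audio1.mp3'}])

# Retrieve the transcription JSON
audio_table.select(audio_table.transcription).collect()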

Configuration Options

# Full configuration example
audio_table['detailed_transcription'] = whisperx.transcribe(
    audio=audio_table.audio,
    model='large-v2',
    compute_type='float16',  # Options: float32, float16, int8
    language='en',           # Specify language or None for auto-detect
    chunk_size=30           # Seconds per processing chunk
)
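
Since float16 generally requires a GPU, one reasonable pattern is to choose the compute type at runtime. A sketch assuming PyTorch is available (it is a WhisperX dependency):

import torch

# float16 on GPU for speed and memory savings; int8 as a CPU-friendly fallback
compute_type = 'float16' if torch.cuda.is_available() else 'int8'

audio_table['transcription'] = whisperx.transcribe(
    audio=audio_table.audio,
    model='base.en',
    compute_type=compute_type
)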

Whisper vs. WhisperX

| Feature | WhisperX | Standard Whisper | Notes |
|---------|----------|------------------|-------|
| Speed | 3-5x faster | Baseline | WhisperX uses batched inference and optimized VAD |
| Initialization | from pixeltable.ext.functions import whisperx | from pixeltable.functions import whisper | WhisperX is in ext package due to experimental status |
| Word Timestamps | Built-in, high accuracy | Optional, less accurate | WhisperX uses forced alignment for better precision |
| Speaker Diarization | Supported via pyannote.audio | Not available | Requires additional HuggingFace token |
| VAD Filtering | Advanced, removes silence | Basic | Better noise and silence handling |
| GPU Memory Usage | Optimized (float16 support) | Standard | WhisperX offers more memory optimization options |
| Installation | pip install whisperx | pip install openai-whisper | WhisperX requires additional dependencies |
| Language Support | Same as Whisper | Same as Whisper | Both support 90+ languages |
| Chunk Processing | Built-in efficient chunking | Manual chunking needed | WhisperX handles long audio better |
| Model Options | tiny.en through large-v2 | tiny through large-v3 | Same model architecture, different processing |
| Basic Usage | whisperx.transcribe(audio, model='base.en') | whisper.transcribe(audio, model='base.en') | Similar API, different parameters |
| Compute Types | float32, float16, int8 | float32 only | More flexibility in memory/speed trade-offs |
| Stability | Experimental | Stable | Standard Whisper has longer-term support |
| Output Format | Extended JSON with speaker info | Basic JSON | WhisperX adds speaker and alignment data |
| Alignment | Forced alignment with models | Basic alignment | Better timestamp accuracy |
| Parameters | model, compute_type, language, chunk_size | model, temperature, condition_on_previous_text, ... | Different parameter sets |
| Error Handling | Similar to Whisper | Standard Pixeltable error handling | Both use Pixeltable's error system |
| Memory Required | Lower with optimizations | Higher baseline | WhisperX can run larger models in less memory |
| Batch Processing | Native support | Manual implementation needed | Better for processing multiple files |
| Real-time Factor | 0.2x-1.0x RT | 0.5x-2.0x RT | WhisperX generally faster |
| Use Cases | Professional transcription, speaker identification, large-scale processing | Simple transcription, development work, standard workflows | Choose based on requirements |
| Integration Cost | Higher (more dependencies) | Lower (simpler setup) | Trade-off between features and simplicity |
| Maintenance | Community-driven | OpenAI maintained | Consider long-term support needs |
| Customization | More options, higher complexity | Fewer options, simpler | WhisperX offers more control |

Key Considerations for Choosing:

  • Use WhisperX when you need:
    • Faster processing speed
    • Speaker diarization
    • More accurate word timestamps
    • Memory optimization options
    • Batch processing capabilities
  • Use Standard Whisper when you need:
    • Simpler implementation
    • Stable, long-term support
    • Basic transcription without speakers
    • Minimal dependencies
    • Production reliability
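
If you are still undecided, both integrations can run over the same audio column, which makes a direct comparison straightforward. A sketch combining the imports from the sections above; note that the output formats differ, as described in the comparison table:

from pixeltable.functions import whisper
from pixeltable.ext.functions import whisperx

# Compute both transcriptions over the same audio column
audio_table['whisper_out'] = whisper.transcribe(
    audio=audio_table.audio, model='base.en'
)
audio_table['whisperx_out'] = whisperx.transcribe(
    audio=audio_table.audio, model='base.en'
)

# Compare the outputs side by side
audio_table.select(
    audio_table.whisper_out.text,
    audio_table.whisperx_out
).collect()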