Whisper / WhisperX
Audio Transcription with Whisper in Pixeltable
Pixeltable provides native integration with OpenAI's Whisper speech recognition models for audio transcription. This integration lets you transcribe audio files directly or extract and transcribe the audio track of videos, with model inference and result storage handled automatically.
Prerequisites
Before using Whisper in Pixeltable, install the required package:
pip install openai-whisper
Quick Start
Here's a simple example to get started with audio transcription:
import pixeltable as pxt
from pixeltable.functions import whisper
# Create a directory and a table for audio files
pxt.create_dir('transcription_demo')
audio_table = pxt.create_table('transcription_demo.audio', {
    'audio': pxt.AudioType()
})
# Add transcription as a computed column
audio_table['transcription'] = whisper.transcribe(
    audio=audio_table.audio,
    model='base.en'  # English-optimized base model
)

# Insert some audio files
audio_table.insert([
    {'audio': 'path/to/audio1.mp3'},
    {'audio': 'path/to/audio2.wav'}
])
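Once rows are inserted, Pixeltable runs the model and stores the output in the computed column. As a quick check (assuming the paths above point to real audio files), you can select the text field of the JSON result:
# Pull the full transcription text for each file
audio_table.select(
    audio_table.audio,
    audio_table.transcription.text
).collect()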
Available Models
Whisper offers several model sizes with different trade-offs between speed and accuracy:
Model | Parameters | Model Size (fp32) | Memory Required | Speed | Accuracy |
---|---|---|---|---|---|
tiny | 39M | ~140MB | ~1GB | Fastest | Base level |
tiny.en | 39M | ~140MB | ~1GB | Fastest | Base level |
base | 74M | ~290MB | ~1.5GB | Very Fast | Good |
base.en | 74M | ~290MB | ~1.5GB | Very Fast | Good |
small | 244M | ~960MB | ~2.5GB | Fast | Better |
small.en | 244M | ~960MB | ~2.5GB | Fast | Better |
medium | 769M | ~3GB | ~5GB | Moderate | High |
medium.en | 769M | ~3GB | ~5GB | Moderate | High |
large | 1550M | ~6GB | ~10GB | Slowest | Highest |
large-v2 | 1550M | ~6GB | ~10GB | Slowest | Highest |
Notes:
- Memory requirements are approximate and may vary based on system configuration
- '.en' models are optimized for English and are typically faster than the multilingual versions; the large models have no '.en' variant
- Processing speed depends heavily on available hardware (CPU/GPU)
- All models can run in 16-bit (fp16) precision on GPU to reduce memory usage
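One way to see these trade-offs directly is to attach one transcription column per model to the same table; Pixeltable computes both, so outputs can be compared row by row. The column names below are illustrative:
# Compare a fast model against a more accurate one side by side
audio_table['transcription_tiny'] = whisper.transcribe(audio=audio_table.audio, model='tiny.en')
audio_table['transcription_small'] = whisper.transcribe(audio=audio_table.audio, model='small.en')
audio_table.select(audio_table.transcription_tiny.text, audio_table.transcription_small.text).collect()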
Video Audio Extraction and Transcription
A common use case is transcribing audio from videos:
from pixeltable.functions.video import extract_audio

# Create a table for videos (in the same 'transcription_demo' directory)
videos = pxt.create_table('transcription_demo.videos', {
    'video': pxt.VideoType()
})

# Add a computed column that extracts the audio track
videos['audio'] = extract_audio(videos.video, format='mp3')

# Add the transcription
videos['transcription'] = whisper.transcribe(
    audio=videos.audio,
    model='base.en'
)
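Inserting a video then exercises the whole pipeline: the audio track is extracted first, and the transcription is computed from it. For example (the path is a placeholder):
# Insert a video; audio extraction and transcription run automatically
videos.insert([{'video': 'path/to/meeting_recording.mp4'}])

# Inspect the resulting transcript text
videos.select(videos.transcription.text).collect()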
Understanding Transcription Results
The Whisper transcribe function returns a JSON structure containing:
{
    "text": str,         # The full transcription
    "segments": [        # List of transcribed segments
        {
            "id": int,         # Segment ID
            "start": float,    # Start time in seconds
            "end": float,      # End time in seconds
            "text": str,       # Segment text
        },
        ...
    ],
    "language": str      # Detected language
}
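Because the result is stored as JSON, individual fields can be queried with the same path syntax used earlier. For example, to retrieve the detected language together with the per-segment timing:
# Select the detected language and the segment list for each file
audio_table.select(
    audio_table.transcription.language,
    audio_table.transcription.segments
).collect()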
Advanced Usage
Customizing Transcription Parameters
# Add transcription with custom parameters
audio_table['custom_transcription'] = whisper.transcribe(
    audio=audio_table.audio,
    model='medium',
    temperature=[0.0, 0.2, 0.4],           # Fallback temperatures, tried in order if decoding fails
    no_speech_threshold=0.6,               # Threshold for filtering out non-speech
    word_timestamps=True,                  # Get word-level timing
    initial_prompt="Meeting transcript:",  # Context to steer the transcription
    condition_on_previous_text=True        # Use previous text as context
)
Best Practices
- Model Selection
  - Use 'tiny' or 'base' models for quick prototyping
  - Use 'small' or 'medium' for production use
  - Reserve 'large' for cases requiring maximum accuracy
- Performance Optimization
  - Process audio in parallel when possible
  - Use an appropriate audio format and quality
  - Consider chunking long audio files (see the sketch below)
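For the last point, long recordings can be split before insertion so that each piece is transcribed as its own row. Below is a minimal sketch using pydub (an assumption: pydub is not a Pixeltable dependency and requires pip install pydub plus ffmpeg; the chunk_audio helper is hypothetical):
from pydub import AudioSegment  # assumption: requires 'pip install pydub' and ffmpeg

def chunk_audio(path: str, chunk_minutes: int = 10) -> list[str]:
    # Split one long file into fixed-length mp3 chunks and return their paths
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000
    paths = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        out_path = f'{path}.chunk{i}.mp3'
        audio[start:start + chunk_ms].export(out_path, format='mp3')
        paths.append(out_path)
    return paths

# Insert each chunk as its own row; Pixeltable transcribes them independently
audio_table.insert([{'audio': p} for p in chunk_audio('path/to/long_recording.mp3')])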
WhisperX Integration in Pixeltable
WhisperX is an enhanced version of Whisper that provides faster transcription with word-level timestamps and speaker diarization. Pixeltable provides integration with WhisperX through its extended functions package (pixeltable.ext).
Prerequisites
Install WhisperX before using it in Pixeltable:
pip install whisperx
Quick Start
import pixeltable as pxt
from pixeltable.ext.functions import whisperx
# Create a table for audio files (a separate path, so it can coexist with the Whisper table above)
audio_table = pxt.create_table('transcription_demo.audio_wx', {
    'audio': pxt.AudioType()
})

# Add WhisperX transcription as a computed column
audio_table['transcription'] = whisperx.transcribe(
    audio=audio_table.audio,
    model='base.en',
    compute_type='float16'  # Optimize for GPU memory
)
Configuration Options
# Full configuration example
audio_table['detailed_transcription'] = whisperx.transcribe(
    audio=audio_table.audio,
    model='large-v2',
    compute_type='float16',  # Options: float32, float16, int8
    language='en',           # Specify a language, or None for auto-detection
    chunk_size=30            # Seconds per processing chunk
)
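As with standard Whisper, the output lands in a JSON column, so the same path syntax applies; the exact fields (e.g., word-level alignment nested inside each segment) depend on the WhisperX version in use:
# Inspect the WhisperX segments, which carry the word-alignment data
audio_table.select(audio_table.detailed_transcription.segments).collect()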
Whisper vs. WhisperX
Feature | WhisperX | Standard Whisper | Notes |
---|---|---|---|
Speed | 3-5x faster | Baseline | WhisperX uses batched inference and optimized VAD |
Initialization | from pixeltable.ext.functions import whisperx | from pixeltable.functions import whisper | WhisperX is in the ext package due to its experimental status |
Word Timestamps | Built-in, high accuracy | Optional, less accurate | WhisperX uses forced alignment for better precision |
Speaker Diarization | Supported via pyannote.audio | Not available | Requires additional HuggingFace token |
VAD Filtering | Advanced, removes silence | Basic | Better noise and silence handling |
GPU Memory Usage | Optimized (float16 support) | Standard | WhisperX offers more memory optimization options |
Installation | pip install whisperx | pip install openai-whisper | WhisperX requires additional dependencies |
Language Support | Same as Whisper | Same as Whisper | Both support 90+ languages |
Chunk Processing | Built-in efficient chunking | Manual chunking needed | WhisperX handles long audio better |
Model Options | tiny.en through large-v2 | tiny through large-v3 | Same model architecture, different processing |
Basic Usage | whisperx.transcribe(audio, model='base.en') | whisper.transcribe(audio, model='base.en') | Similar API, different parameters |
Compute Types | float32, float16, int8 | float32 only | More flexibility in memory/speed trade-offs |
Stability | Experimental | Stable | Standard Whisper has longer-term support |
Output Format | Extended JSON with speaker info | Basic JSON | WhisperX adds speaker and alignment data |
Alignment | Forced alignment with models | Basic alignment | Better timestamp accuracy |
Parameters | model, compute_type, language, chunk_size | model, temperature, condition_on_previous_text, ... | Different parameter sets |
Error Handling | Similar to Whisper | Standard Pixeltable error handling | Both use Pixeltable's error system |
Memory Required | Lower with optimizations | Higher baseline | WhisperX can run larger models in less memory |
Batch Processing | Native support | Manual implementation needed | Better for processing multiple files |
Real-time Factor | 0.2x - 1.0x RT | 0.5x - 2.0x RT | WhisperX generally faster |
Use Cases | • Professional transcription • Speaker identification • Large-scale processing | • Simple transcription • Development work • Standard workflows | Choose based on requirements |
Integration Cost | Higher (more dependencies) | Lower (simpler setup) | Trade-off between features and simplicity |
Maintenance | Community-driven | OpenAI maintained | Consider long-term support needs |
Customization | More options, higher complexity | Fewer options, simpler | WhisperX offers more control |
Key Considerations for Choosing:
- Use WhisperX when you need:
  - Faster processing speed
  - Speaker diarization
  - More accurate word timestamps
  - Memory optimization options
  - Batch processing capabilities
- Use Standard Whisper when you need:
  - Simpler implementation
  - Stable, long-term support
  - Basic transcription without speakers
  - Minimal dependencies
  - Production reliability