Convert speech to text locally using OpenAI’s open-source Whisper
model—no API key needed.
Problem
You have audio or video files that need transcription. Long files are
memory-intensive to process at once, so you need to split them into
manageable segments.
Solution
What’s in this recipe:
- Transcribe audio files locally with Whisper (no API key)
- Automatically segment long files
- Extract and transcribe audio from videos
You create a view with audio_splitter to break long files into
segments, then add a computed column for transcription. Whisper runs
locally on your machine—no API calls needed.
Setup
%pip install -qU pixeltable openai-whisper
import pixeltable as pxt
from pixeltable.functions import whisper
from pixeltable.functions.audio import audio_splitter
Load audio files
# Create a fresh directory
pxt.drop_dir('audio_demo', force=True)
pxt.create_dir('audio_demo')
Created directory ‘audio_demo’.
<pixeltable.catalog.dir.Dir at 0x169ab36a0>
# Create table for audio files
audio = pxt.create_table('audio_demo/files', {'audio': pxt.Audio})
Created table ‘files’.
# Insert a sample audio file (video files also work - audio is extracted automatically)
audio.insert(
[
{
'audio': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/audio-transcription-demo/Lex-Fridman-Podcast-430-Excerpt-0.mp4'
}
]
)
Inserted 1 row with 0 errors in 1.05 s (0.95 rows/s)
1 row inserted.
Split into segments
Create a view that splits audio into 30-second segments with overlap:
# Split audio into segments for transcription
segments = pxt.create_view(
'audio_demo/segments',
audio,
iterator=audio_splitter(
audio.audio,
duration=30.0, # 30-second segments
overlap=2.0, # 2-second overlap for context
min_segment_duration=5.0, # Drop segments shorter than 5 seconds
),
)
# View the segments
segments.select(segments.segment_start, segments.segment_end).collect()
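To build intuition for how the splitter parameters interact, here is a rough sketch in plain Python. This is an assumption about the boundary arithmetic for illustration, not Pixeltable's actual implementation: segment starts advance by duration minus overlap, and segments shorter than min_segment_duration are dropped.

```python
# Illustrative sketch (not Pixeltable's implementation): compute segment
# boundaries from duration, overlap, and min_segment_duration.
def sketch_segments(total, duration=30.0, overlap=2.0, min_segment_duration=5.0):
    segments = []
    start = 0.0
    step = duration - overlap  # each segment starts this far after the last
    while start < total:
        end = min(start + duration, total)
        if end - start >= min_segment_duration:
            segments.append((start, end))
        start += step
    return segments

# A 65-second file yields segments starting at 0, 28, and 56 seconds.
print(sketch_segments(65.0))
```

The 2-second overlap means each segment repeats the tail of the previous one, so Whisper keeps context across boundaries.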
Transcribe with Whisper
Add a computed column that transcribes each segment:
# Add transcription column (runs locally - no API key needed)
segments.add_computed_column(
transcription=whisper.transcribe(
audio=segments.audio_segment,
model='base.en', # Options: tiny.en, base.en, small.en, medium.en, large
)
)
Added 2 column values with 0 errors in 3.35 s (0.60 rows/s)
2 rows updated.
# Extract just the text
segments.add_computed_column(text=segments.transcription.text)
Added 2 column values with 0 errors in 0.06 s (31.82 rows/s)
2 rows updated.
# View transcriptions with timestamps
segments.select(
segments.segment_start, segments.segment_end, segments.text
).collect()
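If you want a single readable transcript rather than per-segment rows, one option is to stitch the collected text together in plain Python. A minimal sketch, where the `rows` data below is made up to stand in for the result of `segments.collect()`:

```python
# Format per-segment text into a simple timestamped transcript.
# The rows below are illustrative placeholders for collected segment rows.
rows = [
    {'segment_start': 0.0, 'segment_end': 30.0, 'text': 'Welcome to the show.'},
    {'segment_start': 28.0, 'segment_end': 58.0, 'text': 'Today we discuss audio.'},
]

def format_transcript(rows):
    lines = []
    for r in rows:
        stamp = f"[{r['segment_start']:06.1f}-{r['segment_end']:06.1f}]"
        lines.append(f"{stamp} {r['text']}")
    return '\n'.join(lines)

print(format_transcript(rows))
```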
Explanation
Whisper models:
Models ending in .en are English-only and faster. Remove .en for
multilingual support.
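As a small illustration of the naming scheme, the hypothetical helper below (not part of the whisper package) maps a model size and language requirement to a model name. English-only variants exist for tiny, base, small, and medium; large is multilingual only.

```python
# Hypothetical helper: pick a Whisper model name from size and language needs.
# '.en' variants exist for tiny, base, small, and medium; 'large' has none.
def whisper_model(size: str, english_only: bool = False) -> str:
    if english_only and size != 'large':
        return f'{size}.en'
    return size

print(whisper_model('base', english_only=True))  # base.en
print(whisper_model('large', english_only=True))  # large
```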
audio_splitter parameters:
- duration: target length of each segment, in seconds
- overlap: seconds of overlap between consecutive segments, which preserves context across boundaries
- min_segment_duration: segments shorter than this are dropped
Video files work too:
When you insert a video file, Pixeltable automatically extracts the
audio track.
See also