Convert speech to text locally using OpenAI’s open-source Whisper model—no API key needed.

Problem

You have audio or video files that need transcription. Long files are memory-intensive to process at once, so you need to split them into manageable chunks.

Solution

What’s in this recipe:
  • Transcribe audio files locally with Whisper (no API key)
  • Automatically chunk long files
  • Extract and transcribe audio from videos
You create a view with AudioSplitter to break long files into chunks, then add a computed column for transcription. Whisper runs locally on your machine—no API calls needed.

Setup

%pip install -qU pixeltable openai-whisper
import pixeltable as pxt
from pixeltable.iterators import AudioSplitter
from pixeltable.functions import whisper

Load audio files

# Create a fresh directory
pxt.drop_dir('audio_demo', force=True)
pxt.create_dir('audio_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘audio_demo’.
<pixeltable.catalog.dir.Dir at 0x16218a9d0>
# Create table for audio files
audio = pxt.create_table('audio_demo.files', {'audio': pxt.Audio})
Created table ‘files’.
# Insert a sample audio file (video files also work - audio is extracted automatically)
audio.insert([
    {'audio': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/audio-transcription-demo/Lex-Fridman-Podcast-430-Excerpt-0.mp4'}
])
Inserting rows into `files`: 1 rows [00:00, 285.54 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 2 values computed.

Split into chunks

Create a view that splits audio into 30-second chunks with overlap:
# Split audio into chunks for transcription
chunks = pxt.create_view(
    'audio_demo.chunks',
    audio,
    iterator=AudioSplitter.create(
        audio=audio.audio,
        chunk_duration_sec=30.0,  # 30-second chunks
        overlap_sec=2.0,          # 2-second overlap for context
        min_chunk_duration_sec=5.0  # Drop chunks shorter than 5 seconds
    )
)
Inserting rows into `chunks`: 2 rows [00:00, 1257.10 rows/s]
# View the chunks
chunks.select(chunks.start_time_sec, chunks.end_time_sec).collect()
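To build intuition for what the splitter produces, the chunk boundaries can be approximated in plain Python. This is a sketch, not Pixeltable's implementation: it assumes chunk starts advance by `chunk_duration_sec - overlap_sec` so that consecutive chunks share `overlap_sec` of audio, and that trailing chunks shorter than `min_chunk_duration_sec` are dropped — consult the `AudioSplitter` documentation for the exact scheme.

```python
def chunk_bounds(total_sec, chunk_sec=30.0, overlap_sec=2.0, min_sec=5.0):
    """Approximate (start, end) chunk boundaries for an audio file.

    Assumption (not taken from Pixeltable source): starts advance by
    (chunk_sec - overlap_sec), so each chunk overlaps the next by
    overlap_sec; chunks shorter than min_sec are dropped.
    """
    bounds = []
    start = 0.0
    step = chunk_sec - overlap_sec
    while start < total_sec:
        end = min(start + chunk_sec, total_sec)
        if end - start >= min_sec:  # drop too-short trailing chunks
            bounds.append((start, end))
        start += step
    return bounds

# A 65-second file yields three overlapping chunks:
print(chunk_bounds(65.0))
# [(0.0, 30.0), (28.0, 58.0), (56.0, 65.0)]
```

Note how the 2-second overlap makes each chunk start 2 seconds before the previous one ends, giving Whisper context across boundaries.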

Transcribe with Whisper

Add a computed column that transcribes each chunk:
# Add transcription column (runs locally - no API key needed)
chunks.add_computed_column(
    transcription=whisper.transcribe(
        audio=chunks.audio_chunk,
        model='base.en'  # Options: tiny.en, base.en, small.en, medium.en, large
    )
)
Added 2 column values with 0 errors.
2 rows updated, 2 values computed.
# Extract just the text
chunks.add_computed_column(text=chunks.transcription.text)
Added 2 column values with 0 errors.
2 rows updated, 2 values computed.
# View transcriptions with timestamps
chunks.select(chunks.start_time_sec, chunks.end_time_sec, chunks.text).collect()
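Once you have the per-chunk results from `collect()`, you can stitch them into a single transcript in plain Python. A minimal sketch using hypothetical sample rows shaped like the query above (`start_time_sec`, `end_time_sec`, `text`); because chunks overlap by 2 seconds, adjacent pieces may repeat a few words at the seams, which you could dedupe if it matters:

```python
# Hypothetical rows with the same column names the recipe's query returns.
rows = [
    {'start_time_sec': 28.0, 'end_time_sec': 58.0, 'text': 'second chunk of speech.'},
    {'start_time_sec': 0.0, 'end_time_sec': 30.0, 'text': 'First chunk of speech.'},
]

# Order by start time, then join the chunk texts into one transcript.
transcript = ' '.join(
    r['text'].strip() for r in sorted(rows, key=lambda r: r['start_time_sec'])
)
print(transcript)
# First chunk of speech. second chunk of speech.
```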

Explanation

Whisper models: Models ending in .en are English-only and faster. Remove the .en suffix (e.g. 'base' instead of 'base.en') for multilingual support.
AudioSplitter parameters: chunk_duration_sec sets the target chunk length, overlap_sec carries a little context across chunk boundaries, and min_chunk_duration_sec drops trailing chunks too short to transcribe usefully.
Video files work too: When you insert a video file, Pixeltable automatically extracts the audio track.

See also