Convert speech to text locally using OpenAI’s open-source Whisper model—no API key needed.

Problem

You have audio or video files that need transcription. Long files are memory-intensive to process at once, so you need to split them into manageable chunks.

Solution

What’s in this recipe:
  • Transcribe audio files locally with Whisper (no API key)
  • Automatically chunk long files
  • Extract and transcribe audio from videos
You create a view with AudioSplitter to break long files into chunks, then add a computed column for transcription. Whisper runs locally on your machine—no API calls needed.

Setup

%pip install -qU pixeltable openai-whisper
import pixeltable as pxt
from pixeltable.iterators import AudioSplitter
from pixeltable.functions import whisper

Load audio files

# Create a fresh directory
pxt.drop_dir('audio_demo', force=True)
pxt.create_dir('audio_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘audio_demo’.
<pixeltable.catalog.dir.Dir at 0x16218a9d0>
# Create table for audio files
audio = pxt.create_table('audio_demo.files', {'audio': pxt.Audio})
Created table ‘files’.
# Insert a sample audio file (video files also work - audio is extracted automatically)
audio.insert([
    {'audio': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/audio-transcription-demo/Lex-Fridman-Podcast-430-Excerpt-0.mp4'}
])
Inserting rows into `files`: 1 rows [00:00, 285.54 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 2 values computed.

Split into chunks

Create a view that splits audio into 30-second chunks with overlap:
# Split audio into chunks for transcription
chunks = pxt.create_view(
    'audio_demo.chunks',
    audio,
    iterator=AudioSplitter.create(
        audio=audio.audio,
        chunk_duration_sec=30.0,  # 30-second chunks
        overlap_sec=2.0,          # 2-second overlap for context
        min_chunk_duration_sec=5.0  # Drop chunks shorter than 5 seconds
    )
)
Inserting rows into `chunks`: 2 rows [00:00, 1257.10 rows/s]
# View the chunks
chunks.select(chunks.start_time_sec, chunks.end_time_sec).collect()
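To build intuition for what the splitter produces, the chunk boundaries can be approximated in plain Python. This is a sketch, not Pixeltable's implementation: it assumes chunk starts advance by `chunk_duration_sec - overlap_sec` so that consecutive chunks share `overlap_sec` of audio, and that trailing chunks shorter than `min_chunk_duration_sec` are dropped — consult the `AudioSplitter` documentation for the exact scheme.

```python
def chunk_bounds(total_sec, chunk_sec=30.0, overlap_sec=2.0, min_sec=5.0):
    """Approximate (start, end) chunk boundaries for an audio file.

    Assumption (not taken from Pixeltable source): starts advance by
    (chunk_sec - overlap_sec), so each chunk overlaps the next by
    overlap_sec; chunks shorter than min_sec are dropped.
    """
    bounds = []
    start = 0.0
    step = chunk_sec - overlap_sec
    while start < total_sec:
        end = min(start + chunk_sec, total_sec)
        if end - start >= min_sec:  # drop too-short trailing chunks
            bounds.append((start, end))
        start += step
    return bounds

# A 65-second file yields three overlapping chunks:
print(chunk_bounds(65.0))
# [(0.0, 30.0), (28.0, 58.0), (56.0, 65.0)]
```

Note how the 2-second overlap makes each chunk start 2 seconds before the previous one ends, giving Whisper context across boundaries.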

Transcribe with Whisper

Add a computed column that transcribes each chunk:
# Add transcription column (runs locally - no API key needed)
chunks.add_computed_column(
    transcription=whisper.transcribe(
        audio=chunks.audio_chunk,
        model='base.en'  # Options: tiny.en, base.en, small.en, medium.en, large
    )
)
Added 2 column values with 0 errors.
2 rows updated, 2 values computed.
# Extract just the text
chunks.add_computed_column(text=chunks.transcription.text)
Added 2 column values with 0 errors.
2 rows updated, 2 values computed.
# View transcriptions with timestamps
chunks.select(chunks.start_time_sec, chunks.end_time_sec, chunks.text).collect()
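Once you have the per-chunk results from `collect()`, you can stitch them into a single transcript in plain Python. A minimal sketch using hypothetical sample rows shaped like the query above (`start_time_sec`, `end_time_sec`, `text`); because chunks overlap by 2 seconds, adjacent pieces may repeat a few words at the seams, which you could dedupe if it matters:

```python
# Hypothetical rows with the same column names the recipe's query returns.
rows = [
    {'start_time_sec': 28.0, 'end_time_sec': 58.0, 'text': 'second chunk of speech.'},
    {'start_time_sec': 0.0, 'end_time_sec': 30.0, 'text': 'First chunk of speech.'},
]

# Order by start time, then join the chunk texts into one transcript.
transcript = ' '.join(
    r['text'].strip() for r in sorted(rows, key=lambda r: r['start_time_sec'])
)
print(transcript)
# First chunk of speech. second chunk of speech.
```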

Explanation

Whisper models: Models ending in .en are English-only and faster. Remove the .en suffix (e.g. 'base' instead of 'base.en') for multilingual support.
AudioSplitter parameters: chunk_duration_sec sets the target chunk length, overlap_sec carries a little context across chunk boundaries, and min_chunk_duration_sec drops trailing chunks too short to transcribe usefully.
Video files work too: When you insert a video file, Pixeltable automatically extracts the audio track.

See also