Generate YouTube-style chapter timestamps for podcast episodes. WhisperX detects where speech starts and stops, then an LLM summarizes each segment into a chapter title. The end result looks like this (exact titles vary between runs, since they are LLM-generated):
0:00 - Experiencing self vs remembering self
0:10 - Whether memories are the primary source of happiness
0:36 - Controlling how we remember experiences

Problem

You have podcast episodes and want to generate chapter markers — the kind you see in YouTube descriptions or podcast apps. Each chapter needs a timestamp and a short description of what’s being discussed.

Solution

What’s in this recipe:
  1. WhisperX transcribes the audio and detects speech boundaries via VAD (Voice Activity Detection)
  2. json.list_iterator creates a view with one row per speech segment
  3. GPT-4o-mini generates a chapter title for each segment as a computed column
  4. The result is formatted as YouTube-style timestamps

Setup

%pip install -qU pixeltable whisperx openai
import getpass
import os

if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')
import pixeltable as pxt
import pixeltable.functions as pxtf

pxt.drop_dir('podcast_demo', force=True, if_not_exists='ignore')
pxt.create_dir('podcast_demo')

Load a podcast episode

episodes = pxt.create_table(
    'podcast_demo/episodes', {'title': pxt.String, 'audio': pxt.Audio}
)
Created table ‘episodes’.
episodes.insert(
    [
        {
            'title': 'Lex Fridman Podcast Excerpt',
            'audio': 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/audio-transcription-demo/Lex-Fridman-Podcast-430-Excerpt-0.mp4',
        }
    ]
)
Inserted 1 row with 0 errors in 0.82 s (1.22 rows/s)
1 row inserted.
View the data:
episodes.collect()
And the table schema so far:
episodes

Step 1: Transcribe with WhisperX

WhisperX does two things in a single call: it runs Voice Activity Detection (VAD) to find where speech occurs, then transcribes each speech region. The output is a list of segments, each with a start time, an end time, and the text that was spoken. These segments are the raw material for our chapter markers: each segment boundary corresponds to a natural pause or transition in the conversation.
Note: You may see verbose warnings from torchcodec or pyannote when this cell runs — they’re harmless and can be ignored.
episodes.add_computed_column(
    transcription=pxtf.whisperx.transcribe(
        episodes.audio, model='tiny.en'
    )
)
Added 1 column value with 0 errors in 9.81 s (0.10 rows/s)
1 row updated.
episodes.select(
    first_segment=episodes.transcription.segments[0]
).collect()
Each segment is a dict with start, end, and text fields. We can use the ['*'] path expression to extract a field from every segment at once — this returns a list of values:
episodes.select(
    segment_starts=episodes.transcription.segments['*'].start,
    segment_ends=episodes.transcription.segments['*'].end,
    segment_text=episodes.transcription.segments['*'].text,
).collect()
That’s useful for peeking at the data, but the result is still one row with parallel lists — not one row per segment. Here’s what a single segment looks like as a proper row:
episodes.select(
    start=episodes.transcription.segments[0].start,
    end=episodes.transcription.segments[0].end,
    text=episodes.transcription.segments[0].text,
).collect()
We want every segment as its own row like this — not just the first one. That’s what list_iterator does.

Step 2: Create a segments view with list_iterator

To work with individual segments as rows, we create a view using pxtf.json.list_iterator. This iterator takes parallel lists and zips them into one row per element, like converting columns of arrays into a proper table. The keyword argument names (start, end, text) become the column names in the view. Each argument is a Pixeltable expression that evaluates to a JSON list.
Why astype? WhisperX returns an untyped dict, so Pixeltable doesn’t know what types are inside the segments. list_iterator requires typed JSON so it can define the view’s schema. We use astype to declare that .start and .end are lists of floats, and .text is a list of strings:
segments = pxt.create_view(
    'podcast_demo/segments',
    episodes,
    iterator=pxtf.json.list_iterator(
        start=episodes.transcription.segments['*'].start.astype(
            pxt.Json[[float]]
        ),
        end=episodes.transcription.segments['*'].end.astype(
            pxt.Json[[float]]
        ),
        text=episodes.transcription.segments['*'].text.astype(
            pxt.Json[[str]]
        ),
    ),
)
Here is the schema of this view. You can see the four columns added by list_iterator, along with the columns inherited from the episodes table we started with:
segments
One row per segment, with typed columns. From here we can add computed columns that operate on individual segments — no manual JSON wrangling needed.
segments.select(segments.start, segments.end, segments.text).collect()
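As a rough mental model only, the fan-out can be pictured in plain Python. The sketch below does not use Pixeltable at all; the sample `transcription` dict and its values are made up for illustration, and the comprehension and `zip` calls are stand-ins for what the `['*']` path and `list_iterator` do, not their actual implementation.

```python
# Hypothetical sample shaped like the transcription output described above:
# a dict holding a 'segments' list of {start, end, text} dicts (values invented).
transcription = {
    'segments': [
        {'start': 0.0, 'end': 9.5, 'text': 'First segment.'},
        {'start': 10.2, 'end': 35.8, 'text': 'Second segment.'},
        {'start': 36.1, 'end': 58.0, 'text': 'Third segment.'},
    ]
}

# Rough analogue of segments['*'].start and friends: one list per field.
starts = [seg['start'] for seg in transcription['segments']]
ends = [seg['end'] for seg in transcription['segments']]
texts = [seg['text'] for seg in transcription['segments']]

# Rough analogue of the list_iterator fan-out: zip the parallel lists into
# one record per element, keyed by the keyword-argument names.
rows = [
    {'start': s, 'end': e, 'text': t}
    for s, e, t in zip(starts, ends, texts)
]

for row in rows:
    print(row)
```

The difference in the real view is that each record becomes a stored, typed row that computed columns can build on.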

Step 3: Generate chapter titles

Each segment now has its own row. We add a computed column that sends the segment text to GPT-4o-mini for a short chapter title:
segments.add_computed_column(
    title_response=pxtf.openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': pxtf.string.format(
                    'Write a short chapter title (5-10 words) for this podcast segment. '
                    'Return only the title, no quotes or extra punctuation.\n\n{0}',
                    segments.text,
                ),
            }
        ],
        model='gpt-4o-mini',
    ),
    if_exists='replace',
)

segments.add_computed_column(
    chapter_title=segments.title_response.choices[0].message.content,
    if_exists='replace',
)
Added 3 column values with 0 errors in 2.64 s (1.14 rows/s)
Added 3 column values with 0 errors in 0.02 s (142.43 rows/s)
3 rows updated.
segments.select(segments.text, segments.chapter_title).collect()

Step 4: YouTube-style timestamps

Format each segment’s start time with its LLM-generated title:
@pxt.udf
def format_timestamp(start: float, chapter_title: str) -> str:
    mins, secs = divmod(int(start), 60)
    title = chapter_title.strip('"')
    return f'{mins}:{secs:02d} - {title}'


segments.add_computed_column(
    timestamp=format_timestamp(segments.start, segments.chapter_title),
    if_exists='replace',
)
Added 3 column values with 0 errors in 0.04 s (69.92 rows/s)
3 rows updated.
segments.select(segments.timestamp).order_by(segments.start).collect()
Here is a formatted version you can copy/paste directly into your YouTube video description field:
print('\n'.join(segments.order_by(segments.start).collect()['timestamp']))
0:00 - Living in the Moment vs. Reflecting on the Past
0:10 - The Power of Memory in Shaping Happiness
0:36 - Evolving Memories for Lasting Happiness
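The format_timestamp UDF above reports minutes only, so a chapter starting at 3725 seconds would be printed as 62:05. For episodes longer than an hour, an hour-aware variant may be preferable. The following is a minimal plain-Python sketch; the name `format_timestamp_hms` is hypothetical, and it is shown undecorated so it can be tried outside Pixeltable.

```python
def format_timestamp_hms(start: float, chapter_title: str) -> str:
    """Format as m:ss below one hour and as h:mm:ss from one hour onward."""
    total = int(start)  # drop fractional seconds, as the recipe's UDF does
    hours, rem = divmod(total, 3600)
    mins, secs = divmod(rem, 60)
    # Remove surrounding whitespace and any double quotes the LLM may add.
    title = chapter_title.strip().strip('"')
    if hours:
        return f'{hours}:{mins:02d}:{secs:02d} - {title}'
    return f'{mins}:{secs:02d} - {title}'


print(format_timestamp_hms(36.4, 'Evolving Memories for Lasting Happiness'))
# 0:36 - Evolving Memories for Lasting Happiness
print(format_timestamp_hms(3725.0, '"A later chapter"'))
# 1:02:05 - A later chapter
```

Under the assumption that it behaves like any other typed Python function, it could be wrapped with @pxt.udf in the same way as format_timestamp.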

Step 5: Listen to each segment

The segments view inherits the audio column from the base episodes table. We can add a computed column that slices out each segment’s audio clip using its start and end times — useful for spot-checking the transcription. In Jupyter, the audio_clip column renders as an inline audio player you can click to listen:
import os
import subprocess
import tempfile


@pxt.udf
def slice_audio(audio: pxt.Audio, start: float, end: float) -> pxt.Audio:
    """Extract a time range from an audio file using ffmpeg."""
    fd, output_path = tempfile.mkstemp(suffix='.mp4')
    os.close(fd)
    subprocess.run(
        [
            'ffmpeg',
            '-y',
            '-i',
            str(audio),
            '-ss',
            str(start),
            '-to',
            str(end),
            '-c',
            'copy',
            output_path,
        ],
        capture_output=True,
        check=True,
    )
    return output_path


segments.add_computed_column(
    audio_clip=slice_audio(segments.audio, segments.start, segments.end),
    if_exists='replace',
)
Added 3 column values with 0 errors in 0.28 s (10.85 rows/s)
3 rows updated.
segments.select(segments.timestamp, segments.audio_clip).order_by(
    segments.start
).collect()
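One optional refinement, sketched here under assumptions: the ffmpeg argument list can be built by a small pure function that also rejects empty or negative ranges before ffmpeg is invoked. The helper name `build_slice_command` is hypothetical; the flags are the same ones slice_audio passes above.

```python
def build_slice_command(src: str, start: float, end: float, dst: str) -> list[str]:
    """Return the ffmpeg argument list used by slice_audio, after a range check."""
    if start < 0 or end <= start:
        raise ValueError(f'Invalid clip range: start={start}, end={end}')
    return [
        'ffmpeg',
        '-y',               # overwrite dst if it already exists
        '-i', src,
        '-ss', str(start),  # clip start, in seconds
        '-to', str(end),    # clip end, in seconds
        '-c', 'copy',       # copy the stream without re-encoding
        dst,
    ]


print(build_slice_command('episode.mp4', 10.2, 35.8, 'clip.mp4'))
```

Because the function only builds a list, it can be checked without ffmpeg installed; slice_audio could pass its result to subprocess.run.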

Explanation

Pipeline:
Audio → WhisperX (VAD + transcription) → list_iterator view (1 row per segment) → GPT-4o-mini per row → timestamps
The episodes table holds the raw audio and transcription. The segments view fans out the transcription’s segment list into individual rows using json.list_iterator. Each segment row then gets its own LLM call and timestamp, all as computed columns. Insert a new episode and the entire pipeline runs automatically: transcription, segment extraction, chapter titling, and timestamp formatting.
How WhisperX finds the chapter boundaries: WhisperX uses PyAnnote VAD to detect where speech occurs in the audio. Pauses, silence, and transitions between speakers create natural segment boundaries. These boundaries become the chapter start times.
Why list_iterator? Without list_iterator, you’d write custom UDFs to extract segments from the JSON, batch them into a single LLM prompt, and parse the response back apart. The view-based approach is more idiomatic: each segment is its own row, and computed columns operate on one row at a time.
Trade-offs:
  • One LLM call per segment instead of one batched call — fine for short podcasts, but consider batching for episodes with 50+ segments
  • This approach requires running a full transcription model just to find where the pauses are
  • For longer episodes, WhisperX’s chunk_size parameter controls how the audio is batched internally
  • The chapter titles depend on LLM quality: gpt-4o-mini is fast and cheap; use gpt-4o for higher quality
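For the batching alternative mentioned in the trade-offs, the prompt construction and response parsing might look roughly like the sketch below. It is not part of this recipe: the helper names `build_batched_prompt` and `parse_batched_titles` are hypothetical, the reply string is a hand-written stand-in for an LLM response, and a real model may not always follow the requested line format, which is why the parser fails loudly on missing entries.

```python
import re


def build_batched_prompt(texts: list[str]) -> str:
    """Number each segment so the titles can be matched back by index."""
    numbered = '\n'.join(
        f'{i}. {t.strip()}' for i, t in enumerate(texts, start=1)
    )
    return (
        'Write a short chapter title (5-10 words) for each numbered podcast '
        'segment below. Reply with one line per segment in the form '
        '"<number>. <title>" and nothing else.\n\n' + numbered
    )


def parse_batched_titles(reply: str, expected: int) -> list[str]:
    """Map 'N. title' lines back to segment order; raise if any are missing."""
    titles: dict[int, str] = {}
    for line in reply.splitlines():
        match = re.match(r'\s*(\d+)[.)]\s*(.+)', line)
        if match:
            titles[int(match.group(1))] = match.group(2).strip().strip('"')
    missing = [i for i in range(1, expected + 1) if i not in titles]
    if missing:
        raise ValueError(f'No title returned for segments: {missing}')
    return [titles[i] for i in range(1, expected + 1)]


texts = ['first segment text', 'second segment text']
prompt = build_batched_prompt(texts)
reply = '1. Opening Thoughts\n2. "Second Topic"'  # stand-in for an LLM reply
print(parse_batched_titles(reply, expected=len(texts)))
```

This is the extra parsing and error handling that the one-row-per-segment view avoids, at the cost of more LLM calls.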

Last modified on June 23, 2026