Generate YouTube-style chapter timestamps for podcast episodes. WhisperX
detects where speech starts and stops, then an LLM summarizes each
segment into a chapter title.
The end result looks like this:
0:00 - Experiencing self vs remembering self
0:10 - Whether memories are the primary source of happiness
0:36 - Controlling how we remember experiences
Problem
You have podcast episodes and want to generate chapter markers — the
kind you see in YouTube descriptions or podcast apps. Each chapter needs
a timestamp and a short description of what’s being discussed.
Solution
What’s in this recipe:
- WhisperX transcribes the audio and detects speech boundaries via
VAD (Voice Activity Detection)
- json.list_iterator creates a view with one row per speech
segment
- GPT-4o-mini generates a chapter title for each segment as a
computed column
- The result is formatted as YouTube-style timestamps
Setup
%pip install -qU pixeltable whisperx openai
import getpass
import os
if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')
import pixeltable as pxt
import pixeltable.functions as pxtf
pxt.drop_dir('podcast_demo', force=True, if_not_exists='ignore')
pxt.create_dir('podcast_demo')
Load a podcast episode
episodes = pxt.create_table(
    'podcast_demo/episodes', {'title': pxt.String, 'audio': pxt.Audio}
)
Created table ‘episodes’.
episodes.insert(
    [
        {
            'title': 'Lex Fridman Podcast Excerpt',
            'audio': 'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/audio-transcription-demo/Lex-Fridman-Podcast-430-Excerpt-0.mp4',
        }
    ]
)
Inserted 1 row with 0 errors in 0.82 s (1.22 rows/s)
1 row inserted.
View the data:
And the table schema so far:
Step 1: Transcribe with WhisperX
WhisperX does two things at once: it runs Voice Activity Detection (VAD)
to find where speech occurs, then transcribes each speech region. The
output is a list of segments — each with a start time, end time, and the
text that was spoken.
These segments are the raw material for our chapter markers. Each
segment boundary corresponds to a natural pause or transition in the
conversation.
Note: You may see verbose warnings from torchcodec or pyannote
when this cell runs — they’re harmless and can be ignored.
episodes.add_computed_column(
    transcription=pxtf.whisperx.transcribe(
        episodes.audio, model='tiny.en'
    )
)
Added 1 column value with 0 errors in 9.81 s (0.10 rows/s)
1 row updated.
episodes.select(
    first_segment=episodes.transcription.segments[0]
).collect()
Each segment is a dict with start, end, and text fields. We can
use the ['*'] path expression to extract a field from every segment at
once — this returns a list of values:
episodes.select(
    segment_starts=episodes.transcription.segments['*'].start,
    segment_ends=episodes.transcription.segments['*'].end,
    segment_text=episodes.transcription.segments['*'].text,
).collect()
That’s useful for peeking at the data, but the result is still one row
with parallel lists — not one row per segment. Here’s what a single
segment looks like as a proper row:
episodes.select(
    start=episodes.transcription.segments[0].start,
    end=episodes.transcription.segments[0].end,
    text=episodes.transcription.segments[0].text,
).collect()
We want every segment as its own row like this — not just the first one.
That’s what list_iterator does.
Step 2: Create a segments view with list_iterator
To work with individual segments as rows, we create a view using
pxtf.json.list_iterator. This iterator takes parallel lists and zips
them into one row per element — like converting columns of arrays into a
proper table.
The keyword argument names (start, end, text) become the column
names in the view. Each argument is a Pixeltable expression that
evaluates to a JSON list.
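As a rough mental model only (plain Python, not Pixeltable's implementation), zipping parallel lists into one row per element might look like the sketch below. The helper name zip_to_rows and the sample values are hypothetical.

```python
from typing import Any


def zip_to_rows(**columns: list[Any]) -> list[dict[str, Any]]:
    """Turn parallel lists into one dict per element, keyed by argument name."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError('all lists must have the same length')
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Hypothetical sample values, shaped like the parallel lists from Step 1.
rows = zip_to_rows(
    start=[0.0, 10.5, 36.2],
    end=[10.1, 35.9, 58.0],
    text=['first segment', 'second segment', 'third segment'],
)
for row in rows:
    print(row)
```

The keyword names play the same role here as in the view: each one becomes a field of every resulting row.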
Why astype? WhisperX returns an untyped dict, so Pixeltable
doesn’t know what types are inside the segments. list_iterator
requires typed JSON so it can define the view’s schema. We use astype
to declare that .start and .end are lists of floats, and .text is
a list of strings:
segments = pxt.create_view(
    'podcast_demo/segments',
    episodes,
    iterator=pxtf.json.list_iterator(
        start=episodes.transcription.segments['*'].start.astype(
            pxt.Json[[float]]
        ),
        end=episodes.transcription.segments['*'].end.astype(
            pxt.Json[[float]]
        ),
        text=episodes.transcription.segments['*'].text.astype(
            pxt.Json[[str]]
        ),
    ),
)
Here is the schema of this view - you can see the four columns added by
the list_iterator, and the other columns that came from the episodes
table we started with:
One row per segment, with typed columns. From here we can add computed
columns that operate on individual segments — no manual JSON wrangling
needed.
segments.select(segments.start, segments.end, segments.text).collect()
Step 3: Generate chapter titles
Each segment now has its own row. We add a computed column that sends
the segment text to GPT-4o-mini for a short chapter title:
segments.add_computed_column(
    title_response=pxtf.openai.chat_completions(
        messages=[
            {
                'role': 'user',
                'content': pxtf.string.format(
                    'Write a short chapter title (5-10 words) for this podcast segment. '
                    'Return only the title, no quotes or extra punctuation.\n\n{0}',
                    segments.text,
                ),
            }
        ],
        model='gpt-4o-mini',
    ),
    if_exists='replace',
)
segments.add_computed_column(
    chapter_title=segments.title_response.choices[0].message.content,
    if_exists='replace',
)
Added 3 column values with 0 errors in 2.64 s (1.14 rows/s)
Added 3 column values with 0 errors in 0.02 s (142.43 rows/s)
3 rows updated.
segments.select(segments.text, segments.chapter_title).collect()
Step 4: YouTube-style timestamps
Format each segment’s start time with its LLM-generated title:
@pxt.udf
def format_timestamp(start: float, chapter_title: str) -> str:
    mins, secs = divmod(int(start), 60)
    title = chapter_title.strip('"')
    return f'{mins}:{secs:02d} - {title}'
segments.add_computed_column(
    timestamp=format_timestamp(segments.start, segments.chapter_title),
    if_exists='replace',
)
Added 3 column values with 0 errors in 0.04 s (69.92 rows/s)
3 rows updated.
segments.select(segments.timestamp).order_by(segments.start).collect()
Here is a formatted version you can copy/paste directly into your
YouTube video description field:
print('\n'.join(segments.order_by(segments.start).collect()['timestamp']))
0:00 - Living in the Moment vs. Reflecting on the Past
0:10 - The Power of Memory in Shaping Happiness
0:36 - Evolving Memories for Lasting Happiness
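The divmod logic in format_timestamp can be checked outside Pixeltable with a plain function. The sketch below also adds an hours field for episodes longer than an hour; that extension, the extra whitespace stripping, and the name format_chapter_line are assumptions for illustration, not part of the recipe.

```python
def format_chapter_line(start: float, chapter_title: str) -> str:
    """Format seconds and a title as 'M:SS - Title', or 'H:MM:SS - Title' past one hour."""
    total = int(start)  # drop fractional seconds, as the recipe's UDF does
    hours, rem = divmod(total, 3600)
    mins, secs = divmod(rem, 60)
    # Strip surrounding whitespace first, then any quotes the LLM wrapped around the title.
    title = chapter_title.strip().strip('"')
    if hours:
        return f'{hours}:{mins:02d}:{secs:02d} - {title}'
    return f'{mins}:{secs:02d} - {title}'


print(format_chapter_line(36.4, '"Controlling how we remember experiences"'))
print(format_chapter_line(3725.0, 'Closing thoughts'))
```

For starts under one hour the output has the same M:SS shape as the UDF above, so the same body could be dropped into the @pxt.udf if long episodes matter.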
Step 5: Listen to each segment
The segments view inherits the audio column from the base episodes
table. We can add a computed column that slices out each segment’s audio
clip using its start and end times — useful for spot-checking the
transcription.
In Jupyter, the audio_clip column renders as an inline audio player
you can click to listen:
import os
import subprocess
import tempfile
@pxt.udf
def slice_audio(audio: pxt.Audio, start: float, end: float) -> pxt.Audio:
    """Extract a time range from an audio file using ffmpeg."""
    fd, output_path = tempfile.mkstemp(suffix='.mp4')
    os.close(fd)
    subprocess.run(
        [
            'ffmpeg',
            '-y',
            '-i',
            str(audio),
            '-ss',
            str(start),
            '-to',
            str(end),
            '-c',
            'copy',
            output_path,
        ],
        capture_output=True,
        check=True,
    )
    return output_path
segments.add_computed_column(
    audio_clip=slice_audio(segments.audio, segments.start, segments.end),
    if_exists='replace',
)
Added 3 column values with 0 errors in 0.28 s (10.85 rows/s)
3 rows updated.
segments.select(segments.timestamp, segments.audio_clip).order_by(
    segments.start
).collect()
Explanation
Pipeline:
Audio → WhisperX (VAD + transcription) → list_iterator view (1 row per segment) → GPT-4o-mini per row → timestamps
The episodes table holds the raw audio and transcription. The
segments view fans out the transcription’s segment list into
individual rows using json.list_iterator. Each segment row then gets
its own LLM call and timestamp — all as computed columns.
Insert a new episode and the entire pipeline runs automatically:
transcription, segment extraction, chapter titling, and timestamp
formatting.
How WhisperX finds the chapter boundaries:
WhisperX uses pyannote VAD to detect where speech occurs in the audio.
Pauses, silence, and transitions between speakers create natural segment
boundaries. These boundaries become the chapter start times.
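To make the idea of pause-based boundaries concrete, here is a toy sketch that merges speech intervals separated by short gaps, so that only longer pauses start a new region. It is not the algorithm pyannote or WhisperX uses; the function name merge_by_pause, the min_pause threshold, and the sample intervals are all hypothetical.

```python
def merge_by_pause(
    intervals: list[tuple[float, float]], min_pause: float = 1.0
) -> list[tuple[float, float]]:
    """Merge speech intervals whose gap to the previous one is shorter than min_pause."""
    merged: list[tuple[float, float]] = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < min_pause:
            # Short gap: extend the current region instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical speech intervals in seconds.
speech = [(0.0, 4.2), (4.5, 9.8), (11.5, 20.0), (20.3, 35.0), (37.0, 58.0)]
chapters = merge_by_pause(speech, min_pause=1.0)
print([start for start, _ in chapters])
```

Raising min_pause yields fewer, longer regions, which is one way to think about why longer pauses make better chapter breaks.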
Why list_iterator?
Without list_iterator, you’d write custom UDFs to extract segments
from the JSON, batch them into a single LLM prompt, and parse the
response back apart. The view-based approach is more idiomatic — each
segment is its own row, and computed columns operate on one row at a
time.
Trade-offs:
- One LLM call per segment instead of one batched call — fine for short
podcasts, but consider batching for episodes with 50+ segments
- This approach requires running a full transcription model just to find
where the pauses are
- For longer episodes, WhisperX’s chunk_size parameter controls how
the audio is batched internally
- The chapter titles depend on LLM quality: gpt-4o-mini is fast and
cheap; use gpt-4o for higher quality
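For the batching trade-off, one possible shape is to number the segments in a single prompt and parse a numbered reply. This is only a sketch: the prompt wording, the helper names, and the assumed "N. Title" reply format are hypothetical, and a real model may not follow the format exactly, which is why the parser leaves missing entries empty.

```python
import re


def build_batched_prompt(texts: list[str]) -> str:
    """Combine all segment texts into one numbered prompt."""
    header = (
        'Write a short chapter title (5-10 words) for each numbered podcast '
        'segment below. Reply with one line per segment as "N. Title".\n\n'
    )
    body = '\n\n'.join(
        f'{i}. {text.strip()}' for i, text in enumerate(texts, start=1)
    )
    return header + body


def parse_batched_titles(reply: str, expected: int) -> list[str]:
    """Map 'N. Title' lines back to segment order; missing entries become ''."""
    titles = [''] * expected
    for line in reply.splitlines():
        match = re.match(r'\s*(\d+)[.)]\s*(.+)', line)
        if match and 1 <= int(match.group(1)) <= expected:
            titles[int(match.group(1)) - 1] = match.group(2).strip()
    return titles


prompt = build_batched_prompt(['text of segment one', 'text of segment two'])
# A made-up reply standing in for the model's output.
reply = '1. First Topic\n2. Second Topic'
print(parse_batched_titles(reply, expected=2))
```

The cost of this design is the parsing step and the need to handle malformed replies, which the per-row computed column avoids.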
See also