Transform a single document, video, image, or audio file into multiple
rows for granular processing.
What’s in this recipe:
- Split documents into text chunks for RAG
- Extract frames or segments from videos
- Tile images for high-resolution analysis
- Chunk audio files for transcription
Problem
You have documents, videos, images, audio, or text that you need to break into smaller
pieces for processing. A PDF needs to be split into chunks for
retrieval-augmented generation. A video needs individual frames for
analysis. Text needs to be divided into sentences or sliding windows.
You need a way to transform one source row into multiple output rows
automatically.
Solution
You create views with iterator functions that split source data into
multiple rows. Pixeltable provides built-in iterators for documents,
videos, images, audio, and strings.
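To make the one-to-many idea concrete, here is a minimal plain-Python sketch of what "one source row becomes several output rows" means. It is only an illustration of the concept, not how Pixeltable implements views; the `expand_rows` helper and the `pos` and `piece` keys are hypothetical names chosen for this example.

```python
from typing import Callable, Iterable, Iterator


def expand_rows(
    rows: Iterable[dict],
    column: str,
    split: Callable[[str], Iterable[str]],
) -> Iterator[dict]:
    """Yield one output row per piece that `split` produces for each source row."""
    for row in rows:
        for pos, piece in enumerate(split(row[column])):
            # Carry the source columns along and record the piece's position.
            yield {**row, 'pos': pos, 'piece': piece}


# One source row with three words expands into three output rows.
source = [{'id': 1, 'content': 'alpha beta gamma'}]
expanded = list(expand_rows(source, 'content', str.split))
```

Each output row keeps the columns of its source row, which is why a chunk, frame, or tile can always be traced back to the file it came from.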
Setup
%pip install -qU pixeltable
Split documents into chunks
Use document_splitter to break documents (PDF, HTML, Markdown, TXT)
into text chunks.
import pixeltable as pxt
from pixeltable.functions.document import document_splitter
pxt.drop_dir('split_demo', force=True)
pxt.create_dir('split_demo')
docs = pxt.create_table('split_demo.docs', {'doc': pxt.Document})
docs.insert([{'doc': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Jefferson-Amazon.pdf'}])
Created directory ‘split_demo’.
Created table ‘docs’.
Inserting rows into `docs`: 1 rows [00:00, 821.77 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 2 values computed.
chunks = pxt.create_view(
'split_demo.doc_chunks',
docs,
iterator=document_splitter(docs.doc, separators='sentence,token_limit', limit=300)
)
chunks.select(chunks.text).limit(3).collect()
Inserting rows into `doc_chunks`: 139 rows [00:00, 42530.51 rows/s]
Available separators:
heading — Split on HTML/Markdown headings
sentence — Split on sentence boundaries (requires spacy)
token_limit — Split by token count (requires tiktoken)
char_limit — Split by character count
page — Split by page (PDF only)
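If you want an intuition for what a character-based limit does, the sketch below greedily packs words into chunks that stay within a character budget. It is a simplified stand-in written for this explanation, not Pixeltable's implementation, and `chunk_by_chars` is a hypothetical helper; the real splitter may break text at different positions.

```python
def chunk_by_chars(text: str, limit: int) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most `limit`
    characters. A single word longer than `limit` becomes its own chunk."""
    if limit <= 0:
        raise ValueError('limit must be positive')
    chunks: list[str] = []
    current = ''
    for word in text.split():
        candidate = word if not current else f'{current} {word}'
        if len(candidate) <= limit or not current:
            # The word fits, or it is an oversized word starting a new chunk.
            current = candidate
        else:
            # The word does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


example_chunks = chunk_by_chars('one two three four', limit=7)
```

Splitting on word boundaries rather than at an exact character offset keeps words intact, at the cost of slightly uneven chunk sizes.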
SDK Reference:
document_splitter
Extract frames from videos
Use frame_iterator to extract frames at specified intervals.
from pixeltable.functions.video import frame_iterator
videos = pxt.create_table('split_demo.videos', {'video': pxt.Video})
videos.insert([{'video': 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/bangkok.mp4'}])
Created table ‘videos’.
Inserting rows into `videos`: 1 rows [00:00, 889.00 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 2 values computed.
frames = pxt.create_view(
'split_demo.frames',
videos,
iterator=frame_iterator(videos.video, fps=1.0)
)
frames.select(frames.frame_idx, frames.pos_msec, frames.frame).limit(3).collect()
Inserting rows into `frames`: 19 rows [00:00, 9346.91 rows/s]
frame_iterator options:
fps — Frames per second to extract
num_frames — Extract an exact number of frames (evenly spaced)
keyframes_only — Extract only keyframes
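The difference between sampling at a fixed rate and requesting a fixed number of frames can be sketched with plain arithmetic. The two helpers below are hypothetical and only approximate the idea; the exact timestamps Pixeltable selects may differ, for example because frames are snapped to positions that actually exist in the video.

```python
import math


def timestamps_at_fps(duration_sec: float, fps: float) -> list[float]:
    """Timestamps in seconds sampled at a fixed rate, starting at 0."""
    if fps <= 0:
        raise ValueError('fps must be positive')
    count = math.ceil(duration_sec * fps)
    return [i / fps for i in range(count)]


def evenly_spaced(duration_sec: float, num_frames: int) -> list[float]:
    """`num_frames` timestamps spread evenly across the clip, starting at 0."""
    if num_frames <= 0:
        raise ValueError('num_frames must be positive')
    step = duration_sec / num_frames
    return [i * step for i in range(num_frames)]


fixed_rate = timestamps_at_fps(3.0, fps=1.0)
fixed_count = evenly_spaced(10.0, num_frames=4)
```

With a fixed rate the number of rows grows with the length of the video; with a fixed count every video contributes the same number of rows regardless of its length.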
SDK Reference:
frame_iterator
Split videos into segments
Use video_splitter to divide videos into smaller clips.
from pixeltable.functions.video import video_splitter
segments = pxt.create_view(
'split_demo.segments',
videos,
iterator=video_splitter(videos.video, duration=5.0, min_segment_duration=1.0)
)
segments.select(segments.segment_start, segments.segment_end, segments.video_segment).limit(3).collect()
Inserting rows into `segments`: 4 rows [00:00, 2046.00 rows/s]
video_splitter options:
duration — Duration of each segment in seconds
overlap — Overlap between segments in seconds
min_segment_duration — Drop last segment if shorter than this
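How these three options interact is easiest to see as boundary arithmetic. The following sketch computes (start, end) pairs for a clip of known length; `segment_bounds` is a hypothetical helper written for this page, and it assumes segments advance by `duration - overlap` and that a trailing segment shorter than the minimum is dropped, which may not match Pixeltable's behavior in every edge case.

```python
def segment_bounds(
    total_sec: float,
    duration: float,
    overlap: float = 0.0,
    min_segment_duration: float = 0.0,
) -> list[tuple[float, float]]:
    """Return (start, end) pairs in seconds covering a clip of length `total_sec`."""
    if duration <= 0:
        raise ValueError('duration must be positive')
    if not 0 <= overlap < duration:
        raise ValueError('overlap must satisfy 0 <= overlap < duration')
    stride = duration - overlap  # distance between consecutive segment starts
    bounds: list[tuple[float, float]] = []
    start = 0.0
    while start < total_sec:
        end = min(start + duration, total_sec)
        is_last = end >= total_sec
        if is_last and (end - start) < min_segment_duration:
            break  # drop a trailing segment that is too short
        bounds.append((start, end))
        if is_last:
            break
        start += stride
    return bounds


# Assumed clip lengths, chosen only for illustration.
no_overlap = segment_bounds(19.0, duration=5.0, min_segment_duration=1.0)
short_tail = segment_bounds(10.5, duration=5.0, min_segment_duration=1.0)
with_overlap = segment_bounds(10.0, duration=4.0, overlap=1.0)
```

Under these assumptions, a 10.5-second clip yields only two segments because the final half-second falls below the one-second minimum.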
SDK Reference:
video_splitter
Split strings into sentences
Use string_splitter to divide text into sentences.
from pixeltable.functions.string import string_splitter
texts = pxt.create_table('split_demo.texts', {'content': pxt.String})
texts.insert([{'content': 'AI data infrastructure simplifies ML workflows. Declarative pipelines update incrementally. This makes development faster and more maintainable.'}])
Created table ‘texts’.
Inserting rows into `texts`: 1 rows [00:00, 814.27 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 1 value computed.
sentences = pxt.create_view(
'split_demo.sentences',
texts,
iterator=string_splitter(texts.content, separators='sentence')
)
sentences.select(sentences.text).collect()
Inserting rows into `sentences`: 3 rows [00:00, 2225.49 rows/s]
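For comparison, a naive sentence splitter can be written with a single regular expression. This sketch is not what string_splitter does internally; `split_sentences` is a hypothetical helper, and a regex like this one breaks incorrectly on abbreviations such as "e.g." or "Dr.", which is one reason to prefer a proper sentence segmenter for real text.

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def split_sentences(text: str) -> list[str]:
    """Split text into sentences using punctuation followed by whitespace."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


demo = (
    'AI data infrastructure simplifies ML workflows. '
    'Declarative pipelines update incrementally. '
    'This makes development faster and more maintainable.'
)
demo_sentences = split_sentences(demo)
```

On the demo text used above, this naive approach also produces three sentences.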
SDK Reference:
string_splitter
Tile images for analysis
Use tile_iterator to divide large images into a grid of smaller tiles.
This is useful for processing high-resolution images that are too large
to analyze at once, or for running object detection on different
regions.
from pixeltable.functions.image import tile_iterator
images = pxt.create_table('split_demo.images', {'image': pxt.Image})
images.insert([{'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/pixeltable-logo-large.png'}])
Created table ‘images’.
Inserting rows into `images`: 1 rows [00:00, 825.81 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 2 values computed.
tiles = pxt.create_view(
'split_demo.tiles',
images,
iterator=tile_iterator(images.image, tile_size=(100, 100))
)
Inserting rows into `tiles`: 176 rows [00:00, 40174.01 rows/s]
tile_iterator options:
tile_size — Size of each tile as (width, height)
overlap — Overlap between adjacent tiles as (width, height)
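The relationship between tile size, overlap, and the resulting grid can be sketched as follows. `tile_boxes` is a hypothetical helper, and the sketch assumes that tiles advance by `tile_size - overlap` and that tiles at the right and bottom edges are clipped to the image; Pixeltable may handle partial edge tiles differently.

```python
def _starts(length: int, tile: int, step: int) -> list[int]:
    """Start offsets along one axis; stop once a tile reaches the far edge."""
    starts = [0]
    while starts[-1] + tile < length:
        starts.append(starts[-1] + step)
    return starts


def tile_boxes(
    image_size: tuple[int, int],
    tile_size: tuple[int, int],
    overlap: tuple[int, int] = (0, 0),
) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) boxes in row-major order."""
    img_w, img_h = image_size
    tile_w, tile_h = tile_size
    ov_w, ov_h = overlap
    if img_w <= 0 or img_h <= 0:
        raise ValueError('image dimensions must be positive')
    if not (0 <= ov_w < tile_w and 0 <= ov_h < tile_h):
        raise ValueError('overlap must be non-negative and smaller than the tile')
    boxes = []
    for top in _starts(img_h, tile_h, tile_h - ov_h):
        for left in _starts(img_w, tile_w, tile_w - ov_w):
            # Clip edge tiles so they never extend past the image.
            boxes.append((left, top, min(left + tile_w, img_w), min(top + tile_h, img_h)))
    return boxes


# Assumed image sizes, chosen only for illustration.
plain_grid = tile_boxes((250, 100), (100, 100))
overlapping = tile_boxes((190, 100), (100, 100), overlap=(10, 0))
```

Overlap is useful when objects may straddle a tile boundary: each object near an edge then appears whole in at least one tile, at the cost of more tiles to process.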
SDK Reference:
tile_iterator
tiles.select(tiles.tile_coord, tiles.tile).sample(n=4).collect()
Split audio into chunks
Use audio_splitter to divide audio files into time-based chunks for
transcription or analysis.
from pixeltable.functions.audio import audio_splitter
audio = pxt.create_table('split_demo.audio', {'audio': pxt.Audio})
audio.insert([{'audio': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/10-minute%20tour%20of%20Pixeltable.mp3'}])
Created table ‘audio’.
Inserting rows into `audio`: 1 rows [00:00, 777.01 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 2 values computed.
audio_chunks = pxt.create_view(
'split_demo.audio_chunks',
audio,
iterator=audio_splitter(audio.audio, chunk_duration_sec=30.0, overlap_sec=2.0)
)
audio_chunks.select(audio_chunks.start_time_sec, audio_chunks.end_time_sec).limit(5).collect()
Inserting rows into `audio_chunks`: 11 rows [00:00, 7493.48 rows/s]
audio_splitter options:
chunk_duration_sec — Duration of each chunk in seconds
overlap_sec — Overlap between chunks in seconds
min_chunk_duration_sec — Drop last chunk if shorter than this
SDK Reference:
audio_splitter
See also