# Iterators

> Use Pixeltable iterators to split documents, video, audio, and images into row-level components for view-based downstream processing.

## What are iterators?

Iterators in Pixeltable are specialized tools for processing and transforming media content. They efficiently break down large files into manageable chunks, enabling analysis at different granularities. Iterators work seamlessly with views to create virtual derived tables without duplicating storage.

In Pixeltable, iterators:

* Process media files incrementally to manage memory efficiently
* Transform single records into multiple output records
* Support various media types including documents, videos, images, and audio
* Integrate with the view system for automated processing pipelines
* Provide configurable parameters for fine-tuning output

Iterators are particularly useful when:

* Working with large media files that can't be processed at once
* Building retrieval systems that require chunked content
* Creating analysis pipelines for multimedia data
* Implementing feature extraction workflows

```python
import pixeltable as pxt
from pixeltable.functions.document import document_splitter

# Create a view using an iterator
chunks = pxt.create_view(
    'docs/chunks',
    documents_table,
    iterator=document_splitter(
        document=documents_table.document,
        separators='sentence,token_limit',
        limit=300
    )
)
```

## Core concepts

* Split documents into chunks by headings, sentences, or token limits
* Extract frames at specified intervals or counts
* Divide images into overlapping or non-overlapping tiles
* Split audio files into time-based chunks with configurable overlap

Iterators are powerful tools for processing large media files.
They work seamlessly with Pixeltable's computed columns and versioning system.

## Available iterators

`document_splitter` splits documents into chunks:

```python
import pixeltable as pxt
from pixeltable.functions.document import document_splitter

# Create view with document chunks
chunks_view = pxt.create_view(
    'docs/chunks',
    docs_table,
    iterator=document_splitter(
        document=docs_table.document,
        separators='sentence,token_limit',
        limit=500,
        metadata='title,heading'
    )
)
```

### Parameters

* `separators`: Choose from `'heading'`, `'sentence'`, `'token_limit'`, `'char_limit'`, `'page'`
* `limit`: Maximum tokens/characters per chunk
* `metadata`: Optional fields like `'title'`, `'heading'`, `'sourceline'`, `'page'`, `'bounding_box'`
* `overlap`: Optional overlap between chunks

`frame_iterator` extracts frames from videos:

```python
import pixeltable as pxt
from pixeltable.functions.video import frame_iterator

# Extract frames at 1 FPS
frames_view = pxt.create_view(
    'videos/frames',
    videos_table,
    iterator=frame_iterator(
        video=videos_table.video,
        fps=1.0
    )
)

# Extract exact number of frames (evenly spaced)
frames_view = pxt.create_view(
    'videos/sampled',
    videos_table,
    iterator=frame_iterator(
        video=videos_table.video,
        num_frames=10  # Extract 10 evenly-spaced frames
    )
)

# Extract only keyframes (I-frames) for efficient processing
keyframes_view = pxt.create_view(
    'videos/keyframes',
    videos_table,
    iterator=frame_iterator(
        video=videos_table.video,
        keyframes_only=True
    )
)
```

### Parameters

* `fps`: Frames per second to extract (can be fractional)
* `num_frames`: Exact number of frames to extract
* `keyframes_only`: Extract only keyframes (I-frames); efficient for quick video scanning
* Only one of `fps`, `num_frames`, or `keyframes_only` can be specified

`video_splitter` splits videos into fixed-duration segments:

```python
import pixeltable as pxt
from pixeltable.functions.video import video_splitter

# Split video into 10-second segments
segments_view = pxt.create_view(
    'videos/segments',
    videos_table,
    iterator=video_splitter(
        video=videos_table.video,
        duration=10.0,
        min_segment_duration=1.0
    )
)
```

### Parameters

* `duration`: Duration of each segment in seconds
* `overlap`: Overlap between segments in seconds
* `min_segment_duration`: Drop last segment if shorter than this value

### Returns

For each segment, yields:

* `segment_start`: Start time of the segment in seconds
* `segment_end`: End time of the segment in seconds
* `video_segment`: The video segment file

`string_splitter` splits text strings:

```python
import pixeltable as pxt
from pixeltable.functions.string import string_splitter

# Split text into sentences
sentences_view = pxt.create_view(
    'texts/sentences',
    texts_table,
    iterator=string_splitter(
        text=texts_table.content,
        separators='sentence'
    )
)
```

### Parameters

* `separators`: Choose from `'sentence'` (requires spaCy)

### Returns

For each chunk, yields:

* `text`: The text chunk

`tile_iterator` divides images into tiles:

```python
import pixeltable as pxt
from pixeltable.functions.image import tile_iterator

# Create tiles with overlap
tiles_view = pxt.create_view(
    'images/tiles',
    images_table,
    iterator=tile_iterator(
        image=images_table.image,
        tile_size=(224, 224),  # Width, Height
        overlap=(32, 32)       # Horizontal, Vertical overlap
    )
)
```

### Parameters

* `tile_size`: Tuple of (width, height) for each tile
* `overlap`: Optional tuple for overlap between tiles

`audio_splitter` splits audio into time-based chunks:

```python
import pixeltable as pxt
from pixeltable.functions.audio import audio_splitter

# Split audio into chunks
chunks_view = pxt.create_view(
    'audio/chunks',
    audio_table,
    iterator=audio_splitter(
        audio=audio_table.audio,
        duration=30.0,            # Split into 30-second chunks
        overlap=2.0,              # 2-second overlap between chunks
        min_segment_duration=5.0  # Drop last chunk if < 5 seconds
    )
)
```

### Parameters

* `duration` (float): Duration of each audio chunk in seconds
* `overlap` (float, default: 0.0): Overlap duration between consecutive chunks in seconds
* `min_segment_duration` (float, default: 0.0): Minimum
duration threshold; the last chunk is dropped if it is shorter than this value

### Returns

For each chunk, yields:

* `start_time_sec`: Start time of the chunk in seconds
* `end_time_sec`: End time of the chunk in seconds
* `audio_chunk`: The audio chunk as `pxt.Audio` type

### Notes

* If the input contains no audio, no chunks are yielded
* The audio file is processed efficiently with proper codec handling
* Supports various audio formats including MP3, AAC, Vorbis, Opus, FLAC

## Common use cases

Split documents for:

* RAG systems
* Text analysis
* Content extraction

Extract frames for:

* Object detection
* Scene classification
* Activity recognition

Create tiles for:

* High-resolution analysis
* Object detection
* Segmentation tasks

Split audio for:

* Speech recognition
* Sound classification
* Audio feature extraction

## Example workflows

```python
import pixeltable as pxt
from pixeltable.functions.document import document_splitter
from pixeltable.functions.huggingface import sentence_transformer

# Create document chunks
chunks = pxt.create_view(
    'rag/chunks',
    docs_table,
    iterator=document_splitter(
        document=docs_table.document,
        separators='sentence,token_limit',
        limit=500
    )
)

# Add embeddings
chunks.add_embedding_index(
    'text',
    string_embed=sentence_transformer.using(
        model_id='all-mpnet-base-v2'
    )
)
```

```python
import pixeltable as pxt
from pixeltable.functions.video import frame_iterator

# Extract frames at 1 FPS
frames = pxt.create_view(
    'detection/frames',
    videos_table,
    iterator=frame_iterator(
        video=videos_table.video,
        fps=1.0
    )
)

# Add object detection
# (`detect_objects` stands in for a detection UDF defined elsewhere)
frames.add_computed_column(detections=detect_objects(frames.frame))
```

```python
import pixeltable as pxt
from pixeltable.functions.audio import audio_splitter

# Split long audio files
chunks = pxt.create_view(
    'audio/chunks',
    audio_table,
    iterator=audio_splitter(
        audio=audio_table.audio,
        duration=30.0
    )
)

# Add transcription
# (`whisper_transcribe` stands in for a transcription UDF defined elsewhere)
chunks.add_computed_column(text=whisper_transcribe(chunks.audio_chunk))
```

```python
import pixeltable as pxt
from pixeltable.functions.video import frame_iterator, make_video

# Extract frames at 1 FPS
frames = pxt.create_view(
    'video/frames',
    videos_table,
    iterator=frame_iterator(
        video=videos_table.video,
        fps=1.0
    )
)

# Process frames (e.g., apply a filter)
frames.add_computed_column(processed=frames.frame.filter('BLUR'))

# Create new videos from processed frames
processed_videos = frames.select(
    frames.video_id,
    make_video(frames.pos, frames.processed)  # Default fps is 25
).group_by(frames.video_id).collect()
```

## Best practices

* Use appropriate chunk sizes
* Consider overlap requirements
* Monitor memory usage with large files
* Balance chunk size vs. processing time
* Use batch processing when possible
* Cache intermediate results

## Tips & tricks

When using `token_limit` with `document_splitter`, ensure the limit accounts for any model context windows in your pipeline.

## Custom iterators with `@pxt.iterator`

You can create your own iterators using the `@pxt.iterator` decorator on a Python generator function. This is the simplest way to define a custom iterator that splits one row into many.

```python
from typing import Iterator, TypedDict

import pixeltable as pxt


class WordRow(TypedDict):
    word: str
    position: int


@pxt.iterator
def word_iterator(text: str) -> Iterator[WordRow]:
    for i, word in enumerate(text.split()):
        yield WordRow(word=word, position=i)


# Use as a view iterator
words_view = pxt.create_view(
    'text/words',
    text_table,
    iterator=word_iterator(text_table.content)
)
```

Use `unstored_cols` to mark columns that should not be persisted:

```python
from typing import Iterator, TypedDict

import pixeltable as pxt


class FrameRow(TypedDict):
    frame: pxt.Image
    timestamp: float


@pxt.iterator(unstored_cols=['frame'])
def my_frame_extractor(video: pxt.Video) -> Iterator[FrameRow]:
    # Custom frame extraction logic
    ...
```

## Additional resources

* Step-by-step guide to building custom iterators
* All built-in iterators
* Chunk documents for RAG
* Extract video frames
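
## Worked sketch: segment boundaries

The `duration`, `overlap`, and `min_segment_duration` parameters documented for `audio_splitter` and `video_splitter` are easier to reason about with concrete numbers. The following is a minimal pure-Python sketch, not Pixeltable's implementation. `segment_bounds` is a hypothetical helper, and the boundary rules it uses are assumptions inferred from the parameter descriptions: each chunk starts `duration - overlap` seconds after the previous one, and only the final chunk is checked against `min_segment_duration`.

```python
from typing import Iterator, Tuple


def segment_bounds(
    total: float,
    duration: float,
    overlap: float = 0.0,
    min_segment_duration: float = 0.0,
) -> Iterator[Tuple[float, float]]:
    """Yield (start, end) times in seconds for fixed-duration chunks.

    Hypothetical helper for illustration only; the rules below are
    assumptions, not Pixeltable's actual splitting logic.
    """
    if duration <= 0:
        raise ValueError('duration must be positive')
    if not 0 <= overlap < duration:
        raise ValueError('overlap must satisfy 0 <= overlap < duration')

    step = duration - overlap  # distance between consecutive chunk starts
    start = 0.0
    while start < total:
        end = min(start + duration, total)
        is_last = end >= total
        # Assumed rule: drop the final chunk if it is too short.
        if is_last and (end - start) < min_segment_duration:
            break
        yield (start, end)
        if is_last:
            break
        start += step


# 70 seconds of audio, 30-second chunks, 2-second overlap, 5-second minimum
print(list(segment_bounds(70.0, 30.0, overlap=2.0, min_segment_duration=5.0)))
# → [(0.0, 30.0), (28.0, 58.0), (56.0, 70.0)]
```

Under the same assumptions, a 60-second input would produce only the first two chunks, because the trailing 4-second remainder falls below the 5-second minimum. An empty input yields nothing, which matches the note that no chunks are yielded when the input contains no audio.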