Building a Multimodal Video Search Workflow
Pixeltable lets you build comprehensive video search workflows that combine audio and visual content:
- Process both the audio track and the sampled frames of each video
- Query your knowledge base by spoken content or by visual concepts
1. Install Dependencies
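The workflow relies on Pixeltable plus the OpenAI client, a local Whisper build, and sentence-transformers. This exact package list is an assumption based on the components used in the sketches below:

```bash
pip install pixeltable openai openai-whisper sentence-transformers
```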
2. Define Your Workflow
Create table.py:
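A minimal sketch of what table.py can contain. It assumes Pixeltable's FrameIterator and AudioSplitter iterators, its openai, whisper, and sentence_transformer function wrappers, and an audio_chunk output column on the splitter; check these names against the current Pixeltable docs, and treat the directory, model, and prompt strings as placeholders:

```python
# table.py: defines the tables, views, and indexes for the workflow.
import pixeltable as pxt
from pixeltable.functions import openai, whisper
from pixeltable.functions.huggingface import sentence_transformer
from pixeltable.functions.video import extract_audio
from pixeltable.iterators import AudioSplitter, FrameIterator

# Start from a clean directory for the workflow's tables and views
pxt.drop_dir('video_search', force=True)
pxt.create_dir('video_search')

# Base table: one row per ingested video
videos = pxt.create_table('video_search.videos', {'video': pxt.Video})

# Extract the audio track from each video as a computed column
videos.add_computed_column(audio=extract_audio(videos.video, format='mp3'))

# One row per frame, sampled at 1 frame per second
frames = pxt.create_view(
    'video_search.frames',
    videos,
    iterator=FrameIterator.create(video=videos.video, fps=1.0),
)

# Describe each frame with a vision model
frames.add_computed_column(
    frame_description=openai.vision(
        prompt='Describe what is happening in this image.',
        image=frames.frame,
        model='gpt-4o-mini',
    )
)

# One row per audio chunk, sized for efficient transcription
chunks = pxt.create_view(
    'video_search.chunks',
    videos,
    iterator=AudioSplitter.create(
        audio=videos.audio,
        chunk_duration_sec=30.0,     # chunk duration
        overlap_sec=2.0,             # overlap between chunks
        min_chunk_duration_sec=5.0,  # drop chunks shorter than this
    ),
)

# Transcribe each chunk with Whisper and pull out the plain text
chunks.add_computed_column(
    transcription=whisper.transcribe(audio=chunks.audio_chunk, model='base.en')
)
chunks.add_computed_column(text=chunks.transcription.text)

# Index transcripts and frame descriptions with the same E5 model
embed = sentence_transformer.using(model_id='intfloat/e5-large-v2')
chunks.add_embedding_index('text', string_embed=embed)
frames.add_embedding_index('frame_description', string_embed=embed)
```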
3. Use Your Workflow
Create app.py:
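A matching sketch of app.py, under the same API assumptions; the video URL and query strings are placeholders:

```python
# app.py: ingest a video and run dual searches against the workflow.
import pixeltable as pxt

videos = pxt.get_table('video_search.videos')
frames = pxt.get_table('video_search.frames')
chunks = pxt.get_table('video_search.chunks')

# Ingest one video; all computed columns and views populate automatically
videos.insert([{'video': 'https://example.com/sample_video.mp4'}])

# Search the spoken content via the transcript index
sim = chunks.text.similarity('discussion about project deadlines')
audio_hits = (
    chunks.order_by(sim, asc=False)
    .select(chunks.text, score=sim)
    .limit(5)
    .collect()
)
print(audio_hits)

# Search the visual content via the frame-description index
sim = frames.frame_description.similarity('a person standing at a whiteboard')
visual_hits = (
    frames.order_by(sim, asc=False)
    .select(frames.frame_description, score=sim)
    .limit(5)
    .collect()
)
print(visual_hits)
```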
What Makes This Different?
True Multimodal Processing
Process both audio and visual content from the same videos:
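For example, these lines from the table.py sketch above (same API assumptions) derive both modalities from one base table:

```python
# Audio and frames both derive from the same videos table (sketch excerpt)
videos.add_computed_column(audio=extract_audio(videos.video, format='mp3'))
frames = pxt.create_view(
    'video_search.frames',
    videos,
    iterator=FrameIterator.create(video=videos.video, fps=1.0),
)
```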
AI-Powered Frame Analysis
Automatic image description using vision models:
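From the same sketch, a vision model turns each sampled frame into searchable text; the model ID and prompt are placeholders:

```python
# Each frame gets a natural-language description (sketch excerpt)
frames.add_computed_column(
    frame_description=openai.vision(
        prompt='Describe what is happening in this image.',
        image=frames.frame,
        model='gpt-4o-mini',
    )
)
```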
Unified Embedding Space
Use the same embedding model for both text and image descriptions:
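In the sketch, one sentence_transformer instance indexes both columns; the E5 model ID is an assumption based on the E5 representations mentioned below:

```python
# One embedding model serves both modalities (sketch excerpt)
embed = sentence_transformer.using(model_id='intfloat/e5-large-v2')
chunks.add_embedding_index('text', string_embed=embed)
frames.add_embedding_index('frame_description', string_embed=embed)
```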
Dual Search Capabilities
Search independently across audio or visual content:
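Each modality gets its own similarity expression (sketch excerpt; the query strings are placeholders):

```python
# Independent similarity expressions per modality (sketch excerpt)
audio_sim = chunks.text.similarity('discussion about project deadlines')
visual_sim = frames.frame_description.similarity('a person at a whiteboard')
```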
Workflow Components
Video Processing
Extracts both audio and visual content (ingestion is sketched after this list):
- Video file ingestion from URLs or local files
- Automatic audio extraction with format selection
- Frame extraction at configurable frame rates
- Preserves timestamps for accurate retrieval
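Ingestion itself is a plain insert; URLs and local paths both work, and the file locations here are placeholders:

```python
# Computed columns and views populate automatically on insert
videos.insert([
    {'video': 'https://example.com/sample_video.mp4'},
    {'video': '/path/to/local_video.mp4'},
])
```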
Visual Content Analysis
Analyzes video frames with AI (the frame-rate knob is sketched after this list):
- Extracts frames at 1 frame per second (configurable)
- Generates natural language descriptions of each frame
- Creates semantic embeddings of visual content
- Enables search by visual concepts
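The sampling rate is just the fps argument of the FrameIterator from the sketch above; for instance, halving it keeps one frame every two seconds:

```python
# fps=0.5 samples one frame every two seconds instead of one per second
frames = pxt.create_view(
    'video_search.frames',
    videos,
    iterator=FrameIterator.create(video=videos.video, fps=0.5),
)
```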
Audio Processing
Handles audio for efficient transcription (the chunking knobs are sketched after this list):
- Smart chunking to optimize transcription
- Configurable chunk duration (30 sec default)
- Overlap between chunks (2 sec default)
- Minimum chunk threshold (5 sec default)
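The three defaults map onto the AudioSplitter parameters assumed in the table.py sketch:

```python
# Chunking knobs with the default values listed above (sketch excerpt)
chunks = pxt.create_view(
    'video_search.chunks',
    videos,
    iterator=AudioSplitter.create(
        audio=videos.audio,
        chunk_duration_sec=30.0,     # chunk duration
        overlap_sec=2.0,             # overlap between chunks
        min_chunk_duration_sec=5.0,  # minimum chunk threshold
    ),
)
```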
Speech-to-Text
Uses OpenAI’s Whisper for transcription (model selection is sketched after this list):
- High-quality speech recognition
- Multiple language support
- Sentence-level segmentation
- Configurable model selection
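Model choice is one parameter on the transcription call from the sketch; 'base.en' is a placeholder, and larger Whisper models trade speed for accuracy:

```python
# Swap the model ID ('base.en', 'small', 'medium', ...) to trade speed for quality
chunks.add_computed_column(
    transcription=whisper.transcribe(audio=chunks.audio_chunk, model='base.en')
)
```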
Vector Search
Implements a unified embedding space (a top-k query is sketched after this list):
- Same embedding model for both modalities
- High-quality E5 vector representations
- Fast similarity search across content types
- Configurable top-k retrieval
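A full retrieval query under the same assumptions; top-k is just the limit() argument, and the query string is a placeholder:

```python
# Rank frames by similarity to the query and keep the top k
top_k = 5
sim = frames.frame_description.similarity('sunset over the ocean')
results = (
    frames.order_by(sim, asc=False)
    .select(frames.frame_description, score=sim)
    .limit(top_k)
    .collect()
)
```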