Twelve Labs provides multimodal embeddings that project text, images,
audio, and video into the same semantic space. This enables true
cross-modal search, the most powerful feature of this integration:
you can search a single video index with a text description, an image,
an audio clip, or another video.
This notebook demonstrates this cross-modal capability with video, then
shows how to apply the same embeddings to other modalities.
Prerequisites
- A Twelve Labs API key (the setup cell below prompts for it if it isn't already set in your environment)
Setup
%pip install -qU pixeltable twelvelabs
import os
import getpass
if 'TWELVELABS_API_KEY' not in os.environ:
    os.environ['TWELVELABS_API_KEY'] = getpass.getpass('Enter your Twelve Labs API key: ')
import pixeltable as pxt
from pixeltable.functions.twelvelabs import embed
# Create a fresh directory for our demo
pxt.drop_dir('twelvelabs_demo', force=True)
pxt.create_dir('twelvelabs_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘twelvelabs_demo’.
<pixeltable.catalog.dir.Dir at 0x31c5da290>
Cross-Modal Video Search
Let’s index a video and search it using text, images, audio, and other
videos - all against the same index.
Create Video Table and Index
from pixeltable.functions.video import video_splitter
# Create a table for videos
video_t = pxt.create_table('twelvelabs_demo.videos', {'video': pxt.Video})
# Insert a sample video
video_url = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness.mp4'
video_t.insert([{'video': video_url}])
Created table ‘videos’.
Inserted 1 row with 0 errors in 2.35 s (0.42 rows/s)
1 row inserted.
# Create a view that segments the video into searchable chunks
# Twelve Labs requires minimum 4 second segments
video_chunks = pxt.create_view(
    'twelvelabs_demo.video_chunks',
    video_t,
    iterator=video_splitter(
        video=video_t.video,
        duration=5.0,
        min_segment_duration=4.0
    )
)
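Before adding the index, it can help to confirm that the splitter actually produced segments. A quick check (the exact count depends on the video's length):
# Sanity check: number of ~5-second segments produced by the splitter
video_chunks.count()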
# Add embedding index for cross-modal search
video_chunks.add_embedding_index(
    'video_segment',
    embedding=embed.using(model_name='marengo3.0')
)
Text to Video Search
Find video segments matching a text description.
sim = video_chunks.video_segment.similarity(string="pink")
video_chunks.order_by(sim, asc=False).limit(3).select(
    video_chunks.video_segment,
    score=sim
).collect()
Image to Video Search
Find video segments similar to an image.
image_query = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness-Screenshot.png'
sim = video_chunks.video_segment.similarity(image=image_query)
video_chunks.order_by(sim, asc=False).limit(2).select(
    video_chunks.video_segment,
    score=sim
).collect()
Video to Video Search
Find video segments similar to another video clip.
video_query = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness-Video-Extract.mp4'
sim = video_chunks.video_segment.similarity(video=video_query)
video_chunks.order_by(sim, asc=False).limit(2).select(
    video_chunks.video_segment,
    score=sim
).collect()
Audio to Video Search
Find video segments with similar audio/speech content.
audio_query = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness-Audio-Extract.m4a'
sim = video_chunks.video_segment.similarity(audio=audio_query)
video_chunks.order_by(sim, asc=False).limit(2).select(
    video_chunks.video_segment,
    score=sim
).collect()
Embedding Options
For video embeddings, you can focus on specific aspects:
- 'visual' - Focus on what you see
- 'audio' - Focus on what you hear
- 'transcription' - Focus on what is said
# Add a visual-only embedding column
video_chunks.add_computed_column(
    visual_embedding=embed(
        video_chunks.video_segment,
        model_name='marengo3.0',
        embedding_option=['visual']
    )
)
video_chunks.select(
    video_chunks.video_segment,
    video_chunks.visual_embedding
).limit(2).collect()
Added 51 column values with 0 errors in 17.13 s (2.98 rows/s)
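The same pattern applies to the other options; for instance, an audio-focused column (a sketch mirroring the cell above; the column name audio_embedding is our choice):
# Add an audio-only embedding column (hypothetical companion to visual_embedding)
video_chunks.add_computed_column(
    audio_embedding=embed(
        video_chunks.video_segment,
        model_name='marengo3.0',
        embedding_option=['audio']
    )
)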
Other Modalities: Text, Images, and Documents
Twelve Labs embeddings also work for text, images, and documents. Here’s
a compact example showing multiple embedding indexes on a single
table.
# Create a multimodal content table
content_t = pxt.create_table(
    'twelvelabs_demo.content',
    {
        'title': pxt.String,
        'description': pxt.String,
        'thumbnail': pxt.Image
    }
)
# Add embedding index on text column
content_t.add_embedding_index(
    'description',
    embedding=embed.using(model_name='marengo3.0')
)
# Add embedding index on image column
content_t.add_embedding_index(
    'thumbnail',
    embedding=embed.using(model_name='marengo3.0')
)
Created table ‘content’.
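You aren't limited to two indexes; the title column could be indexed the same way (a sketch following the identical pattern):
# Hypothetical third index on the title column
content_t.add_embedding_index(
    'title',
    embedding=embed.using(model_name='marengo3.0')
)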
# Insert sample content
content_t.insert([
    {
        'title': 'Beach Sunset',
        'description': 'A beautiful sunset over the ocean with palm trees.',
        'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000025.jpg'
    },
    {
        'title': 'Mountain Hiking',
        'description': 'Hikers climbing a steep mountain trail with scenic views.',
        'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000139.jpg'
    },
    {
        'title': 'City Street',
        'description': 'Busy urban street with cars and pedestrians.',
        'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000042.jpg'
    },
    {
        'title': 'Wildlife Safari',
        'description': 'Elephants and zebras on the African savanna.',
        'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000061.jpg'
    }
])
Inserted 4 rows with 0 errors in 1.15 s (3.47 rows/s)
4 rows inserted.
# Search by text description
sim = content_t.description.similarity(string="outdoor nature adventure")
content_t.order_by(sim, asc=False).limit(2).select(
    content_t.title,
    content_t.description,
    score=sim
).collect()
# Search by image similarity
query_image = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000001.jpg'
sim = content_t.thumbnail.similarity(image=query_image)
content_t.order_by(sim, asc=False).limit(2).select(
    content_t.title,
    content_t.thumbnail,
    score=sim
).collect()
# Cross-modal: Search images using text!
sim = content_t.thumbnail.similarity(string="shoe rack")
content_t.order_by(sim, asc=False).limit(2).select(
    content_t.title,
    content_t.thumbnail,
    score=sim
).collect()
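Because every modality lands in the same space, the reverse direction should also work; here is a sketch that queries the description index with an image, assuming the text index accepts image= queries just as the video index does:
# Cross-modal sketch: search text descriptions using an image query
sim = content_t.description.similarity(image=query_image)
content_t.order_by(sim, asc=False).limit(2).select(
    content_t.title,
    content_t.description,
    score=sim
).collect()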
Summary
Twelve Labs + Pixeltable enables:
- Cross-modal search: Query video with text, images, audio, or other videos
- Multiple indexes per table: Add embedding indexes on different columns
- Embedding options: Focus on visual, audio, or transcription aspects
- All modalities: Text, images, audio, video, and documents
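As a compact recap, the whole pattern from the video section boils down to a few lines (a sketch consolidating the cells above; the table names videos_recap and chunks_recap are ours):
# Recap sketch: store -> split -> index -> search with any modality
t = pxt.create_table('twelvelabs_demo.videos_recap', {'video': pxt.Video})
t.insert([{'video': video_url}])  # reuses the sample URL from above
chunks = pxt.create_view(
    'twelvelabs_demo.chunks_recap',
    t,
    iterator=video_splitter(video=t.video, duration=5.0, min_segment_duration=4.0)
)
chunks.add_embedding_index('video_segment', embedding=embed.using(model_name='marengo3.0'))
sim = chunks.video_segment.similarity(string='people walking down a city street')
chunks.order_by(sim, asc=False).limit(3).collect()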
Learn More
- Pixeltable documentation: https://docs.pixeltable.com
- Twelve Labs documentation: https://docs.twelvelabs.io