Open in Kaggle  Open in Colab  Download Notebook
This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.
Twelve Labs provides multimodal embeddings that project text, images, audio, and video into the same semantic space. This enables true cross-modal search, the most powerful feature of this integration: you can search a single video index using any modality as the query, whether text, images, audio, or other videos.
This notebook demonstrates this cross-modal capability with video, then shows how to apply the same embeddings to other modalities.

Prerequisites
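
A Twelve Labs API key is required. The setup cell below reads it from the TWELVELABS_API_KEY environment variable and prompts for it if it isn't set.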

Setup

%pip install -qU pixeltable twelvelabs
import os
import getpass

if 'TWELVELABS_API_KEY' not in os.environ:
    os.environ['TWELVELABS_API_KEY'] = getpass.getpass('Enter your Twelve Labs API key: ')
import pixeltable as pxt
from pixeltable.functions.twelvelabs import embed

# Create a fresh directory for our demo
pxt.drop_dir('twelvelabs_demo', force=True)
pxt.create_dir('twelvelabs_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘twelvelabs_demo’.
<pixeltable.catalog.dir.Dir at 0x31c5da290>
Let’s index a video and search it using text, images, audio, and other videos, all against the same index.

Create Video Table and Index

from pixeltable.functions.video import video_splitter

# Create a table for videos
video_t = pxt.create_table('twelvelabs_demo.videos', {'video': pxt.Video})

# Insert a sample video
video_url = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness.mp4'
video_t.insert([{'video': video_url}])
Created table ‘videos’.
Inserted 1 row with 0 errors in 2.35 s (0.42 rows/s)
1 row inserted.
# Create a view that segments the video into searchable chunks
# Twelve Labs requires segments of at least 4 seconds
video_chunks = pxt.create_view(
    'twelvelabs_demo.video_chunks',
    video_t,
    iterator=video_splitter(
        video=video_t.video,
        duration=5.0,
        min_segment_duration=4.0
    )
)

# Add embedding index for cross-modal search
video_chunks.add_embedding_index(
    'video_segment',
    embedding=embed.using(model_name='marengo3.0')
)

Search with Text

Find video segments matching a text description.
sim = video_chunks.video_segment.similarity(string="pink")

video_chunks.order_by(sim, asc=False).limit(3).select(
    video_chunks.video_segment,
    score=sim
).collect()

Search with an Image

Find video segments similar to an image.
image_query = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness-Screenshot.png'

sim = video_chunks.video_segment.similarity(image=image_query)

video_chunks.order_by(sim, asc=False).limit(2).select(
    video_chunks.video_segment,
    score=sim
).collect()

Search with a Video

Find video segments similar to another video clip.
video_query = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness-Video-Extract.mp4'

sim = video_chunks.video_segment.similarity(video=video_query)

video_chunks.order_by(sim, asc=False).limit(2).select(
    video_chunks.video_segment,
    score=sim
).collect()

Search with Audio

Find video segments with similar audio or speech content.
audio_query = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness-Audio-Extract.m4a'

sim = video_chunks.video_segment.similarity(audio=audio_query)

video_chunks.order_by(sim, asc=False).limit(2).select(
    video_chunks.video_segment,
    score=sim
).collect()

Embedding Options

For video embeddings, you can focus on specific aspects:
  • 'visual' - Focus on what you see
  • 'audio' - Focus on what you hear
  • 'transcription' - Focus on what is said
# Add a visual-only embedding column
video_chunks.add_computed_column(
    visual_embedding=embed(
        video_chunks.video_segment,
        model_name='marengo3.0',
        embedding_option=['visual']
    )
)

video_chunks.select(
    video_chunks.video_segment,
    video_chunks.visual_embedding
).limit(2).collect()
Added 51 column values with 0 errors in 17.13 s (2.98 rows/s)
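The same pattern extends to the other aspects. A minimal sketch, assuming embedding_option accepts more than one value at a time (the example above passes a single-item list; the combination below is an assumption):
# Assumption: multiple aspects can be combined in one embedding column
video_chunks.add_computed_column(
    audio_embedding=embed(
        video_chunks.video_segment,
        model_name='marengo3.0',
        embedding_option=['audio', 'transcription']
    )
)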

Other Modalities: Text, Images, and Documents

Twelve Labs embeddings also work for text, images, and documents. Here’s a compact example showing multiple embedding indexes on a single table; a document-indexing sketch follows at the end of this section.
# Create a multimodal content table
content_t = pxt.create_table(
    'twelvelabs_demo.content',
    {
        'title': pxt.String,
        'description': pxt.String,
        'thumbnail': pxt.Image
    }
)

# Add embedding index on text column
content_t.add_embedding_index(
    'description',
    embedding=embed.using(model_name='marengo3.0')
)

# Add embedding index on image column
content_t.add_embedding_index(
    'thumbnail',
    embedding=embed.using(model_name='marengo3.0')
)
Created table ‘content’.
# Insert sample content
content_t.insert([
    {
        'title': 'Beach Sunset',
        'description': 'A beautiful sunset over the ocean with palm trees.',
        'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000025.jpg'
    },
    {
        'title': 'Mountain Hiking',
        'description': 'Hikers climbing a steep mountain trail with scenic views.',
        'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000139.jpg'
    },
    {
        'title': 'City Street',
        'description': 'Busy urban street with cars and pedestrians.',
        'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000042.jpg'
    },
    {
        'title': 'Wildlife Safari',
        'description': 'Elephants and zebras on the African savanna.',
        'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000061.jpg'
    }
])
Inserted 4 rows with 0 errors in 1.15 s (3.47 rows/s)
4 rows inserted.
# Search by text description
sim = content_t.description.similarity(string="outdoor nature adventure")

content_t.order_by(sim, asc=False).limit(2).select(
    content_t.title,
    content_t.description,
    score=sim
).collect()
# Search by image similarity
query_image = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000001.jpg'

sim = content_t.thumbnail.similarity(image=query_image)

content_t.order_by(sim, asc=False).limit(2).select(
    content_t.title,
    content_t.thumbnail,
    score=sim
).collect()
# Cross-modal: Search images using text!
sim = content_t.thumbnail.similarity(string="shoe rack")

content_t.order_by(sim, asc=False).limit(2).select(
    content_t.title,
    content_t.thumbnail,
    score=sim
).collect()
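Documents can be indexed the same way once they are split into text chunks. A minimal sketch, assuming Pixeltable's class-based DocumentSplitter iterator (which may differ from the function-style video_splitter used above) and its 'text' output column; the query string is hypothetical:
from pixeltable.iterators import DocumentSplitter

# Hypothetical document table and chunk view
docs_t = pxt.create_table('twelvelabs_demo.docs', {'doc': pxt.Document})
doc_chunks = pxt.create_view(
    'twelvelabs_demo.doc_chunks',
    docs_t,
    iterator=DocumentSplitter.create(document=docs_t.doc, separators='paragraph')
)

# Index the chunk text with the same Twelve Labs embedding
doc_chunks.add_embedding_index(
    'text',
    embedding=embed.using(model_name='marengo3.0')
)

# After inserting documents, search follows the same pattern as above
sim = doc_chunks.text.similarity(string="refund policy")  # hypothetical query
doc_chunks.order_by(sim, asc=False).limit(3).select(doc_chunks.text, score=sim).collect()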

Summary

Twelve Labs + Pixeltable enables:
  • Cross-modal search: Query video with text, images, audio, or other videos
  • Multiple indexes per table: Add embedding indexes on different columns
  • Embedding options: Focus on visual, audio, or transcription aspects
  • All modalities: Text, images, audio, video, and documents

Learn More