Open in Kaggle  Open in Colab  Download Notebook
This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.
Twelve Labs provides multimodal embeddings that project text, images, audio, and video into the same semantic space. This enables true cross-modal search, the most powerful feature of this integration: you can search a single video index using any modality as the query, whether text, images, audio, or other videos.
This notebook demonstrates this cross-modal capability with video, then shows how to apply the same embeddings to other modalities.

Prerequisites
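
A Twelve Labs API key is required. The setup cell below reads it from the TWELVELABS_API_KEY environment variable and prompts for it if it isn't set.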

Setup

%pip install -qU pixeltable twelvelabs
import os
import getpass

if 'TWELVELABS_API_KEY' not in os.environ:
    os.environ['TWELVELABS_API_KEY'] = getpass.getpass('Enter your Twelve Labs API key: ')
import pixeltable as pxt
from pixeltable.functions.twelvelabs import embed

# Create a fresh directory for our demo
pxt.drop_dir('twelvelabs_demo', force=True)
pxt.create_dir('twelvelabs_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘twelvelabs_demo’.
<pixeltable.catalog.dir.Dir at 0x31c5da290>
Let’s index a video and search it using text, images, audio, and other videos, all against the same index.

Create Video Table and Index

from pixeltable.functions.video import video_splitter

# Create a table for videos
video_t = pxt.create_table('twelvelabs_demo.videos', {'video': pxt.Video})

# Insert a sample video
video_url = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness.mp4'
video_t.insert([{'video': video_url}])
Created table ‘videos’.
Inserted 1 row with 0 errors in 2.35 s (0.42 rows/s)
1 row inserted.
# Create a view that segments the video into searchable chunks
# Twelve Labs requires segments of at least 4 seconds
video_chunks = pxt.create_view(
    'twelvelabs_demo.video_chunks',
    video_t,
    iterator=video_splitter(
        video=video_t.video,
        duration=5.0,
        min_segment_duration=4.0
    )
)

# Add embedding index for cross-modal search
video_chunks.add_embedding_index(
    'video_segment',
    embedding=embed.using(model_name='marengo3.0')
)

Search with Text

Find video segments matching a text description.
sim = video_chunks.video_segment.similarity(string="pink")

video_chunks.order_by(sim, asc=False).limit(3).select(
    video_chunks.video_segment,
    score=sim
).collect()

Search with an Image

Find video segments similar to an image.
image_query = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness-Screenshot.png'

sim = video_chunks.video_segment.similarity(image=image_query)

video_chunks.order_by(sim, asc=False).limit(2).select(
    video_chunks.video_segment,
    score=sim
).collect()

Search with a Video

Find video segments similar to another video clip.
video_query = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness-Video-Extract.mp4'

sim = video_chunks.video_segment.similarity(video=video_query)

video_chunks.order_by(sim, asc=False).limit(2).select(
    video_chunks.video_segment,
    score=sim
).collect()

Search with Audio

Find video segments with similar audio or speech content.
audio_query = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness-Audio-Extract.m4a'

sim = video_chunks.video_segment.similarity(audio=audio_query)

video_chunks.order_by(sim, asc=False).limit(2).select(
    video_chunks.video_segment,
    score=sim
).collect()

Embedding Options

For video embeddings, you can focus on specific aspects:
  • 'visual' - Focus on what you see
  • 'audio' - Focus on what you hear
  • 'transcription' - Focus on what is said
# Add a visual-only embedding column
video_chunks.add_computed_column(
    visual_embedding=embed(
        video_chunks.video_segment,
        model_name='marengo3.0',
        embedding_option=['visual']
    )
)

video_chunks.select(
    video_chunks.video_segment,
    video_chunks.visual_embedding
).limit(2).collect()
Added 51 column values with 0 errors in 17.13 s (2.98 rows/s)
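The same pattern extends to the other aspects. A minimal sketch, assuming embedding_option accepts more than one value at a time (the example above passes a single-item list; the combination below is an assumption):
# Assumption: multiple aspects can be combined in one embedding column
video_chunks.add_computed_column(
    audio_embedding=embed(
        video_chunks.video_segment,
        model_name='marengo3.0',
        embedding_option=['audio', 'transcription']
    )
)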

Other Modalities: Text, Images, and Documents

Twelve Labs embeddings also work for text, images, and documents. Here’s a compact example showing multiple embedding indexes on a single table; a document-indexing sketch follows at the end of this section.
# Create a multimodal content table
content_t = pxt.create_table(
    'twelvelabs_demo.content',
    {
        'title': pxt.String,
        'description': pxt.String,
        'thumbnail': pxt.Image
    }
)

# Add embedding index on text column
content_t.add_embedding_index(
    'description',
    embedding=embed.using(model_name='marengo3.0')
)

# Add embedding index on image column
content_t.add_embedding_index(
    'thumbnail',
    embedding=embed.using(model_name='marengo3.0')
)
Created table ‘content’.
# Insert sample content
content_t.insert([
    {
        'title': 'Beach Sunset',
        'description': 'A beautiful sunset over the ocean with palm trees.',
        'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000025.jpg'
    },
    {
        'title': 'Mountain Hiking',
        'description': 'Hikers climbing a steep mountain trail with scenic views.',
        'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000139.jpg'
    },
    {
        'title': 'City Street',
        'description': 'Busy urban street with cars and pedestrians.',
        'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000042.jpg'
    },
    {
        'title': 'Wildlife Safari',
        'description': 'Elephants and zebras on the African savanna.',
        'thumbnail': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000061.jpg'
    }
])
Inserted 4 rows with 0 errors in 1.15 s (3.47 rows/s)
4 rows inserted.
# Search by text description
sim = content_t.description.similarity(string="outdoor nature adventure")

content_t.order_by(sim, asc=False).limit(2).select(
    content_t.title,
    content_t.description,
    score=sim
).collect()
# Search by image similarity
query_image = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000001.jpg'

sim = content_t.thumbnail.similarity(image=query_image)

content_t.order_by(sim, asc=False).limit(2).select(
    content_t.title,
    content_t.thumbnail,
    score=sim
).collect()
# Cross-modal: Search images using text!
sim = content_t.thumbnail.similarity(string="shoe rack")

content_t.order_by(sim, asc=False).limit(2).select(
    content_t.title,
    content_t.thumbnail,
    score=sim
).collect()
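Documents can be indexed the same way once they are split into text chunks. A minimal sketch, assuming Pixeltable's class-based DocumentSplitter iterator (which may differ from the function-style video_splitter used above) and its 'text' output column; the query string is hypothetical:
from pixeltable.iterators import DocumentSplitter

# Hypothetical document table and chunk view
docs_t = pxt.create_table('twelvelabs_demo.docs', {'doc': pxt.Document})
doc_chunks = pxt.create_view(
    'twelvelabs_demo.doc_chunks',
    docs_t,
    iterator=DocumentSplitter.create(document=docs_t.doc, separators='paragraph')
)

# Index the chunk text with the same Twelve Labs embedding
doc_chunks.add_embedding_index(
    'text',
    embedding=embed.using(model_name='marengo3.0')
)

# After inserting documents, search follows the same pattern as above
sim = doc_chunks.text.similarity(string="refund policy")  # hypothetical query
doc_chunks.order_by(sim, asc=False).limit(3).select(doc_chunks.text, score=sim).collect()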

Summary

Twelve Labs + Pixeltable enables:
  • Cross-modal search: Query video with text, images, audio, or other videos
  • Multiple indexes per table: Add embedding indexes on different columns
  • Embedding options: Focus on visual, audio, or transcription aspects
  • All modalities: Text, images, audio, video, and documents

Learn More