Learn more about embedding/vector indexes with this in-depth guide.

What are Embedding Indexes?

Embedding indexes in Pixeltable enable semantic search and similarity-based retrieval across different data modalities. They store vector representations of your content, allowing you to find related items based on meaning rather than exact matching.

Unlike traditional indexes that search by keywords, embedding indexes capture the semantic essence of your data, making it possible to:

  • Find content with similar meaning even when using different words
  • Match content across different modalities (text-to-image, image-to-text)
  • Rank results based on semantic relevance
  • Build powerful retrieval systems for RAG applications

Pixeltable makes working with embeddings simple by:

  • Managing the lifecycle of embedding computations
  • Automatically updating indexes when data changes
  • Providing a unified interface for different embedding models
  • Supporting multiple index types on the same column

import pixeltable as pxt
from pixeltable.functions.huggingface import sentence_transformer

# Create a table and add an embedding index
knowledge_base = pxt.create_table('kb', {
    'content': pxt.String,
    'category': pxt.String
})

# Add an embedding index using a pre-trained model
knowledge_base.add_embedding_index(
    column='content',
    string_embed=sentence_transformer.using(
        model_id='sentence-transformers/all-MiniLM-L12-v2'
    )
)

Overview

Embedding indexes in Pixeltable are:

  • Declarative: Define the index structure and embedding functions once
  • Maintainable: Pixeltable automatically keeps indexes up-to-date on changes
  • Flexible: Support multiple index types on the same column
  • Multimodal: Handle text, images, audio, and documents

In this guide, we’ll create a semantic search system for images and text. Make sure you have the required dependencies installed:

pip install pixeltable sentence-transformers

Phase 1: Setup

The setup phase defines your schema and creates embedding indexes.

import pixeltable as pxt
from pixeltable.functions.huggingface import sentence_transformer

# Start from a clean directory to organize tables (optional)
pxt.drop_dir('knowledge_base', force=True)
pxt.create_dir('knowledge_base')

# Create table
docs = pxt.create_table(
    'knowledge_base.documents',
    {
        'content': pxt.String,
        'metadata': pxt.Json
    }
)

# Create embedding index
embed_model = sentence_transformer.using(
    model_id="intfloat/e5-large-v2"
)
docs.add_embedding_index(
    column='content',
    string_embed=embed_model
)

Supported Index Options

Similarity Metrics

# Available metrics (specified when creating the index):
docs.add_embedding_index(
    column='content',
    string_embed=embed_model,
    metric='cosine'  # Default
    # Other options:
    # metric='ip'    # Inner product
    # metric='l2'    # L2 distance
)
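To build intuition for how these metrics differ, here is a small pure-Python sketch (independent of Pixeltable's implementation) computing all three on toy vectors. Note that cosine similarity ignores vector magnitude, while inner product and L2 distance do not:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product of the two vectors, normalized by their lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def inner_product(a, b):
    # Inner product: grows with vector magnitude, not just direction
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    # L2 (Euclidean) distance: lower means more similar
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice the magnitude

print(cosine(a, b))         # 1.0 — identical direction
print(inner_product(a, b))  # 2.0 — scales with magnitude
print(l2_distance(a, b))    # 1.0 — nonzero despite identical direction
```

Cosine is a sensible default for text embeddings, whose informative signal is usually direction rather than length.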

Index Configuration

# Optional parameters
docs.add_embedding_index(
    column='content',
    idx_name='custom_name',  # Optional index name
    string_embed=embed_model
)

# For an image column, pass an image embedding function instead:
# images.add_embedding_index(column='image', image_embed=img_model)

Phase 2: Insert

The insert phase populates your indexes with data. Pixeltable automatically computes embeddings and maintains index consistency.

# Single insertion (columns passed as keyword arguments)
docs.insert(
    content="Your document text here",
    metadata={"source": "web", "category": "tech"}
)

# Batch insertion
docs.insert([
    {
        "content": "First document",
        "metadata": {"source": "pdf", "category": "science"}
    },
    {
        "content": "Second document", 
        "metadata": {"source": "web", "category": "news"}
    }
])

# Image insertion (assumes a table `images` with a pxt.Image column)
image_urls = [
    'https://example.com/image1.jpg',
    'https://example.com/image2.jpg'
]
images.insert({'image': url} for url in image_urls)

Batch insertions are more efficient than repeated single-row insertions: the embeddings for all new rows are computed in a single operation, which reduces per-call overhead.
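The effect can be sketched with a hypothetical stub embedder that charges a fixed overhead per call (the `StubEmbedder` class below is illustrative, not part of Pixeltable's API):

```python
# Illustrative sketch (not Pixeltable code): a stub embedder where each
# call carries fixed overhead (model dispatch, I/O), so one batched call
# beats many single-row calls.
class StubEmbedder:
    def __init__(self):
        self.calls = 0

    def embed(self, texts):
        self.calls += 1
        return [[float(len(t))] for t in texts]  # dummy 1-d "embeddings"

docs_to_index = ["first doc", "second doc", "third doc"]

one_by_one = StubEmbedder()
for d in docs_to_index:
    one_by_one.embed([d])       # 3 calls -> 3x fixed overhead

batched = StubEmbedder()
batched.embed(docs_to_index)    # 1 call -> 1x fixed overhead

print(one_by_one.calls, batched.calls)  # 3 1
```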

Phase 3: Query

The query phase allows you to search your indexed content using the similarity() function.

@pxt.query
def semantic_search(query: str, k: int = 5):
    # Calculate similarity scores
    sim = docs.content.similarity(query)
    
    # Return top-k most similar documents
    return (
        docs
        .order_by(sim, asc=False)
        .select(docs.content, docs.metadata, score=sim)
        .limit(k)
    )

# Use it
results = semantic_search("quantum computing")
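Conceptually, the query above performs a top-k ranking: score every row, sort by score descending, and keep the first k. A plain-Python sketch of that logic:

```python
def top_k(rows, scores, k):
    # Pair each row with its similarity score, sort by score descending,
    # and keep the k best matches.
    ranked = sorted(zip(scores, rows), key=lambda p: p[0], reverse=True)
    return [(row, score) for score, row in ranked[:k]]

rows = ["doc a", "doc b", "doc c", "doc d"]
scores = [0.12, 0.87, 0.55, 0.91]

print(top_k(rows, scores, 2))  # [('doc d', 0.91), ('doc b', 0.87)]
```

In Pixeltable, the scoring and sorting happen inside the engine via `similarity()` and `order_by()`, so the full embedding table never needs to be materialized in your code.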

Advanced Query Patterns

Similarity search composes with standard filters. For example, to rank only the documents in a given category (using the docs table and index created above):

sim = docs.content.similarity('quantum computing')
results = (
    docs
    .where(docs.metadata['category'] == 'science')
    .order_by(sim, asc=False)
    .select(docs.content, score=sim)
    .limit(5)
    .collect()
)

Management Operations

Drop Index

# Drop by name (the idx_name supplied when the index was created)
docs.drop_embedding_index(idx_name='custom_name')

# Drop by column (if single index)
docs.drop_embedding_index(column='content')

Update Index

# Indexes auto-update on changes
docs.update({
    'content': docs.content + ' Updated!'
})

Best Practices

  • Cache embedding models in production UDFs
  • Use batching for better performance
  • Consider index size vs. search speed tradeoffs
  • Monitor embedding computation time

Additional Resources