Learn more about embedding/vector indexes with this in-depth guide.

What are Embedding/Vector Indexes?

Embedding indexes let you search your data based on meaning, not just keywords. They work with all kinds of content - text, images, audio, video, and documents - making it easy to build powerful search systems.

Multimodal Search Examples

Pixeltable makes it easy to build semantic search for different media types:

How Pixeltable Makes Embeddings Easy

  • No infrastructure headaches - embeddings are managed automatically
  • Works with any media type - text, images, audio, video, or documents
  • Updates automatically - when data changes, embeddings update too
  • Compatible with your favorite models - use Hugging Face, OpenAI, or your custom models

Phase 1: Setup Embeddings Model and Index

The setup phase defines your schema and creates embedding indexes.

pip install pixeltable sentence-transformers

import pixeltable as pxt
from pixeltable.functions.huggingface import sentence_transformer

# Create a directory to organize data (optional)
pxt.drop_dir('knowledge_base', force=True)
pxt.create_dir('knowledge_base')

# Create table
docs = pxt.create_table(
    "knowledge_base.documents",
    {
        "content": pxt.String,
        "metadata": pxt.Json
    }
)

# Create embedding index
embed_model = sentence_transformer.using(
    model_id="intfloat/e5-large-v2"
)
docs.add_embedding_index(
    column='content',
    string_embed=embed_model
)

Supported Index Options

Similarity Metrics

# Available metrics (an embedding function is still required):
docs.add_embedding_index(
    column='content',
    string_embed=embed_model,
    metric='cosine'  # Default
    # Other options:
    # metric='ip'    # Inner product
    # metric='l2'    # L2 distance
)

Index Configuration

# Optional parameters
docs.add_embedding_index(
    column='content',
    idx_name='custom_name',  # Optional custom index name
    string_embed=embed_model,
    image_embed=img_model,   # Only for image columns (e.g., a CLIP model)
)

Phase 2: Insert

The insert phase populates your indexes with data. Pixeltable automatically computes embeddings and maintains index consistency.

# Single insertion
docs.insert([
    {
        "content": "Your document text here",
        "metadata": {"source": "web", "category": "tech"}
    }
])

# Batch insertion
docs.insert([
    {
        "content": "First document",
        "metadata": {"source": "pdf", "category": "science"}
    },
    {
        "content": "Second document", 
        "metadata": {"source": "web", "category": "news"}
    }
])

# Image insertion (assumes an `images` table with an `image` column)
image_urls = [
    'https://example.com/image1.jpg',
    'https://example.com/image2.jpg'
]
images.insert({'image': url} for url in image_urls)

A single batched insert is more efficient than many single-row inserts: embeddings can be computed in batches, and index maintenance overhead is amortized across rows.

Phase 3: Query

The query phase allows you to search your indexed content using the similarity() function.

Management Operations

Drop Index

# Drop by name
docs.drop_embedding_index(idx_name='e5_idx')

# Drop by column (if single index)
docs.drop_embedding_index(column='content')

Update Index

# Indexes auto-update on changes
docs.update({
    'content': docs.content + ' Updated!'
})

Best Practices

  • Cache embedding models in production UDFs
  • Use batching for better performance
  • Consider index size vs. search speed tradeoffs
  • Monitor embedding computation time
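The first bullet, caching the embedding model, can be sketched with functools.lru_cache. The loader body below is a placeholder; in production it would construct the actual model (e.g., a SentenceTransformer):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_embed_model():
    # Placeholder for an expensive load, e.g.:
    #   from sentence_transformers import SentenceTransformer
    #   return SentenceTransformer('intfloat/e5-large-v2')
    return object()

# The model is constructed once per process; later calls reuse it
first = get_embed_model()
second = get_embed_model()
assert first is second
```

This keeps per-call latency low in UDFs that would otherwise reload the model on every invocation.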

Additional Resources