Learn more about embedding/vector indexes with this in-depth guide.

What are Embedding/Vector Indexes?

Embedding indexes let you search your data based on meaning, not just keywords. They work with all kinds of content - text, images, audio, video, and documents - making it easy to build powerful search systems.

Multimodal Search Examples

Pixeltable makes it easy to build semantic search for different media types:
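For instance, images can be indexed with a CLIP model and then searched with natural-language queries. The snippet below is a minimal sketch; it assumes the clip embedding function from pixeltable.functions.huggingface and the openai/clip-vit-base-patch32 model, and the table and query are illustrative:

import pixeltable as pxt
from pixeltable.functions.huggingface import clip

# Table with a single image column
images = pxt.create_table('images', {'image': pxt.Image})

# Index the images with CLIP; the same model embeds both text and images,
# which is what makes text-to-image search possible
clip_model = clip.using(model_id='openai/clip-vit-base-patch32')
images.add_embedding_index(
    column='image',
    string_embed=clip_model,
    image_embed=clip_model
)

# Cross-modal search: find images matching a text description
sim = images.image.similarity('a red sports car')
results = images.order_by(sim, asc=False).limit(3).collect()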

How Pixeltable Makes Embeddings Easy

  • No infrastructure headaches - embeddings are managed automatically
  • Works with any media type - text, images, audio, video, or documents
  • Updates automatically - when data changes, embeddings update too
  • Compatible with your favorite models - use Hugging Face, OpenAI, or your custom models

Phase 1: Set Up the Embedding Model and Index

The setup phase defines your schema and creates embedding indexes.

pip install pixeltable sentence-transformers

import pixeltable as pxt
from pixeltable.functions.huggingface import sentence_transformer

# Create a directory to organize data (optional);
# drop_dir(force=True) removes any existing 'knowledge_base' directory first
pxt.drop_dir('knowledge_base', force=True)
pxt.create_dir('knowledge_base')

# Create table
docs = pxt.create_table(
    "knowledge_base.documents",
    {
        "content": pxt.String,
        "metadata": pxt.Json
    }
)

# Create embedding index
embed_model = sentence_transformer.using(
    model_id="intfloat/e5-large-v2"
)
docs.add_embedding_index(
    column='content',
    string_embed=embed_model
)

Supported Index Options

Similarity Metrics

# Available metrics (cosine is the default):
docs.add_embedding_index(
    column='content',
    string_embed=embed_model,
    metric='cosine'
    # Other options:
    # metric='ip'    # Inner product
    # metric='l2'    # L2 distance
)

Index Configuration

# Optional parameters
docs.add_embedding_index(
    column='content',
    idx_name='custom_name',    # Explicit index name (otherwise auto-generated)
    string_embed=embed_model,  # Embedding function for string data
    image_embed=img_model,     # Embedding function for image columns (e.g., a CLIP model)
)

Phase 2: Insert

The insert phase populates your indexes with data. Pixeltable automatically computes embeddings and maintains index consistency.

# Single insertion
docs.insert([
    {
        "content": "Your document text here",
        "metadata": {"source": "web", "category": "tech"}
    }
])

# Batch insertion
docs.insert([
    {
        "content": "First document",
        "metadata": {"source": "pdf", "category": "science"}
    },
    {
        "content": "Second document", 
        "metadata": {"source": "web", "category": "news"}
    }
])

# Image insertion ('images' is a separate table with a pxt.Image column)
image_urls = [
    'https://example.com/image1.jpg',
    'https://example.com/image2.jpg'
]
images.insert({'image': url} for url in image_urls)

Large batch insertions are more efficient than many single-row insertions because Pixeltable can compute the embeddings for the whole batch at once instead of invoking the embedding model separately for each insert call.

Phase 3: Query

The query phase allows you to search your indexed content using the similarity() function.
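For example, a nearest-neighbor search over the documents table created above looks like this (the query string is illustrative):

# Build a similarity expression against the indexed column
sim = docs.content.similarity('introduction to vector databases')

# Return the 5 closest matches, highest similarity first
results = (
    docs.order_by(sim, asc=False)
    .select(docs.content, docs.metadata, score=sim)
    .limit(5)
    .collect()
)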

Direct Embedding Access

Pixeltable allows direct access to the raw embedding vectors through the .embedding() method. This feature lets you retrieve the actual vector representations that power similarity search.

# Access embeddings from a column with a single index
results = docs.select(
    docs.content,
    embedding=docs.content.embedding()
).limit(5).collect()

# Access embeddings from a column with multiple indexes
results = docs.select(
    docs.content,
    embedding=docs.content.embedding(idx='custom_idx_name')
).limit(5).collect()

# Embeddings are returned as numpy arrays
import numpy as np
assert isinstance(results[0, 'embedding'], np.ndarray)

# You can also store embeddings in a computed column
docs.add_computed_column(
    embedding_copy=docs.content.embedding()
)

Keep these limitations in mind:

  • The .similarity() method cannot be used directly in computed columns
  • Embedding indexes cannot be dropped while computed columns depend on them
  • When a column has multiple embedding indexes, you must specify which index to use with the idx parameter (see the sketch after this list)
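When a column carries more than one index, the same idx parameter can be passed to similarity(); a minimal sketch, assuming the 'custom_idx_name' index from the example above:

# Search against a specific index on the content column
sim = docs.content.similarity('vector search', idx='custom_idx_name')
results = docs.order_by(sim, asc=False).limit(5).collect()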

Management Operations

Drop Index

# Drop by name
docs.drop_embedding_index(idx_name='custom_name')

# Drop by column (works only when the column has a single index)
docs.drop_embedding_index(column='content')

Update Index

# Indexes auto-update on changes
docs.update({
    'content': docs.content + ' Updated!'
})

Best Practices

  • Cache embedding models in production UDFs (see the sketch after this list)
  • Use batching for better performance
  • Consider index size vs. search speed tradeoffs
  • Monitor embedding computation time
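A minimal sketch of the caching point, assuming Pixeltable's @pxt.udf decorator and pxt.Array column type along with the sentence-transformers package (the model name and the 1024-dimensional output are illustrative):

import pixeltable as pxt
from sentence_transformers import SentenceTransformer

_model = None  # loaded once per process, then reused across calls

@pxt.udf
def cached_embed(text: str) -> pxt.Array[(1024,), pxt.Float]:
    # Lazily load the model the first time the UDF runs, then cache it
    global _model
    if _model is None:
        _model = SentenceTransformer('intfloat/e5-large-v2')
    return _model.encode(text)

# The cached UDF can then back an embedding index like any other embedding function
docs.add_embedding_index(column='content', string_embed=cached_embed)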

Additional Resources