Learn more about embedding/vector indexes with this in-depth guide.

What are Embedding/Vector Indexes?

Embedding indexes let you search your data based on meaning, not just keywords. They work with all kinds of content - text, images, audio, video, and documents - making it easy to build powerful search systems.

Multimodal Search Examples

Pixeltable makes it easy to build semantic search for different media types:
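For instance, images can be indexed with a CLIP model and then searched with natural-language queries. The snippet below is a minimal sketch; it assumes the clip embedding function from pixeltable.functions.huggingface and the openai/clip-vit-base-patch32 model, and the table and query are illustrative:

import pixeltable as pxt
from pixeltable.functions.huggingface import clip

# Table with a single image column
images = pxt.create_table('images', {'image': pxt.Image})

# Index the images with CLIP; the same model embeds both text and images,
# which is what makes text-to-image search possible
clip_model = clip.using(model_id='openai/clip-vit-base-patch32')
images.add_embedding_index(
    column='image',
    string_embed=clip_model,
    image_embed=clip_model
)

# Cross-modal search: find images matching a text description
sim = images.image.similarity('a red sports car')
results = images.order_by(sim, asc=False).limit(3).collect()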

How Pixeltable Makes Embeddings Easy

  • No infrastructure headaches - embeddings are managed automatically
  • Works with any media type - text, images, audio, video, or documents
  • Updates automatically - when data changes, embeddings update too
  • Compatible with your favorite models - use Hugging Face, OpenAI, or your custom models

Phase 1: Set Up the Embedding Model and Index

The setup phase defines your schema and creates embedding indexes.

pip install pixeltable sentence-transformers

import pixeltable as pxt
from pixeltable.functions.huggingface import sentence_transformer

# Create a directory to organize data (optional);
# drop_dir(force=True) removes any existing 'knowledge_base' directory first
pxt.drop_dir('knowledge_base', force=True)
pxt.create_dir('knowledge_base')

# Create table
docs = pxt.create_table(
    "knowledge_base.documents",
    {
        "content": pxt.String,
        "metadata": pxt.Json
    }
)

# Create embedding index
embed_model = sentence_transformer.using(
    model_id="intfloat/e5-large-v2"
)
docs.add_embedding_index(
    column='content',
    string_embed=embed_model
)

Supported Index Options

Similarity Metrics

# Available metrics (cosine is the default):
docs.add_embedding_index(
    column='content',
    string_embed=embed_model,
    metric='cosine'
    # Other options:
    # metric='ip'    # Inner product
    # metric='l2'    # L2 distance
)

Index Configuration

# Optional parameters
docs.add_embedding_index(
    column='content',
    idx_name='custom_name',    # Explicit index name (otherwise auto-generated)
    string_embed=embed_model,  # Embedding function for string data
    image_embed=img_model,     # Embedding function for image columns (e.g., a CLIP model)
)

Phase 2: Insert

The insert phase populates your indexes with data. Pixeltable automatically computes embeddings and maintains index consistency.

# Single insertion
docs.insert([
    {
        "content": "Your document text here",
        "metadata": {"source": "web", "category": "tech"}
    }
])

# Batch insertion
docs.insert([
    {
        "content": "First document",
        "metadata": {"source": "pdf", "category": "science"}
    },
    {
        "content": "Second document", 
        "metadata": {"source": "web", "category": "news"}
    }
])

# Image insertion ('images' is a separate table with a pxt.Image column)
image_urls = [
    'https://example.com/image1.jpg',
    'https://example.com/image2.jpg'
]
images.insert({'image': url} for url in image_urls)

Large batch insertions are more efficient than many single-row insertions because Pixeltable can compute the embeddings for the whole batch at once instead of invoking the embedding model separately for each insert call.

Phase 3: Query

The query phase allows you to search your indexed content using the similarity() function.
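For example, a nearest-neighbor search over the documents table created above looks like this (the query string is illustrative):

# Build a similarity expression against the indexed column
sim = docs.content.similarity('introduction to vector databases')

# Return the 5 closest matches, highest similarity first
results = (
    docs.order_by(sim, asc=False)
    .select(docs.content, docs.metadata, score=sim)
    .limit(5)
    .collect()
)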

Direct Embedding Access

Pixeltable allows direct access to the raw embedding vectors through the .embedding() method. This feature lets you retrieve the actual vector representations that power similarity search.

# Access embeddings from a column with a single index
results = docs.select(
    docs.content,
    embedding=docs.content.embedding()
).limit(5).collect()

# Access embeddings from a column with multiple indexes
results = docs.select(
    docs.content,
    embedding=docs.content.embedding(idx='custom_idx_name')
).limit(5).collect()

# Embeddings are returned as numpy arrays
import numpy as np
assert isinstance(results[0, 'embedding'], np.ndarray)

# You can also store embeddings in a computed column
docs.add_computed_column(
    embedding_copy=docs.content.embedding()
)

Keep these limitations in mind:

  • The .similarity() method cannot be used directly in computed columns
  • Embedding indexes cannot be dropped while computed columns depend on them
  • When a column has multiple embedding indexes, you must specify which index to use with the idx parameter (see the sketch after this list)
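When a column carries more than one index, the same idx parameter can be passed to similarity(); a minimal sketch, assuming the 'custom_idx_name' index from the example above:

# Search against a specific index on the content column
sim = docs.content.similarity('vector search', idx='custom_idx_name')
results = docs.order_by(sim, asc=False).limit(5).collect()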

Management Operations

Drop Index

# Drop by name
docs.drop_embedding_index(idx_name='custom_name')

# Drop by column (works only when the column has a single index)
docs.drop_embedding_index(column='content')

Update Index

# Indexes auto-update on changes
docs.update({
    'content': docs.content + ' Updated!'
})

Best Practices

  • Cache embedding models in production UDFs (see the sketch after this list)
  • Use batching for better performance
  • Consider index size vs. search speed tradeoffs
  • Monitor embedding computation time
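A minimal sketch of the caching point, assuming Pixeltable's @pxt.udf decorator and pxt.Array column type along with the sentence-transformers package (the model name and the 1024-dimensional output are illustrative):

import pixeltable as pxt
from sentence_transformers import SentenceTransformer

_model = None  # loaded once per process, then reused across calls

@pxt.udf
def cached_embed(text: str) -> pxt.Array[(1024,), pxt.Float]:
    # Lazily load the model the first time the UDF runs, then cache it
    global _model
    if _model is None:
        _model = SentenceTransformer('intfloat/e5-large-v2')
    return _model.encode(text)

# The cached UDF can then back an embedding index like any other embedding function
docs.add_embedding_index(column='content', string_embed=cached_embed)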

Additional Resources