Generate vector embeddings for text data to enable semantic search and similarity matching.

Problem

You need to convert text into vector embeddings for:
  • Semantic search (find similar documents)
  • RAG pipelines (retrieve relevant context)
  • Clustering and classification

Solution

What’s in this recipe:
  • Generate embeddings with OpenAI’s models
  • Store embeddings as computed columns
  • Use embeddings for similarity queries
The approach is to add a computed embedding column, which generates a vector automatically for every new row. The embeddings are cached and recomputed only when the source text changes.
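The caching behavior described above can be pictured with a small plain-Python sketch. It is a conceptual illustration only and says nothing about how Pixeltable is implemented: `ToyTable`, `fake_embed`, and the call counter are hypothetical names invented here, and the "embedding" is a stand-in function rather than a real model call.

```python
# Conceptual sketch only: ToyTable and fake_embed are hypothetical, not Pixeltable APIs.
calls = 0  # counts how many times the (fake) embedding function runs


def fake_embed(text: str) -> list[float]:
    """Stand-in for a model call; returns a tiny deterministic vector."""
    global calls
    calls += 1
    return [float(len(text)), float(text.count(' '))]


class ToyTable:
    """Stores each row's text together with its computed embedding."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def insert(self, key: str, text: str) -> None:
        # The embedding is computed once, at insert time.
        self.rows[key] = {'content': text, 'embedding': fake_embed(text)}

    def update(self, key: str, text: str) -> None:
        # Recompute only if the source text actually changed.
        row = self.rows[key]
        if row['content'] != text:
            row['content'] = text
            row['embedding'] = fake_embed(text)

    def get_embedding(self, key: str) -> list[float]:
        # Reads return the stored vector; nothing is recomputed here.
        return self.rows[key]['embedding']


table = ToyTable()
table.insert('a', 'hello world')
table.get_embedding('a')
table.get_embedding('a')
calls_after_reads = calls               # reads do not trigger the embed function
table.update('a', 'hello world')        # unchanged text: no recompute
calls_after_noop = calls
table.update('a', 'hello there world')  # changed text: recompute
calls_after_change = calls
```

The counter makes the point visible: only the insert and the real text change invoke the embedding function, which is the property that keeps API costs down when a table is queried repeatedly.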

Setup

%pip install -qU pixeltable openai

import os
import getpass

# Prompt for the API key only if it is not already set in the environment
if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')

import pixeltable as pxt
from pixeltable.functions.openai import embeddings

# Create a fresh directory (drops any existing 'embed_demo' and its contents)
pxt.drop_dir('embed_demo', force=True)
pxt.create_dir('embed_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘embed_demo’.
<pixeltable.catalog.dir.Dir at 0x14ee4fcd0>

Create table with embedding column

# Create table for documents
docs = pxt.create_table(
    'embed_demo.documents',
    {'title': pxt.String, 'content': pxt.String}
)
Created table ‘documents’.
# Add embedding column using OpenAI's text-embedding-3-small
docs.add_computed_column(
    embedding=embeddings(docs.content, model='text-embedding-3-small')
)
Added 0 column values with 0 errors.
No rows affected.

Insert documents

# Insert sample documents
sample_docs = [
    {'title': 'Python Basics', 'content': 'Python is a high-level programming language known for its clear syntax and readability.'},
    {'title': 'Machine Learning', 'content': 'Machine learning is a subset of AI that enables systems to learn from data.'},
    {'title': 'Web Development', 'content': 'Web development involves building websites and web applications using HTML, CSS, and JavaScript.'},
    {'title': 'Data Science', 'content': 'Data science combines statistics, programming, and domain expertise to extract insights from data.'},
    {'title': 'Cloud Computing', 'content': 'Cloud computing provides on-demand computing resources over the internet.'},
]

docs.insert(sample_docs)
Inserting rows into `documents`: 5 rows [00:00, 553.22 rows/s]
Inserted 5 rows with 0 errors.
5 rows inserted, 15 values computed.
# View documents with embeddings (showing first 5 dimensions)
result = docs.select(docs.title, docs.embedding).collect()
for row in result:
    print(row['title'], row['embedding'][:5])

Query by similarity

Find documents similar to a query by creating an embedding index on the content column. The index is what makes the similarity() call on that column available:
# Add embedding index for semantic search
docs.add_embedding_index(
    column="content",
    string_embed=embeddings.using(model="text-embedding-3-small")
)
# Search for similar documents
sim = docs.content.similarity("artificial intelligence applications")
results = (
    docs.where(sim > 0.2)
    .order_by(sim, asc=False)
    .limit(3)
    .select(docs.title, docs.content, sim=sim)
)
results.collect()
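To make the query's logic concrete, here is a plain-Python sketch of the same three steps (filter by a score threshold, sort by descending score, keep the top results) over hand-written toy vectors. It assumes cosine similarity as the score and uses made-up 3-dimensional vectors; `cosine_similarity`, `top_matches`, and the data are illustrative names, not Pixeltable APIs, and a real index would avoid scoring every row.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query_vec, rows, threshold=0.2, limit=3):
    """Score every row, keep scores above the threshold, return the best `limit`."""
    scored = [(title, cosine_similarity(query_vec, vec)) for title, vec in rows]
    kept = [(title, score) for title, score in scored if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Toy vectors, invented for illustration (real embeddings have many more dimensions).
toy_rows = [
    ('Machine Learning', [0.9, 0.1, 0.0]),
    ('Data Science', [0.7, 0.3, 0.1]),
    ('Web Development', [0.0, 0.2, 0.9]),
    ('Cloud Computing', [0.1, 0.9, 0.2]),
]
query = [1.0, 0.0, 0.0]

matches = top_matches(query, toy_rows)
for title, score in matches:
    print(f'{title}: {score:.3f}')
```

With these toy values only two rows clear the threshold, so fewer than `limit` results come back. The same can happen with the real query when few documents score above 0.2.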

Explanation

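OpenAI embedding models:
  • text-embedding-3-small: 1536 dimensions by default (used in this recipe)
  • text-embedding-3-large: 3072 dimensions by default
  • text-embedding-ada-002: 1536 dimensions (older generation)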
Similarity metrics:
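Three metrics commonly used to compare embedding vectors are cosine similarity, inner product, and Euclidean (L2) distance. The sketch below, in plain Python with invented helper names, checks a useful identity: for unit-length vectors the inner product equals the cosine similarity, and the squared L2 distance equals 2 - 2*cosine, so all three produce the same ranking. OpenAI documents its embeddings as normalized to length 1; the vectors here are toy values normalized by hand.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


def cosine(a, b):
    """Cosine similarity: inner product divided by both lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors, normalized by hand to mimic unit-length embeddings.
u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

cos_uv = cosine(u, v)
ip_uv = dot(u, v)
l2_uv = l2_distance(u, v)

# For unit vectors: inner product == cosine, and L2^2 == 2 - 2 * cosine.
print(f'cosine={cos_uv:.4f} inner_product={ip_uv:.4f} l2={l2_uv:.4f}')
print(math.isclose(ip_uv, cos_uv), math.isclose(l2_uv ** 2, 2 - 2 * cos_uv))
```

Note the direction of each score: higher is better for cosine and inner product, lower is better for L2 distance.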
Key benefits of computed embedding columns:
  • Embeddings are generated automatically on insert
  • Results are cached, so subsequent queries do not recompute them
  • Index enables fast similarity search at scale
