Skip to main content
Open in Kaggle  Open in Colab  Download Notebook
This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.
Create a searchable knowledge base that finds content by meaning, not just keywords.

Problem

You have a collection of text content (articles, notes, documentation) and need to find relevant items based on meaning. Keyword search fails when users phrase queries differently from the source text:

Solution

What’s in this recipe:
  • Create a text table with embeddings
  • Search by semantic similarity
  • Combine with metadata filters
You add an embedding index to your text column. Pixeltable automatically generates embeddings for each row and enables similarity search.

Setup

%pip install -qU pixeltable sentence-transformers
import pixeltable as pxt
from pixeltable.functions.huggingface import sentence_transformer

Create knowledge base

# Create a fresh directory
pxt.drop_dir('search_demo', force=True)
pxt.create_dir('search_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘search_demo’.
<pixeltable.catalog.dir.Dir at 0x14208ca10>
# Create table with content and metadata
kb = pxt.create_table('search_demo.articles', {
    'title': pxt.String,
    'content': pxt.String,
    'category': pxt.String
})
Created table ‘articles’.
# Insert sample content
kb.insert([
    {'title': 'Debugging best practices',
     'content': 'Use logging, breakpoints, and unit tests to identify and fix issues in your code.',
     'category': 'engineering'},
    {'title': 'Machine learning model optimization',
     'content': 'Improve training efficiency with batch normalization, learning rate schedules, and early stopping.',
     'category': 'ml'},
    {'title': 'Production infrastructure setup',
     'content': 'Deploy applications using containers, load balancers, and automated scaling.',
     'category': 'devops'},
    {'title': 'API design principles',
     'content': 'Create RESTful endpoints with proper versioning, authentication, and error handling.',
     'category': 'engineering'},
])
Inserting rows into `articles`: 4 rows [00:00, 577.69 rows/s]
Inserted 4 rows with 0 errors.
4 rows inserted, 12 values computed.
Create an embedding index on the content column:
# Add embedding index
kb.add_embedding_index(
    column='content',
    string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2')
)

Search by meaning

Find content semantically similar to your query:
# Search by meaning
query = "how to fix bugs"
sim = kb.content.similarity(query)

results = (
    kb
    .order_by(sim, asc=False)
    .select(kb.title, kb.content, score=sim)
    .limit(2)
)
results.collect()

Filter by metadata

Combine semantic search with metadata filters:
# Search within a specific category
query = "best practices"
sim = kb.content.similarity(query)

results = (
    kb
    .where(kb.category == 'engineering')  # Filter first
    .order_by(sim, asc=False)
    .select(kb.title, kb.category, score=sim)
    .limit(2)
)
results.collect()

Explanation

How similarity search works:
  1. Your query is converted to an embedding vector
  2. Pixeltable finds the most similar vectors in the index
  3. Results are ranked by cosine similarity (0 to 1)
Embedding models:
New content is indexed automatically: When you insert new rows, embeddings are generated without extra code.

See also