This documentation page is also available as an interactive notebook. You can launch the notebook in
Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the
above links.
Create a searchable knowledge base that finds content by meaning, not
just keywords.
Problem
You have a collection of text content (articles, notes, documentation)
and need to find relevant items based on meaning.
Keyword search fails when users phrase queries differently from the
source text:
Solution
What’s in this recipe:
- Create a text table with embeddings
- Search by semantic similarity
- Combine with metadata filters
You add an embedding index to your text column. Pixeltable automatically
generates embeddings for each row and enables similarity search.
Setup
%pip install -qU pixeltable sentence-transformers
import pixeltable as pxt
from pixeltable.functions.huggingface import sentence_transformer
Create knowledge base
# Create a fresh directory
pxt.drop_dir('search_demo', force=True)
pxt.create_dir('search_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘search_demo’.
<pixeltable.catalog.dir.Dir at 0x14208ca10>
# Create table with content and metadata
kb = pxt.create_table('search_demo.articles', {
'title': pxt.String,
'content': pxt.String,
'category': pxt.String
})
Created table ‘articles’.
# Insert sample content
kb.insert([
{'title': 'Debugging best practices',
'content': 'Use logging, breakpoints, and unit tests to identify and fix issues in your code.',
'category': 'engineering'},
{'title': 'Machine learning model optimization',
'content': 'Improve training efficiency with batch normalization, learning rate schedules, and early stopping.',
'category': 'ml'},
{'title': 'Production infrastructure setup',
'content': 'Deploy applications using containers, load balancers, and automated scaling.',
'category': 'devops'},
{'title': 'API design principles',
'content': 'Create RESTful endpoints with proper versioning, authentication, and error handling.',
'category': 'engineering'},
])
Inserting rows into `articles`: 4 rows [00:00, 577.69 rows/s]
Inserted 4 rows with 0 errors.
4 rows inserted, 12 values computed.
Add semantic search
Create an embedding index on the content column:
# Add embedding index
kb.add_embedding_index(
column='content',
string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2')
)
Search by meaning
Find content semantically similar to your query:
# Search by meaning
query = "how to fix bugs"
sim = kb.content.similarity(query)
results = (
kb
.order_by(sim, asc=False)
.select(kb.title, kb.content, score=sim)
.limit(2)
)
results.collect()
Combine semantic search with metadata filters:
# Search within a specific category
query = "best practices"
sim = kb.content.similarity(query)
results = (
kb
.where(kb.category == 'engineering') # Filter first
.order_by(sim, asc=False)
.select(kb.title, kb.category, score=sim)
.limit(2)
)
results.collect()
Explanation
How similarity search works:
- Your query is converted to an embedding vector
- Pixeltable finds the most similar vectors in the index
- Results are ranked by cosine similarity (0 to 1)
Embedding models:
New content is indexed automatically:
When you insert new rows, embeddings are generated without extra code.
See also