# Build semantic search for text

> Build semantic text search in Pixeltable with embedding indices, similarity queries, and top-k retrieval over documents and chunks.

<a href="https://kaggle.com/kernels/welcome?src=https://github.com/pixeltable/pixeltable/blob/release/docs/release/howto/cookbooks/search/search-semantic-text.ipynb" id="openKaggle" target="_blank" rel="noopener noreferrer"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" alt="Open in Kaggle" style={{ display: 'inline', margin: '0px' }} noZoom /></a>  <a href="https://colab.research.google.com/github/pixeltable/pixeltable/blob/release/docs/release/howto/cookbooks/search/search-semantic-text.ipynb" id="openColab" target="_blank" rel="noopener noreferrer"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" style={{ display: 'inline', margin: '0px' }} noZoom /></a>  <a href="https://raw.githubusercontent.com/pixeltable/pixeltable/refs/tags/release/docs/release/howto/cookbooks/search/search-semantic-text.ipynb" id="downloadNotebook" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/%E2%AC%87-Download%20Notebook-blue" alt="Download Notebook" style={{ display: 'inline', margin: '0px' }} noZoom /></a>

<Tip>This documentation page is also available as an interactive notebook. You can launch the notebook in
Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the
above links.</Tip>

export const quartoRawHtml = [`
<table>
<colgroup>
<col style="width: 18%" />
<col style="width: 39%" />
<col style="width: 42%" />
</colgroup>
<thead>
<tr>
<th>Query</th>
<th>Keyword match</th>
<th>Semantic match</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">“how to fix bugs”</td>
<td style="vertical-align: middle;">❌ No results</td>
<td style="vertical-align: middle;">✓ “Debugging best practices”</td>
</tr>
<tr>
<td style="vertical-align: middle;">“ML training”</td>
<td style="vertical-align: middle;">❌ No results</td>
<td style="vertical-align: middle;">✓ “Machine learning model optimization”</td>
</tr>
<tr>
<td style="vertical-align: middle;">“deploy to cloud”</td>
<td style="vertical-align: middle;">❌ No results</td>
<td style="vertical-align: middle;">✓ “Production infrastructure setup”</td>
</tr>
</tbody>
</table>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">title</th>
<th data-quarto-table-cell-role="th">content</th>
<th data-quarto-table-cell-role="th">score</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">Debugging best practices</td>
<td style="vertical-align: middle;">Use logging, breakpoints, and unit tests to identify and fix issues
in your code.</td>
<td style="vertical-align: middle;">0.391</td>
</tr>
<tr>
<td style="vertical-align: middle;">API design principles</td>
<td style="vertical-align: middle;">Create RESTful endpoints with proper versioning, authentication, and
error handling.</td>
<td style="vertical-align: middle;">0.186</td>
</tr>
</tbody>
</table>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">title</th>
<th data-quarto-table-cell-role="th">category</th>
<th data-quarto-table-cell-role="th">score</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">API design principles</td>
<td style="vertical-align: middle;">engineering</td>
<td style="vertical-align: middle;">0.238</td>
</tr>
<tr>
<td style="vertical-align: middle;">Debugging best practices</td>
<td style="vertical-align: middle;">engineering</td>
<td style="vertical-align: middle;">0.157</td>
</tr>
</tbody>
</table>
`, `
<table>
<thead>
<tr>
<th>Model</th>
<th>Speed</th>
<th>Quality</th>
<th>Use case</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;"><code>all-MiniLM-L6-v2</code></td>
<td style="vertical-align: middle;">Fast</td>
<td style="vertical-align: middle;">Good</td>
<td style="vertical-align: middle;">General text</td>
</tr>
<tr>
<td style="vertical-align: middle;"><code>all-mpnet-base-v2</code></td>
<td style="vertical-align: middle;">Medium</td>
<td style="vertical-align: middle;">Better</td>
<td style="vertical-align: middle;">Higher accuracy</td>
</tr>
<tr>
<td style="vertical-align: middle;">OpenAI <code>text-embedding-3-small</code></td>
<td style="vertical-align: middle;">Varies (API call)</td>
<td style="vertical-align: middle;">Best</td>
<td style="vertical-align: middle;">Production apps</td>
</tr>
</tbody>
</table>
`];

Create a searchable knowledge base that finds content by meaning, not
just keywords.

## Problem

You have a collection of text content (articles, notes, documentation)
and need to find relevant items based on meaning.

Keyword search fails when users phrase queries differently from the
source text:

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[0] }} />
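To make the failure mode concrete, here is a minimal sketch of a naive keyword matcher, assuming an all-terms match over title and content. The `keyword_search` and `tokenize` helpers are hypothetical illustrations and not part of Pixeltable; the documents are the sample articles inserted later in this recipe.

```python
import re

# Sample articles (same titles and content as the rows inserted later in this recipe)
ARTICLES = [
    ('Debugging best practices',
     'Use logging, breakpoints, and unit tests to identify and fix issues in your code.'),
    ('Machine learning model optimization',
     'Improve training efficiency with batch normalization, learning rate schedules, and early stopping.'),
    ('Production infrastructure setup',
     'Deploy applications using containers, load balancers, and automated scaling.'),
    ('API design principles',
     'Create RESTful endpoints with proper versioning, authentication, and error handling.'),
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r'[a-z0-9]+', text.lower()))


def keyword_search(query, articles=ARTICLES):
    """Hypothetical keyword matcher: return titles whose text contains every query word."""
    terms = tokenize(query)
    return [title for title, content in articles if terms <= tokenize(title + ' ' + content)]


# Each query from the table shares too few exact words with any article to match
for query in ['how to fix bugs', 'ML training', 'deploy to cloud']:
    print(query, '->', keyword_search(query))
```

Under this assumption every query comes back empty, even though a relevant article exists for each one. A query that reuses the source wording, such as `'unit tests'`, does match.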

## Solution

**What’s in this recipe:**

* Create a text table with embeddings
* Search by semantic similarity
* Combine with metadata filters

Add an embedding index to your text column. Pixeltable then generates an
embedding for each row automatically and uses the index to answer similarity queries.

### Setup

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
%pip install -qU pixeltable sentence-transformers
```

### Create knowledge base

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
import pixeltable as pxt
from pixeltable.functions.huggingface import sentence_transformer

# Create a fresh directory
pxt.drop_dir('search_demo', force=True)
pxt.create_dir('search_demo')
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'search\_demo'.
  \<pixeltable.catalog.dir.Dir at 0x14208ca10>
</pre>

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Create table with content and metadata
kb = pxt.create_table(
    'search_demo/articles',
    {'title': pxt.String, 'content': pxt.String, 'category': pxt.String},
)
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Created table 'articles'.
</pre>

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Insert sample content
kb.insert(
    [
        {
            'title': 'Debugging best practices',
            'content': 'Use logging, breakpoints, and unit tests to identify and fix issues in your code.',
            'category': 'engineering',
        },
        {
            'title': 'Machine learning model optimization',
            'content': 'Improve training efficiency with batch normalization, learning rate schedules, and early stopping.',
            'category': 'ml',
        },
        {
            'title': 'Production infrastructure setup',
            'content': 'Deploy applications using containers, load balancers, and automated scaling.',
            'category': 'devops',
        },
        {
            'title': 'API design principles',
            'content': 'Create RESTful endpoints with proper versioning, authentication, and error handling.',
            'category': 'engineering',
        },
    ]
)
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Inserting rows into \`articles\`: 4 rows \[00:00, 577.69 rows/s]
  Inserted 4 rows with 0 errors.
  4 rows inserted, 12 values computed.
</pre>

### Add semantic search

Create an embedding index on the `content` column:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Add embedding index
kb.add_embedding_index(
    column='content',
    string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2'),
)
```

### Search by meaning

Find content semantically similar to your query:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Search by meaning
query = 'how to fix bugs'
sim = kb.content.similarity(string=query)

results = (
    kb.order_by(sim, asc=False)
    .select(kb.title, kb.content, score=sim)
    .limit(2)
)
results.collect()
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[1] }} />

### Filter by metadata

Combine semantic search with metadata filters by applying a `where()` clause before ranking:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Search within a specific category
query = 'best practices'
sim = kb.content.similarity(string=query)

results = (
    kb.where(kb.category == 'engineering')  # Filter first
    .order_by(sim, asc=False)
    .select(kb.title, kb.category, score=sim)
    .limit(2)
)
results.collect()
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[2] }} />

## Explanation

**How similarity search works:**

1. Your query is converted to an embedding vector by the same model that built the index
2. Pixeltable finds the most similar vectors in the index
3. Results are ranked by cosine similarity, which ranges from -1 to 1 (higher means more similar)
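
The ranking step can be sketched in plain Python. This is an illustration under stated assumptions, not Pixeltable's implementation: the three-dimensional vectors are made up to stand in for real embeddings, and `cosine_similarity` and `top_k` are hypothetical helpers. The brute-force scan is only for clarity; vector indexes typically avoid comparing the query against every row.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; ranges from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Score every indexed vector against the query and return the k best (title, score) pairs."""
    scored = [(title, cosine_similarity(query_vec, vec)) for title, vec in indexed.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 3-dimensional vectors standing in for real embeddings (assumption)
toy_index = {
    'Debugging best practices': [0.9, 0.1, 0.0],
    'API design principles': [0.5, 0.5, 0.1],
    'Production infrastructure setup': [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]  # pretend embedding of 'how to fix bugs'

for title, score in top_k(toy_query, toy_index):
    print(f'{title}: {score:.3f}')
```

Because cosine similarity depends only on the angle between vectors, the length of a vector does not affect the ranking. Vectors pointing in opposite directions score -1, which is why the range is not limited to 0 to 1.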

**Embedding models:**

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[3] }} />

**New content is indexed automatically:**

When you insert new rows, Pixeltable computes their embeddings and updates the index; no extra code is needed.

## See also

* [Vector database
  documentation](/platform/embedding-indexes)
* [Split documents for
  RAG](/howto/cookbooks/text/doc-chunk-for-rag)
