Break PDFs and documents into searchable chunks for retrieval-augmented generation (RAG) pipelines.

Problem

You have PDF documents or text files that you want to use for retrieval-augmented generation (RAG). Before you can search them, you need to:
  1. Split documents into smaller chunks
  2. Generate embeddings for each chunk
  3. Store everything in a searchable index

Solution

What’s in this recipe:
  • Split PDFs into sentence-based chunks
  • Control chunk size with token limits
  • Add embeddings for semantic search
You create a view with a DocumentSplitter iterator that automatically breaks documents into chunks. Then you add an embedding index for semantic search.

Setup

%pip install -qU pixeltable sentence-transformers spacy
import pixeltable as pxt
from pixeltable.iterators import DocumentSplitter
from pixeltable.functions.huggingface import sentence_transformer

Load documents

# Create a fresh directory
pxt.drop_dir('rag_demo', force=True)
pxt.create_dir('rag_demo')
Created directory ‘rag_demo’.
<pixeltable.catalog.dir.Dir at 0x3d8e31710>
# Create table for documents
docs = pxt.create_table('rag_demo.documents', {'document': pxt.Document})
Created table ‘documents’.
# Insert a sample PDF
docs.insert([
    {'document': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Argus-Market-Digest-June-2024.pdf'}
])
Inserting rows into `documents`: 1 rows [00:00, 775.86 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 2 values computed.
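If you have more documents, you can insert them the same way; Pixeltable accepts local file paths as well as URLs. The paths below are placeholders (the rest of this recipe assumes only the sample PDF above):
# Optional: insert additional documents (hypothetical local paths)
docs.insert([
    {'document': '/path/to/quarterly-report.pdf'},
    {'document': '/path/to/press-release.pdf'}
])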

Split into chunks

Create a view that splits each document into sentences with a token limit:
# Create a view that splits documents into chunks
chunks = pxt.create_view(
    'rag_demo.chunks',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.document,
        separators='sentence,token_limit',  # Split by sentence with token limit
        limit=300  # Max 300 tokens per chunk
    )
)
Inserting rows into `chunks`: 217 rows [00:00, 42111.88 rows/s]
# View the chunks
chunks.select(chunks.text).head(5)
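As a quick sanity check, you can count how many chunks the splitter produced (this matches the 217 rows reported above):
# Number of chunks generated from the sample PDF
chunks.count()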
Create an embedding index on the chunks for similarity search:
# Add embedding index for semantic search
chunks.add_embedding_index(
    column='text',
    string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2')
)

Search your documents

Use similarity search to find relevant chunks:
# Search for relevant chunks
query = "market trends"
sim = chunks.text.similarity(query)

results = (
    chunks
    .order_by(sim, asc=False)
    .select(chunks.text, score=sim)
    .limit(3)
)
results.collect()
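To reuse this retrieval logic elsewhere in a RAG pipeline, one option is to wrap it in a query function and attach it to a table of questions. This is a sketch that assumes the @pxt.query decorator and add_computed_column available in recent Pixeltable releases; the table and function names are illustrative:
# Reusable retrieval function: return the top 5 chunks for a question
@pxt.query
def top_k_chunks(question: str):
    sim = chunks.text.similarity(question)
    return (
        chunks
        .order_by(sim, asc=False)
        .select(chunks.text, score=sim)
        .limit(5)
    )

# Illustrative questions table; each inserted question gets its context computed automatically
questions = pxt.create_table('rag_demo.questions', {'question': pxt.String})
questions.add_computed_column(context=top_k_chunks(questions.question))
Inserting a row into questions then populates context with the retrieved chunks, ready to pass to an LLM prompt.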

Explanation

Separator options:
You can combine separators, for example separators='sentence,token_limit'.
Chunk sizing:
  • limit: Maximum tokens per chunk (default: 500)
  • overlap: Tokens to overlap between chunks (default: 0); see the sketch below
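For example, a view with overlapping chunks might look like this (the overlap value and view name are illustrative):
# Sketch: same splitter, but consecutive chunks share 50 tokens
overlapping_chunks = pxt.create_view(
    'rag_demo.chunks_overlap',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.document,
        separators='sentence,token_limit',
        limit=300,
        overlap=50
    )
)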
New documents are processed automatically: when you insert additional documents, their chunks and embeddings are generated without any extra code.
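For example, inserting another PDF (the URL below is a placeholder) updates the chunks view and the embedding index with no further calls:
# Placeholder document — substitute a real URL or local path
docs.insert([{'document': 'https://example.com/another-market-digest.pdf'}])

# The new chunks are indexed and immediately searchable
sim = chunks.text.similarity('interest rates')
chunks.order_by(sim, asc=False).select(chunks.text, score=sim).limit(3).collect()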

See also