Break PDFs and documents into searchable chunks for retrieval-augmented
generation (RAG) pipelines.
Problem
You have PDF documents or text files that you want to use for
retrieval-augmented generation (RAG). Before you can search them, you
need to:
- Split documents into smaller chunks
- Generate embeddings for each chunk
- Store everything in a searchable index
Solution
What’s in this recipe:
- Split PDFs into sentence-level chunks
- Control chunk size with token limits
- Add embeddings for semantic search
You create a view with a DocumentSplitter iterator that automatically
breaks documents into chunks. Then you add an embedding index for
semantic search.
Setup
%pip install -qU pixeltable sentence-transformers spacy
import pixeltable as pxt
from pixeltable.iterators import DocumentSplitter
from pixeltable.functions.huggingface import sentence_transformer
Load documents
# Create a fresh directory
pxt.drop_dir('rag_demo', force=True)
pxt.create_dir('rag_demo')
Created directory ‘rag_demo’.
<pixeltable.catalog.dir.Dir at 0x3d8e31710>
# Create table for documents
docs = pxt.create_table('rag_demo.documents', {'document': pxt.Document})
Created table ‘documents’.
# Insert a sample PDF
docs.insert([
{'document': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Argus-Market-Digest-June-2024.pdf'}
])
Inserting rows into `documents`: 1 rows [00:00, 775.86 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 2 values computed.
Split into chunks
Create a view that splits each document into sentences with a token
limit:
# Create a view that splits documents into chunks
chunks = pxt.create_view(
'rag_demo.chunks',
docs,
iterator=DocumentSplitter.create(
document=docs.document,
separators='sentence,token_limit', # Split by sentence with token limit
limit=300 # Max 300 tokens per chunk
)
)
Inserting rows into `chunks`: 217 rows [00:00, 42111.88 rows/s]
# View the chunks
chunks.select(chunks.text).head(5)
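To sanity-check the split, you can also count how many chunk rows the view produced (217 in the run above):
# Number of chunk rows produced from the inserted document
chunks.count()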
Add semantic search
Create an embedding index on the chunks for similarity search:
# Add embedding index for semantic search
chunks.add_embedding_index(
column='text',
string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2')
)
Search your documents
Use similarity search to find relevant chunks:
# Search for relevant chunks
query = "market trends"
sim = chunks.text.similarity(query)
results = (
chunks
.order_by(sim, asc=False)
.select(chunks.text, score=sim)
.limit(3)
)
results.collect()
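If you run this kind of query repeatedly, you can wrap it in a small helper. This is a convenience sketch built only from the calls shown above; the name search_chunks and its parameters are our own, not part of the Pixeltable API.
# Convenience wrapper around the similarity query above.
# The function name and parameters are illustrative, not a Pixeltable API.
def search_chunks(query: str, top_k: int = 3):
    sim = chunks.text.similarity(query)
    return (
        chunks
        .order_by(sim, asc=False)
        .select(chunks.text, score=sim)
        .limit(top_k)
        .collect()
    )

search_chunks('interest rate outlook', top_k=5)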
Explanation
Separator options:
- Separators can be combined as a comma-separated string: separators='sentence,token_limit' splits on sentence boundaries and then enforces the token limit.
Chunk sizing:
- limit: maximum number of tokens per chunk (default: 500)
- overlap: number of tokens shared between consecutive chunks (default: 0)
Both parameters appear together in the sketch below.
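A minimal sketch of how these options fit together, using the overlap parameter described above; the view name 'rag_demo.chunks_overlap' and the 50-token overlap are illustrative values, not recommendations.
# Illustrative variant of the chunks view: same separators, plus overlap.
# The view name and the 50-token overlap are example values, not recommendations.
overlapping_chunks = pxt.create_view(
    'rag_demo.chunks_overlap',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.document,
        separators='sentence,token_limit',  # split by sentence, then cap by tokens
        limit=300,    # max 300 tokens per chunk
        overlap=50    # 50 tokens shared between consecutive chunks
    )
)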
New documents are processed automatically:
- When you insert new documents into docs, the chunks view and its embedding index are updated without any extra code, as the sketch below shows.
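For example (the URL below is a placeholder, not a real document), inserting another PDF makes it searchable with the same query as before:
# The URL is a placeholder; substitute any reachable PDF.
docs.insert([
    {'document': 'https://example.com/another-market-report.pdf'}
])

# The chunks view and embedding index pick up the new rows automatically,
# so the same similarity query now also covers the new document.
sim = chunks.text.similarity('market trends')
chunks.order_by(sim, asc=False).select(chunks.text, score=sim).limit(3).collect()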
See also