Break PDFs and documents into searchable chunks for retrieval-augmented
generation (RAG) pipelines.
Problem
You have PDF documents or text files that you want to use for
retrieval-augmented generation (RAG). Before you can search them, you
need to:
- Split documents into smaller chunks
- Generate embeddings for each chunk
- Store everything in a searchable index
Solution
What’s in this recipe:
- Split PDFs into sentence-level chunks
- Control chunk size with token limits
- Add embeddings for semantic search
You create a view with a DocumentSplitter iterator that automatically
breaks documents into chunks. Then you add an embedding index for
semantic search.
Setup
%pip install -qU pixeltable sentence-transformers spacy
import pixeltable as pxt
from pixeltable.iterators import DocumentSplitter
from pixeltable.functions.huggingface import sentence_transformer
Load documents
# Create a fresh directory
pxt.drop_dir('rag_demo', force=True)
pxt.create_dir('rag_demo')
Created directory ‘rag_demo’.
<pixeltable.catalog.dir.Dir at 0x3d8e31710>
# Create table for documents
docs = pxt.create_table('rag_demo.documents', {'document': pxt.Document})
Created table ‘documents’.
# Insert a sample PDF
docs.insert([
{'document': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Argus-Market-Digest-June-2024.pdf'}
])
Inserting rows into `documents`: 1 rows [00:00, 775.86 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 2 values computed.
Split into chunks
Create a view that splits each document into sentences with a token
limit:
# Create a view that splits documents into chunks
chunks = pxt.create_view(
'rag_demo.chunks',
docs,
iterator=DocumentSplitter.create(
document=docs.document,
separators='sentence,token_limit', # Split by sentence with token limit
limit=300 # Max 300 tokens per chunk
)
)
Inserting rows into `chunks`: 217 rows [00:00, 42111.88 rows/s]
# View the chunks
chunks.select(chunks.text).head(5)
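To sanity-check the split, you can also count how many chunk rows the view produced (217 in the run above):
# Number of chunk rows produced from the inserted document
chunks.count()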
Add semantic search
Create an embedding index on the chunks for similarity search:
# Add embedding index for semantic search
chunks.add_embedding_index(
column='text',
string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2')
)
Search your documents
Use similarity search to find relevant chunks:
# Search for relevant chunks
query = "market trends"
sim = chunks.text.similarity(query)
results = (
chunks
.order_by(sim, asc=False)
.select(chunks.text, score=sim)
.limit(3)
)
results.collect()
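If you run this kind of query repeatedly, you can wrap it in a small helper. This is a convenience sketch built only from the calls shown above; the name search_chunks and its parameters are our own, not part of the Pixeltable API.
# Convenience wrapper around the similarity query above.
# The function name and parameters are illustrative, not a Pixeltable API.
def search_chunks(query: str, top_k: int = 3):
    sim = chunks.text.similarity(query)
    return (
        chunks
        .order_by(sim, asc=False)
        .select(chunks.text, score=sim)
        .limit(top_k)
        .collect()
    )

search_chunks('interest rate outlook', top_k=5)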
Explanation
Separator options:
- Separators can be combined as a comma-separated string: separators='sentence,token_limit' splits on sentence boundaries and then enforces the token limit.
Chunk sizing:
- limit: maximum number of tokens per chunk (default: 500)
- overlap: number of tokens shared between consecutive chunks (default: 0)
Both parameters appear together in the sketch below.
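A minimal sketch of how these options fit together, using the overlap parameter described above; the view name 'rag_demo.chunks_overlap' and the 50-token overlap are illustrative values, not recommendations.
# Illustrative variant of the chunks view: same separators, plus overlap.
# The view name and the 50-token overlap are example values, not recommendations.
overlapping_chunks = pxt.create_view(
    'rag_demo.chunks_overlap',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.document,
        separators='sentence,token_limit',  # split by sentence, then cap by tokens
        limit=300,    # max 300 tokens per chunk
        overlap=50    # 50 tokens shared between consecutive chunks
    )
)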
New documents are processed automatically:
- When you insert new documents into docs, the chunks view and its embedding index are updated without any extra code, as the sketch below shows.
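For example (the URL below is a placeholder, not a real document), inserting another PDF makes it searchable with the same query as before:
# The URL is a placeholder; substitute any reachable PDF.
docs.insert([
    {'document': 'https://example.com/another-market-report.pdf'}
])

# The chunks view and embedding index pick up the new rows automatically,
# so the same similarity query now also covers the new document.
sim = chunks.text.similarity('market trends')
chunks.order_by(sim, asc=False).select(chunks.text, score=sim).limit(3).collect()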
See also