| Document | Size | Chunks needed |
| --- | --- | --- |
| annual_report.pdf | 50 pages | ~100 chunks |
| user_manual.pdf | 20 pages | ~40 chunks |
| research_paper.pdf | 10 pages | ~20 chunks |
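The figures in the table work out to roughly two chunks per page. A minimal sketch of that arithmetic follows; the `estimate_chunks` helper and the `CHUNKS_PER_PAGE` constant are hypothetical names introduced here for illustration, and the ratio is only what the table implies. Actual counts depend on the separators and token limit you choose.

```python
import math

# Assumption: ~2 chunks per page, the ratio implied by the table above.
CHUNKS_PER_PAGE = 2


def estimate_chunks(pages: int, chunks_per_page: float = CHUNKS_PER_PAGE) -> int:
    """Return a rough chunk-count estimate for a document with `pages` pages."""
    return math.ceil(pages * chunks_per_page)


# The three example documents from the table
for name, pages in [
    ('annual_report.pdf', 50),
    ('user_manual.pdf', 20),
    ('research_paper.pdf', 10),
]:
    print(f'{name}: ~{estimate_chunks(pages)} chunks')
```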

| text |
| --- |
| MARKET DIGEST |
| - 1 - |
| FRIDAY, JUNE 21, 2024 |
| JUNE 20, DJIA: 39,134.76 UP 299.90 |
| Independent Equity Research Since 1934 ARGUS |

| text | score |
| --- | --- |
| MARKET REVIEW: | 0.558 |
| This is the Market Digest for Friday, June 21, 2024, with analysis of the financial markets and comments on Accenture plc. | 0.489 |
| MARKET DIGEST | 0.479 |

| Separator | Description |
| --- | --- |
| `sentence` | Split on sentence boundaries |
| `heading` | Split on document headings |
| `page` | Split on page breaks |
| `token_limit` | Split at token count only |
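To see how the `sentence` and `token_limit` separators might interact, here is a toy sketch in plain Python. It is not Pixeltable's implementation: it assumes a naive regex in place of a real sentence segmenter, treats whitespace-separated words as tokens, and the function names are hypothetical. Each sentence becomes its own chunk, and only sentences longer than the limit are split further.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive stand-in for a real sentence segmenter: split after ., ! or ?
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


def apply_token_limit(sentence: str, limit: int) -> list[str]:
    # Assumption: whitespace-separated words stand in for tokenizer tokens.
    words = sentence.split()
    return [' '.join(words[i:i + limit]) for i in range(0, len(words), limit)]


def chunk(text: str, limit: int) -> list[str]:
    # Split on sentence boundaries first, then cap each piece at `limit` tokens.
    chunks = []
    for sentence in split_sentences(text):
        chunks.extend(apply_token_limit(sentence, limit))
    return chunks


sample = (
    'Markets rose on Thursday. '
    'The Dow gained nearly three hundred points in a broad rally.'
)
for piece in chunk(sample, limit=6):
    print(repr(piece))
```

With `limit=6`, the short first sentence stays whole while the second is cut into two pieces, so the sample yields three chunks.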

## Solution

**What’s in this recipe:**

* Split PDFs into sentences with token limits
* Control chunk size with token limits
* Add embeddings for semantic search

You create a view with a `document_splitter` iterator that automatically breaks documents into chunks. Then you add an embedding index for semantic search.

### Setup

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
%pip install -qU pixeltable sentence-transformers spacy tiktoken
!python -m spacy download en_core_web_sm -q
```

### Load documents

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
import pixeltable as pxt
from pixeltable.functions.document import document_splitter
from pixeltable.functions.huggingface import sentence_transformer

# Create a fresh directory
pxt.drop_dir('rag_demo', force=True)
pxt.create_dir('rag_demo')
```

  Created directory 'rag\_demo'.

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Create table for documents
docs = pxt.create_table('rag_demo/documents', {'document': pxt.Document})
```

  Created table 'documents'.

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Insert a sample PDF
docs.insert(
    [
        {
            'document': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Argus-Market-Digest-June-2024.pdf'
        }
    ]
)
```

  Inserting rows into \`documents\`: 1 rows \[00:00, 775.86 rows/s]
  Inserted 1 row with 0 errors.
  1 row inserted, 2 values computed.

### Split into chunks

Create a view that splits each document into sentences with a token limit:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Create a view that splits documents into chunks
chunks = pxt.create_view(
    'rag_demo/chunks',
    docs,
    iterator=document_splitter(
        docs.document,
        separators='sentence,token_limit',  # Split by sentence with token limit
        limit=300,  # Max 300 tokens per chunk
    ),
)
```

  Inserting rows into \`chunks\`: 217 rows \[00:00, 42111.88 rows/s]

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# View the chunks
chunks.select(chunks.text).head(5)
```

### Add semantic search

Create an embedding index on the chunks for similarity search:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Add embedding index for semantic search
chunks.add_embedding_index(
    column='text',
    string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2'),
)
```

### Search your documents

Use similarity search to find relevant chunks:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Search for relevant chunks
query = 'market trends'
sim = chunks.text.similarity(string=query)
results = (
    chunks.order_by(sim, asc=False)
    .select(chunks.text, score=sim)
    .limit(3)
)
results.collect()
```
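For intuition about what the similarity query does, the sketch below ranks a few chunks against a query by cosine similarity and keeps the top results. It assumes cosine similarity as the metric, which this recipe does not state, and it uses made-up three-dimensional vectors in place of real `all-MiniLM-L6-v2` embeddings. The `cosine` and `top_k` helpers are hypothetical and are not part of Pixeltable.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], rows: list[tuple[str, list[float]]], k: int = 3):
    # Score every chunk, then sort by descending similarity and keep k rows.
    scored = [(text, cosine(query_vec, vec)) for text, vec in rows]
    return sorted(scored, key=lambda row: row[1], reverse=True)[:k]


# Toy data: invented vectors, not real embeddings.
rows = [
    ('chunk about markets', [0.9, 0.1, 0.0]),
    ('chunk about weather', [0.0, 0.2, 0.9]),
    ('chunk about earnings', [0.7, 0.6, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]

for text, score in top_k(query_vec, rows, k=2):
    print(f'{score:.3f}  {text}')
```

The sort-then-truncate step mirrors the `order_by(sim, asc=False)` and `limit(3)` calls in the query above.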

## Explanation

**Separator options:**

You can combine separators: `separators='sentence,token_limit'`

**Chunk sizing:**

* `limit`: Maximum tokens per chunk (default: 500)
* `overlap`: Tokens to overlap between chunks (default: 0)

**New documents are processed automatically:**

When you insert new documents, chunks and embeddings are generated without extra code.

## See also

* [Iterators documentation](/platform/iterators)
* [RAG demo notebook](/howto/use-cases/rag-demo)