- Extract text from PPTX, DOCX, and XLSX files
- Split documents by headings, paragraphs, or custom limits
- Preserve document structure and metadata for analysis
Problem
You have office documents (presentations, reports, spreadsheets) that contain valuable text data. You need to extract this text to analyze content, search across documents, or feed into AI models. Manual extraction means opening each file, copying text, and losing structural information like headings and page boundaries. You need an automated way to process hundreds or thousands of office files while preserving their organization.
Solution
You extract text from office documents using Pixeltable’s document type with Microsoft’s MarkItDown library. This converts PowerPoint, Word, and Excel files to structured text automatically. You use DocumentSplitter to split documents by headings, paragraphs, or token limits. Each split creates a view where each row represents a chunk of the document with its metadata.
Setup
Load office documents
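The loading step can be sketched as follows. This is a minimal sketch, assuming Pixeltable's create_dir, create_table, and insert calls and its Document column type. The helper names (office_rows, load_office_documents), the column name doc, and any file paths you pass in are hypothetical. The Pixeltable import is deferred into the function so the snippet can be defined before the library is loaded.

```python
from pathlib import Path

# Extensions listed under "Supported formats" in this guide.
OFFICE_SUFFIXES = {'.pptx', '.ppt', '.docx', '.doc', '.xlsx', '.xls'}


def office_rows(paths):
    """Build insert rows for the paths that look like office documents."""
    return [{'doc': str(p)} for p in paths if Path(p).suffix.lower() in OFFICE_SUFFIXES]


def load_office_documents(paths):
    """Create office_docs.documents and insert the given files (sketch)."""
    import pixeltable as pxt  # deferred import; assumes pixeltable and markitdown are installed

    pxt.create_dir('office_docs')
    # 'doc' is a hypothetical column name for the Document column.
    docs = pxt.create_table('office_docs.documents', {'doc': pxt.Document})
    docs.insert(office_rows(paths))
    return docs
```

Calling load_office_documents with a list of local paths or URLs would then produce output like the lines below.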
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘office_docs’.
<pixeltable.catalog.dir.Dir at 0x146c24c10>
Created table ‘documents’.
Inserting rows into `documents`: 1 rows [00:00, 57.40 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 2 values computed.
Extract full document text
You create a view with DocumentSplitter to extract text. Setting separators='' extracts the full document without splitting.
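A minimal sketch of that view, assuming DocumentSplitter is importable from pixeltable.iterators and exposes a create classmethod that takes document and separators arguments. The function name create_full_text_view and the column name doc are hypothetical; docs is assumed to be a table with a Document column.

```python
def create_full_text_view(docs):
    """Create a view with one row per document holding its full text (sketch)."""
    import pixeltable as pxt
    from pixeltable.iterators import DocumentSplitter

    return pxt.create_view(
        'office_docs.full_text',
        docs,
        iterator=DocumentSplitter.create(
            document=docs.doc,  # hypothetical name of the Document column
            separators='',      # empty string: no splitting, one chunk per document
        ),
    )
```

The extracted text should then be available in the view's text column, for example via full_text.select(full_text.text).collect().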
Inserting rows into `full_text`: 1 rows [00:00, 196.50 rows/s]
Split documents by headings
You split documents by headings to preserve their logical structure. Each section under a heading becomes a separate chunk.
Inserting rows into `by_heading`: 87 rows [00:00, 10359.54 rows/s]
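A heading split of this kind can be sketched as below, again assuming DocumentSplitter.create accepts separators and a comma-separated metadata string. The function name create_heading_view and the column name doc are hypothetical.

```python
def create_heading_view(docs):
    """Create a view with one row per heading-delimited section (sketch)."""
    import pixeltable as pxt
    from pixeltable.iterators import DocumentSplitter

    return pxt.create_view(
        'office_docs.by_heading',
        docs,
        iterator=DocumentSplitter.create(
            document=docs.doc,          # hypothetical name of the Document column
            separators='heading',       # start a new chunk at each heading
            metadata='heading,title',   # keep heading hierarchy and title per chunk
        ),
    )
```

Requesting the heading metadata keeps each chunk traceable to its place in the document outline.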
Split by token limit for AI models
You split documents by token count when feeding chunks to AI models. The overlap parameter ensures chunks share context at boundaries.
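A minimal sketch of a token-limited view, assuming DocumentSplitter.create accepts limit and overlap alongside separators='token_limit'. The values 300 and 50 are illustrative only and are not taken from this guide; the function name create_token_view and the column name doc are hypothetical. Token counting may require an extra tokenizer package such as tiktoken.

```python
def create_token_view(docs, limit=300, overlap=50):
    """Create a view whose chunks hold at most `limit` tokens each (sketch)."""
    import pixeltable as pxt
    from pixeltable.iterators import DocumentSplitter

    return pxt.create_view(
        'office_docs.by_tokens',
        docs,
        iterator=DocumentSplitter.create(
            document=docs.doc,          # hypothetical name of the Document column
            separators='token_limit',   # split purely by token count
            limit=limit,                # maximum tokens per chunk (illustrative default)
            overlap=overlap,            # tokens shared between neighbouring chunks
        ),
    )
```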
Inserting rows into `by_tokens`: 2369 rows [00:00, 9212.05 rows/s]
Search across documents
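A keyword search over chunk text can be sketched as follows. It assumes chunks is one of the views created with DocumentSplitter, that the view exposes a text column, and that string columns support a contains method. The function name search_chunks and the max_rows default are hypothetical.

```python
def search_chunks(chunks, keyword, max_rows=5):
    """Return up to max_rows chunks whose text contains keyword (sketch)."""
    return (
        chunks.where(chunks.text.contains(keyword))  # filter rows by chunk text
        .select(chunks.text)                         # keep only the text column
        .limit(max_rows)                             # cap the result size
        .collect()                                   # execute the query
    )
```

Because every chunk is an ordinary row, the same where, select, and limit pattern applies to any of the views.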
You search across all document chunks using standard Pixeltable queries.
Explanation
Supported formats:
- PowerPoint: .pptx, .ppt
- Word: .docx, .doc
- Excel: .xlsx, .xls
Separator options:
- heading - Split by document headings (preserves structure)
- paragraph - Split by paragraphs
- sentence - Split by sentences
- token_limit - Split by token count (requires limit parameter)
- char_limit - Split by character count (requires limit parameter)
- Multiple separators work together: 'heading,token_limit' splits by heading first, then ensures no chunk exceeds the token limit
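To illustrate how combined separators behave, here is a plain-Python sketch of the two-stage idea. It is not Pixeltable's implementation: it assumes Markdown input where lines starting with # are headings, and it counts whitespace-separated words as a stand-in for real tokens. The function name is hypothetical.

```python
import re


def split_by_heading_then_limit(markdown, limit):
    """Split on Markdown headings, then cap each chunk at `limit` words."""
    # Stage 1: start a new section at every line that begins with '#'.
    sections = re.split(r'(?m)^(?=#)', markdown)
    chunks = []
    for section in sections:
        words = section.split()
        # Stage 2: cut long sections into pieces of at most `limit` words.
        for start in range(0, len(words), limit):
            chunks.append(' '.join(words[start:start + limit]))
    return chunks
```

Short sections pass through as a single chunk, while long sections are cut further, so no chunk mixes text from two headings.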
Metadata fields:
- heading - Hierarchical heading structure (e.g., {'h1': 'Introduction', 'h2': 'Overview'})
- title - Document title
- sourceline - Source line number (HTML and Markdown documents)
The overlap parameter ensures chunks share context at boundaries. A sentence cut at the end of one chunk reappears at the start of the next, so it is less likely to lose its context when you feed chunks to AI models.
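The effect of overlap can be shown with a small sliding-window sketch in plain Python. This illustrates the general technique only; Pixeltable's exact boundary handling may differ, and the function name chunk_with_overlap is hypothetical.

```python
def chunk_with_overlap(tokens, limit, overlap):
    """Split tokens into windows of `limit` items that share `overlap` items."""
    if not 0 <= overlap < limit:
        raise ValueError('overlap must be non-negative and smaller than limit')
    step = limit - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + limit])
        if start + limit >= len(tokens):
            break  # the last window already reaches the end
    return chunks
```

With eight tokens, a limit of 4, and an overlap of 1, each window begins with the last token of the previous one, which is the shared context the overlap parameter provides.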