This documentation page is also available as an interactive notebook. You can launch the notebook in
Kaggle or Colab, or download it for use with an IDE or local Jupyter installation.
Transform office documents into searchable, analyzable text data.
What’s in this recipe:

- Extract text from PPTX, DOCX, and XLSX files
- Split documents by headings, paragraphs, or custom limits
- Preserve document structure and metadata for analysis
Problem
You have office documents—presentations, reports, spreadsheets—that
contain valuable text data. You need to extract this text to analyze
content, search across documents, or feed into AI models.
Manual extraction means opening each file, copying text, and losing
structural information like headings and page boundaries. You need an
automated way to process hundreds or thousands of office files while
preserving their organization.
Solution
You extract text from office documents using Pixeltable’s document type
with Microsoft’s MarkItDown library. This converts PowerPoint, Word, and
Excel files to structured text automatically.
You use DocumentSplitter to split documents by headings, paragraphs,
or token limits. Splitting produces a view in which each row represents
one chunk of the document along with its metadata.
Setup
%pip install -qU pixeltable 'markitdown[pptx,docx,xlsx]'
import pixeltable as pxt
from pixeltable.iterators.document import DocumentSplitter
Load office documents
# Create a fresh directory (drop existing if present)
pxt.drop_dir('office_docs', force=True)
pxt.create_dir('office_docs')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘office_docs’.
<pixeltable.catalog.dir.Dir at 0x146c24c10>
# Create table for office documents
docs = pxt.create_table('office_docs.documents', {'doc': pxt.Document})
Created table ‘documents’.
# Sample PowerPoint from Pixeltable repo
# Replace with your own PPTX, DOCX, or XLSX files
sample_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/calpy.pptx'
docs.insert([{'doc': sample_url}])
Inserting rows into `documents`: 1 rows [00:00, 57.40 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 2 values computed.
You create a view with DocumentSplitter to extract text. Setting
separators='' extracts the full document without splitting.
# Create a view to extract full document text
full_text = pxt.create_view(
'office_docs.full_text',
docs,
iterator=DocumentSplitter.create(
document=docs.doc,
separators='', # No splitting - extract full document
)
)
Inserting rows into `full_text`: 1 rows [00:00, 196.50 rows/s]
# Preview extracted text
full_text.select(full_text.doc, full_text.text).head(1)
Split documents by headings
You split documents by headings to preserve their logical structure.
Each section under a heading becomes a separate chunk.
# Create view that splits by headings
by_heading = pxt.create_view(
'office_docs.by_heading',
docs,
iterator=DocumentSplitter.create(
document=docs.doc,
separators='heading',
metadata='heading', # Preserve heading structure
)
)
Inserting rows into `by_heading`: 87 rows [00:00, 10359.54 rows/s]
# View chunks with their headings
by_heading.select(by_heading.heading, by_heading.text).head(5)
Split by token limit for AI models
You split documents by token count when feeding chunks to AI models. The
overlap parameter ensures chunks share context at boundaries.
# Create view with token-based splitting
by_tokens = pxt.create_view(
'office_docs.by_tokens',
docs,
iterator=DocumentSplitter.create(
document=docs.doc,
separators='heading,token_limit', # Split by heading first, then by tokens
limit=512, # Maximum tokens per chunk
overlap=50, # Overlap between chunks to preserve context
metadata='heading',
)
)
Inserting rows into `by_tokens`: 2369 rows [00:00, 9212.05 rows/s]
# Preview chunks with token limits
by_tokens.select(by_tokens.doc, by_tokens.heading, by_tokens.text).head(3)
Search across documents
You search across all document chunks using standard Pixeltable queries.
# Find chunks containing specific keywords
by_tokens.where(by_tokens.text.contains('Python')).select(by_tokens.doc, by_tokens.text).head(3)
Explanation
Supported formats:

- PowerPoint: .pptx, .ppt
- Word: .docx, .doc
- Excel: .xlsx, .xls
Separator options:

- heading - Split by document headings (preserves structure)
- paragraph - Split by paragraphs
- sentence - Split by sentences
- token_limit - Split by token count (requires limit parameter)
- char_limit - Split by character count (requires limit parameter)

Multiple separators work together: 'heading,token_limit' splits by
heading first, then ensures no chunk exceeds the token limit.
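To build intuition for how chained separators compose, here is a plain-Python sketch of the cascading idea (an illustration only, not Pixeltable's implementation): split on markdown-style headings first, then enforce a character limit within each section.

```python
def split_by_heading(text):
    """Split markdown-style text into sections, one per heading."""
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith('#') and current:
            sections.append('\n'.join(current))
            current = []
        current.append(line)
    if current:
        sections.append('\n'.join(current))
    return sections

def enforce_char_limit(sections, limit):
    """Further split any section that exceeds the character limit."""
    chunks = []
    for section in sections:
        for i in range(0, len(section), limit):
            chunks.append(section[i:i + limit])
    return chunks

doc = "# Intro\nShort.\n# Details\n" + "x" * 100
chunks = enforce_char_limit(split_by_heading(doc), limit=60)
```

The first separator decides the coarse boundaries; the second only subdivides sections that are still too large, so short sections pass through untouched.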
Metadata fields:

- heading - Hierarchical heading structure (e.g., {'h1': 'Introduction', 'h2': 'Overview'})
- title - Document title
- sourceline - Source line number (HTML and Markdown documents)
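The hierarchical heading dict can be pictured as running state that updates as the splitter walks the document: a new heading replaces its own level and clears all deeper levels. A plain-Python sketch of that bookkeeping (illustrative only, not Pixeltable's internals):

```python
def heading_context(lines):
    """Track the current heading at each level while scanning
    markdown-style lines; returns (context, text) pairs."""
    context, out = {}, []
    for line in lines:
        if line.startswith('#'):
            level = len(line) - len(line.lstrip('#'))
            # a new heading resets all deeper levels
            context = {k: v for k, v in context.items()
                       if int(k[1:]) < level}
            context[f'h{level}'] = line.lstrip('# ')
        else:
            out.append((dict(context), line))
    return out

rows = heading_context([
    '# Introduction', 'intro text',
    '## Overview', 'overview text',
    '# Results', 'results text',
])
```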
Token overlap: The overlap parameter ensures chunks share context
at boundaries. This prevents sentences from being split mid-thought when
feeding chunks to AI models.
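The effect of overlap can be illustrated with a sliding window over words (again a sketch, not Pixeltable's tokenizer): each chunk starts `limit - overlap` positions after the previous one, so consecutive chunks share their boundary words.

```python
def chunk_with_overlap(words, limit, overlap):
    """Yield windows of at most `limit` words, each starting
    `limit - overlap` words after the previous one."""
    step = limit - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + limit])
        if start + limit >= len(words):
            break
    return chunks

words = [f'w{i}' for i in range(10)]
chunks = chunk_with_overlap(words, limit=4, overlap=2)
# the last 2 words of each chunk reappear at the start of the next
```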
See also