This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation.
Transform office documents into searchable, analyzable text data.

What’s in this recipe:
- Extract text from PPTX, DOCX, and XLSX files
- Split documents by headings, paragraphs, or custom limits
- Preserve document structure and metadata for analysis

Problem

You have office documents—presentations, reports, spreadsheets—that contain valuable text data. You need to extract this text to analyze content, search across documents, or feed into AI models. Manual extraction means opening each file, copying text, and losing structural information like headings and page boundaries. You need an automated way to process hundreds or thousands of office files while preserving their organization.

Solution

You extract text from office documents using Pixeltable’s document type with Microsoft’s MarkItDown library. This converts PowerPoint, Word, and Excel files to structured text automatically. You use DocumentSplitter to split documents by headings, paragraphs, or token limits. Each split creates a view where each row represents a chunk of the document with its metadata.

Setup

%pip install -qU pixeltable 'markitdown[pptx,docx,xlsx]'
import pixeltable as pxt
from pixeltable.iterators.document import DocumentSplitter

Load office documents

# Create a fresh directory (drop existing if present)
pxt.drop_dir('office_docs', force=True)
pxt.create_dir('office_docs')
Created directory ‘office_docs’.
# Create table for office documents
docs = pxt.create_table('office_docs.documents', {'doc': pxt.Document})
Created table ‘documents’.
# Sample PowerPoint from Pixeltable repo
# Replace with your own PPTX, DOCX, or XLSX files
sample_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/calpy.pptx'

docs.insert([{'doc': sample_url}])
Inserting rows into `documents`: 1 rows [00:00, 57.40 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 2 values computed.
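To ingest your own files in bulk, you can build the insert payload from a directory listing instead of inserting one URL at a time. A minimal sketch, assuming your documents live under a local `office_files/` folder (a hypothetical path, not part of this recipe):

```python
from pathlib import Path

# Gather every supported office document under a (hypothetical) local folder
folder = Path('office_files')
extensions = {'.pptx', '.docx', '.xlsx'}
files = [p for p in folder.glob('**/*') if p.suffix.lower() in extensions]

# Build one row per document, matching the table schema {'doc': pxt.Document}
rows = [{'doc': str(p)} for p in files]

# docs.insert(rows)  # uncomment once the table exists and the folder is populated
```

Each row is a plain dictionary keyed by the table's column name, so the same pattern works for any mix of local paths and URLs.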

Extract full document text

You create a view with DocumentSplitter to extract text. Setting separators='' extracts the full document without splitting.
# Create a view to extract full document text
full_text = pxt.create_view(
    'office_docs.full_text',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='',  # No splitting - extract full document
    )
)
Inserting rows into `full_text`: 1 rows [00:00, 196.50 rows/s]
# Preview extracted text
full_text.select(full_text.doc, full_text.text).head(1)
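Once the view exists, you can post-process the extracted text with plain Python. A sketch of computing simple text statistics, assuming `collect()` returns the rows as dictionaries (an inline sample string stands in for the real result so the snippet is self-contained):

```python
# In a live session you would fetch the text with:
# rows = full_text.select(full_text.text).collect()
# An inline sample stands in for the collected rows here.
rows = [{'text': 'CalPy brings Python to spreadsheets. It runs everywhere.'}]

for row in rows:
    text = row['text']
    stats = {
        'chars': len(text),
        'words': len(text.split()),
        'sentences': text.count('.'),  # rough heuristic, not a real sentence parser
    }
    print(stats)
```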

Split documents by headings

You split documents by headings to preserve their logical structure. Each section under a heading becomes a separate chunk.
# Create view that splits by headings
by_heading = pxt.create_view(
    'office_docs.by_heading',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='heading',
        metadata='heading',  # Preserve heading structure
    )
)
Inserting rows into `by_heading`: 87 rows [00:00, 10359.54 rows/s]
# View chunks with their headings
by_heading.select(by_heading.heading, by_heading.text).head(5)
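To see how the split distributed text across sections, you can tally chunks per heading after collecting the rows. A sketch assuming `collect()` yields dictionaries whose `heading` value is the hierarchical dict described below (e.g., `{'h1': 'Introduction'}`); inline sample rows stand in for the real result:

```python
from collections import Counter

# In a live session: rows = by_heading.select(by_heading.heading).collect()
rows = [
    {'heading': {'h1': 'Introduction'}},
    {'heading': {'h1': 'Introduction'}},
    {'heading': {'h1': 'Usage', 'h2': 'Install'}},
]

# Count chunks per top-level (h1) heading
counts = Counter(r['heading'].get('h1', '(none)') for r in rows)
print(counts.most_common())
```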

Split by token limit for AI models

You split documents by token count when feeding chunks to AI models. The overlap parameter ensures chunks share context at boundaries.
# Create view with token-based splitting
by_tokens = pxt.create_view(
    'office_docs.by_tokens',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='heading,token_limit',  # Split by heading first, then by tokens
        limit=512,  # Maximum tokens per chunk
        overlap=50,  # Overlap between chunks to preserve context
        metadata='heading',
    )
)
Inserting rows into `by_tokens`: 2369 rows [00:00, 9212.05 rows/s]
# Preview chunks with token limits
by_tokens.select(by_tokens.doc, by_tokens.heading, by_tokens.text).head(3)

Search across documents

You search across all document chunks using standard Pixeltable queries.
# Find chunks containing specific keywords
by_tokens.where(by_tokens.text.contains('Python')).select(by_tokens.doc, by_tokens.text).head(3)
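For matching beyond a single literal substring (for example, case-insensitive search or multiple keywords), one option is to collect the rows and filter in plain Python. A minimal sketch on inline sample rows, assuming `collect()` yields dictionaries:

```python
# In a live session: rows = by_tokens.select(by_tokens.text).collect()
rows = [
    {'text': 'Python makes spreadsheets programmable.'},
    {'text': 'The PYTHON runtime is embedded.'},
    {'text': 'Slides about sales figures.'},
]

# Case-insensitive keyword match
keyword = 'python'
matches = [r['text'] for r in rows if keyword in r['text'].lower()]
print(matches)
```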

Explanation

Supported formats:
- PowerPoint: .pptx, .ppt
- Word: .docx, .doc
- Excel: .xlsx, .xls

Separator options:
- heading - Split by document headings (preserves structure)
- paragraph - Split by paragraphs
- sentence - Split by sentences
- token_limit - Split by token count (requires limit parameter)
- char_limit - Split by character count (requires limit parameter)

Multiple separators work together: 'heading,token_limit' splits by heading first, then ensures no chunk exceeds the token limit.

Metadata fields:
- heading - Hierarchical heading structure (e.g., {'h1': 'Introduction', 'h2': 'Overview'})
- title - Document title
- sourceline - Source line number (HTML and Markdown documents)

Token overlap: The overlap parameter ensures chunks share context at boundaries. This prevents sentences from being split mid-thought when feeding chunks to AI models.
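The effect of overlap on chunk count can be worked out directly: with a limit of L tokens and an overlap of O, each chunk after the first advances the window by L - O tokens. A sketch of that arithmetic, using this recipe's settings (limit=512, overlap=50) as an example; this is an idealized sliding-window model, not the splitter's exact algorithm, which also respects heading boundaries:

```python
import math

def estimate_chunks(total_tokens: int, limit: int, overlap: int) -> int:
    """Estimate sliding-window chunk count: each chunk after the
    first advances by (limit - overlap) tokens."""
    if total_tokens <= limit:
        return 1
    step = limit - overlap
    return 1 + math.ceil((total_tokens - limit) / step)

# A 10,000-token document with this recipe's settings:
print(estimate_chunks(10_000, 512, 50))  # → 22
```

Larger overlaps give more redundancy between adjacent chunks at the cost of more chunks (and therefore more model calls) per document.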

See also