
# Extract text from PowerPoint, Word, and Excel files

<a href="https://kaggle.com/kernels/welcome?src=https://github.com/pixeltable/pixeltable/blob/release/docs/release/howto/cookbooks/text/doc-extract-text-from-office-files.ipynb" id="openKaggle" target="_blank" rel="noopener noreferrer"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" alt="Open in Kaggle" style={{ display: 'inline', margin: '0px' }} noZoom /></a>  <a href="https://colab.research.google.com/github/pixeltable/pixeltable/blob/release/docs/release/howto/cookbooks/text/doc-extract-text-from-office-files.ipynb" id="openColab" target="_blank" rel="noopener noreferrer"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" style={{ display: 'inline', margin: '0px' }} noZoom /></a>  <a href="https://raw.githubusercontent.com/pixeltable/pixeltable/refs/tags/release/docs/release/howto/cookbooks/text/doc-extract-text-from-office-files.ipynb" id="downloadNotebook" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/%E2%AC%87-Download%20Notebook-blue" alt="Download Notebook" style={{ display: 'inline', margin: '0px' }} noZoom /></a>

<Tip>This documentation page is also available as an interactive notebook. You can launch the notebook in
Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the
above links.</Tip>

export const quartoRawHtml = [`
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<colgroup>
<col style="width: 50%" />
<col style="width: 50%" />
</colgroup>
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">doc</th>
<th data-quarto-table-cell-role="th">text</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;"><div class="pxt_document" style="width:320px;">
<a
href="http://127.0.0.1:50538/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx">/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx</a>
</div></td>
<td style="vertical-align: middle;">November 6 2025 Open-Source Data Infrastructure for Multimodal AI
Marcel Kornacker Notes: About me Co-founder &amp; CTO, Pixeltable UC
Berkeley: PhD in Database Systems (advisor: Joe Hellerstein) Google
(2003-2010): Tech lead for F1 database, worked on scalable data
infrastructure Cloudera: Co-creator of Apache Parquet Created Apache
Impala (first database to use LLVM for runtime code generation) ‹#›
Notes: The problem with AI development today ‹#› Notes: "I want to make
a searchable collection ...... tic Propagation
================================================================================
‹#› Notes: Your one stop shop for developing AI-based data products
Complete - capture all the data you need, doesn't limit what you do with
the data Store of record - don't need separate place [ ] - express any
transformation or other application logic → Complete - real production
is multi user → Complete - real AI use cases require captures all the
data types → Complete - augment it ‹#› Notes:</td>
</tr>
</tbody>
</table>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">heading</th>
<th data-quarto-table-cell-role="th">text</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">{}</td>
<td style="vertical-align: middle;">November 6 2025</td>
</tr>
<tr>
<td style="vertical-align: middle;">{"h1": "Open-Source Data Infrastructure for Multimodal AI"}</td>
<td style="vertical-align: middle;">Open-Source Data Infrastructure for Multimodal AI Marcel
Kornacker</td>
</tr>
<tr>
<td style="vertical-align: middle;">{"h1": "Open-Source Data Infrastructure for Multimodal AI", "h3":
"Notes:"}</td>
<td style="vertical-align: middle;">Notes:</td>
</tr>
<tr>
<td style="vertical-align: middle;">{"h1": "About me"}</td>
<td style="vertical-align: middle;">About me Co-founder &amp; CTO, Pixeltable UC Berkeley: PhD in
Database Systems (advisor: Joe Hellerstein) Google (2003-2010): Tech
lead for F1 database, worked on scalable data infrastructure Cloudera:
Co-creator of Apache Parquet Created Apache Impala (first database to
use LLVM for runtime code generation) ‹#›</td>
</tr>
<tr>
<td style="vertical-align: middle;">{"h1": "About me", "h3": "Notes:"}</td>
<td style="vertical-align: middle;">Notes:</td>
</tr>
</tbody>
</table>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<colgroup>
<col style="width: 33%" />
<col style="width: 33%" />
<col style="width: 33%" />
</colgroup>
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">doc</th>
<th data-quarto-table-cell-role="th">heading</th>
<th data-quarto-table-cell-role="th">text</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;"><div class="pxt_document" style="width:320px;">
<a
href="http://127.0.0.1:50538/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx">/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx</a>
</div></td>
<td style="vertical-align: middle;">{}</td>
<td style="vertical-align: middle;">November 6 2025</td>
</tr>
<tr>
<td style="vertical-align: middle;"><div class="pxt_document" style="width:320px;">
<a
href="http://127.0.0.1:50538/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx">/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx</a>
</div></td>
<td style="vertical-align: middle;">{}</td>
<td style="vertical-align: middle;">6 2025</td>
</tr>
<tr>
<td style="vertical-align: middle;"><div class="pxt_document" style="width:320px;">
<a
href="http://127.0.0.1:50538/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx">/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx</a>
</div></td>
<td style="vertical-align: middle;">{}</td>
<td style="vertical-align: middle;">6 2025</td>
</tr>
</tbody>
</table>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<colgroup>
<col style="width: 50%" />
<col style="width: 50%" />
</colgroup>
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">doc</th>
<th data-quarto-table-cell-role="th">text</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;"><div class="pxt_document" style="width:320px;">
<a
href="http://127.0.0.1:50538/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx">/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx</a>
</div></td>
<td style="vertical-align: middle;">Storage 🗄️ Orchestration ⚙️ What you get: Videos in S3, embeddings
in Pinecone, metadata in Postgres… Data loaded into memory, exported to
file formats File formats that don't support media data Manual tracking
of what lives where What you miss: Transactions: Models fail halfway →
data stays inconsistent Concurrency: Multiple users → can't work on same
data simultaneously Persistence: Work happens in memory → doesn't map to
traditional database schemas OLTP capabilities: Built for batch → ca
...... g tools together Cron jobs and Python scripts for every step
Manually handling rate limits, retries, chasing API errors Wild goose
chase when requirements change What you miss: Dependency tracking:
Transforms happen in scripts → hard to trace what depends on what Low
latency/high throughput: Hard to parallelize external API calls → poor
performance Failure handling: Something fails somewhere → rerun
Operational integrity: Inconsistent models for indexing and querying →
contaminated index ‹#›</td>
</tr>
<tr>
<td style="vertical-align: middle;"><div class="pxt_document" style="width:320px;">
<a
href="http://127.0.0.1:50538/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx">/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx</a>
</div></td>
<td style="vertical-align: middle;">Storage 🗄️ Orchestration ⚙️ What you get: Videos in S3, embeddings
in Pinecone, metadata in Postgres… Data loaded into memory, exported to
file formats File formats that don't support media data Manual tracking
of what lives where What you miss: Transactions: Models fail halfway →
data stays inconsistent Concurrency: Multiple users → can't work on same
data simultaneously Persistence: Work happens in memory → doesn't map to
traditional database schemas OLTP capabilities: Built for batch → ca
...... g tools together Cron jobs and Python scripts for every step
Manually handling rate limits, retries, chasing API errors Wild goose
chase when requirements change What you miss: Dependency tracking:
Transforms happen in scripts → hard to trace what depends on what Low
latency/high throughput: Hard to parallelize external API calls → poor
performance Failure handling: Something fails somewhere → rerun
Operational integrity: Inconsistent models for indexing and querying →
contaminated index ‹#›</td>
</tr>
<tr>
<td style="vertical-align: middle;"><div class="pxt_document" style="width:320px;">
<a
href="http://127.0.0.1:50538/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx">/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx</a>
</div></td>
<td style="vertical-align: middle;">Storage 🗄️ Orchestration ⚙️ What you get: Videos in S3, embeddings
in Pinecone, metadata in Postgres… Data loaded into memory, exported to
file formats File formats that don't support media data Manual tracking
of what lives where What you miss: Transactions: Models fail halfway →
data stays inconsistent Concurrency: Multiple users → can't work on same
data simultaneously Persistence: Work happens in memory → doesn't map to
traditional database schemas OLTP capabilities: Built for batch → ca
...... g tools together Cron jobs and Python scripts for every step
Manually handling rate limits, retries, chasing API errors Wild goose
chase when requirements change What you miss: Dependency tracking:
Transforms happen in scripts → hard to trace what depends on what Low
latency/high throughput: Hard to parallelize external API calls → poor
performance Failure handling: Something fails somewhere → rerun
Operational integrity: Inconsistent models for indexing and querying →
contaminated index ‹#›</td>
</tr>
</tbody>
</table>
`];


Transform office documents into searchable, analyzable text data.

**What’s in this recipe:**

* Extract text from PPTX, DOCX, and XLSX files
* Split documents by headings, paragraphs, or custom limits
* Preserve document structure and metadata for analysis

## Problem

You have office documents—presentations, reports, spreadsheets—that
contain valuable text data. You need to extract this text to analyze
content, search across documents, or feed into AI models.

Manual extraction means opening each file, copying text, and losing
structural information like headings and page boundaries. You need an
automated way to process hundreds or thousands of office files while
preserving their organization.

## Solution

You extract text from office documents using Pixeltable’s document type
with Microsoft’s MarkItDown library. This converts PowerPoint, Word, and
Excel files to structured text automatically.

You use `DocumentSplitter` to split documents by headings, paragraphs,
or token limits. Each splitter creates a view in which every row holds
one chunk of the document together with its metadata.

### Setup

```python  theme={null}
%pip install -qU pixeltable 'markitdown[pptx,docx,xlsx]' mistune tiktoken
```

```python  theme={null}
import pixeltable as pxt
from pixeltable.iterators.document import DocumentSplitter
```

### Load office documents

```python  theme={null}
# Create a fresh directory (drop existing if present)
pxt.drop_dir('office_docs', force=True)
pxt.create_dir('office_docs')
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'office\_docs'.
  \<pixeltable.catalog.dir.Dir at 0x146c24c10>
</pre>

```python  theme={null}
# Create table for office documents
docs = pxt.create_table('office_docs/documents', {'doc': pxt.Document})
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Created table 'documents'.
</pre>

```python  theme={null}
# Sample PowerPoint from Pixeltable repo
# Replace with your own PPTX, DOCX, or XLSX files
sample_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/calpy.pptx'

docs.insert([{'doc': sample_url}])
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Inserting rows into \`documents\`: 1 rows \[00:00, 57.40 rows/s]
  Inserted 1 row with 0 errors.
  1 row inserted, 2 values computed.
</pre>

### Extract full document text

You create a view with `DocumentSplitter` to extract text. Setting
`separators=''` extracts the full document without splitting.

```python  theme={null}
# Create a view to extract full document text
full_text = pxt.create_view(
    'office_docs/full_text',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='',  # No splitting - extract full document
    ),
)
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Inserting rows into \`full\_text\`: 1 rows \[00:00, 196.50 rows/s]
</pre>

```python  theme={null}
# Preview extracted text
full_text.select(full_text.doc, full_text.text).head(1)
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[0] }} />

### Split documents by headings

You split documents by headings to preserve their logical structure.
Each section under a heading becomes a separate chunk.

```python  theme={null}
# Create view that splits by headings
by_heading = pxt.create_view(
    'office_docs/by_heading',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='heading',
        metadata='heading',  # Preserve heading structure
    ),
)
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Inserting rows into \`by\_heading\`: 87 rows \[00:00, 10359.54 rows/s]
</pre>

```python  theme={null}
# View chunks with their headings
by_heading.select(by_heading.heading, by_heading.text).head(5)
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[1] }} />

### Split by token limit for AI models

You split documents by token count when feeding chunks to AI models. The
`overlap` parameter ensures chunks share context at boundaries.

```python  theme={null}
# Create view with token-based splitting
by_tokens = pxt.create_view(
    'office_docs/by_tokens',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='heading,token_limit',  # Split by heading first, then by tokens
        limit=512,  # Maximum tokens per chunk
        overlap=50,  # Overlap between chunks to preserve context
        metadata='heading',
    ),
)
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Inserting rows into \`by\_tokens\`: 2369 rows \[00:00, 9212.05 rows/s]
</pre>

```python  theme={null}
# Preview chunks with token limits
by_tokens.select(by_tokens.doc, by_tokens.heading, by_tokens.text).head(3)
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[2] }} />

### Search across documents

You search across all document chunks using standard Pixeltable queries.

```python  theme={null}
# Find chunks containing specific keywords
by_tokens.where(by_tokens.text.contains('Python')).select(
    by_tokens.doc, by_tokens.text
).head(3)
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[3] }} />

## Explanation

**Supported formats:**

* PowerPoint: `.pptx`, `.ppt`
* Word: `.docx`, `.doc`
* Excel: `.xlsx`, `.xls`

**Separator options:**

* `heading` - Split by document headings (preserves structure)
* `paragraph` - Split by paragraphs
* `sentence` - Split by sentences
* `token_limit` - Split by token count (requires `limit` parameter)
* `char_limit` - Split by character count (requires `limit` parameter)
* Multiple separators work together: `'heading,token_limit'` splits by
  heading first, then ensures no chunk exceeds token limit
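The ordering matters: each separator re-splits the output of the one before it. The plain-Python sketch below illustrates that order conceptually (it is not Pixeltable's implementation; markdown-style `#` lines stand in for `heading`, and a character cap stands in for `char_limit`):

```python
# Conceptual illustration of combined separators (plain Python, NOT
# Pixeltable's implementation): split into sections first, then enforce
# a size limit on each section -- limits never cross a section boundary.

def split_sections(text: str) -> list[str]:
    """Split on markdown-style headings (stand-in for 'heading')."""
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith('#') and current:
            sections.append('\n'.join(current))
            current = []
        current.append(line)
    if current:
        sections.append('\n'.join(current))
    return sections

def enforce_limit(sections: list[str], limit: int) -> list[str]:
    """Re-split any oversized section (stand-in for 'char_limit')."""
    chunks = []
    for s in sections:
        chunks.extend(s[i:i + limit] for i in range(0, len(s), limit))
    return chunks

doc = '# Intro\nshort section\n# Details\n' + 'x' * 250
chunks = enforce_limit(split_sections(doc), limit=100)
# Every chunk respects the limit; the short first section stays whole.
assert all(len(c) <= 100 for c in chunks)
```

Pixeltable's `'heading,token_limit'` follows the same pattern, with real headings and tokenizer tokens instead of these stand-ins.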

**Metadata fields:**

* `heading` - Hierarchical heading structure (e.g.,
  `{'h1': 'Introduction', 'h2': 'Overview'}`)
* `title` - Document title
* `sourceline` - Source line number (HTML and Markdown documents)
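Because the `heading` metadata is a plain dict keyed by level, it is easy to post-process in Python. For example, a small helper (hypothetical, not part of the Pixeltable API) can flatten it into a breadcrumb string for display or filtering:

```python
# Hypothetical helper (not part of Pixeltable): flatten the `heading`
# metadata dict into a single breadcrumb string.

def heading_path(heading: dict[str, str]) -> str:
    # Sort by heading level ('h1' < 'h2' < ...) to preserve nesting order.
    levels = sorted(heading, key=lambda k: int(k[1:]))
    return ' > '.join(heading[level] for level in levels)

print(heading_path({'h1': 'Introduction', 'h2': 'Overview'}))
# Introduction > Overview
```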

**Token overlap:** The `overlap` parameter ensures chunks share context
at boundaries. This prevents sentences from being split mid-thought when
feeding chunks to AI models.
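The sliding-window idea behind `overlap` can be sketched in plain Python (an illustration only; Pixeltable does this internally over real tokenizer tokens, with whitespace-split words standing in here):

```python
# Illustration of overlapping chunks (plain Python, not Pixeltable's
# internals): each new chunk starts `overlap` tokens before the previous
# chunk ends, so consecutive chunks share their boundary tokens.

def overlapping_chunks(tokens: list[str], limit: int, overlap: int) -> list[list[str]]:
    step = limit - overlap  # advance by limit minus the shared overlap
    return [tokens[i:i + limit] for i in range(0, max(len(tokens) - overlap, 1), step)]

words = 'the quick brown fox jumps over the lazy dog'.split()
chunks = overlapping_chunks(words, limit=4, overlap=2)
# The last 2 tokens of each chunk reappear at the start of the next one.
assert chunks[0][-2:] == chunks[1][:2]
```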

## See also

* [Get fast feedback on
  transformations](/howto/cookbooks/core/dev-iterative-workflow)
* [Pixeltable Document
  API](/sdk/latest/document)


Built with [Mintlify](https://mintlify.com).