Related use case: Data Wrangling for ML
Concept Mapping
| Your DIY Stack | Pixeltable Equivalent |
|---|---|
| S3 buckets for media files | pxt.Image, pxt.Video, pxt.Audio columns — can still read from S3 |
| DVC for data versioning | Built-in history(), revert(), create_snapshot() |
| Airflow / cron for scheduling | Computed columns — run automatically on insert |
| Custom scripts with OpenCV / PIL | @pxt.udf functions as computed columns |
| cv2.VideoCapture() + frame loops | frame_iterator via create_view() |
| Manual retry logic (tenacity) | Automatic retries with result caching |
| Embeddings as numpy / Parquet | add_embedding_index() with HNSW search |
| torch.utils.data.Dataset boilerplate | to_pytorch_dataset() in one line |
| Re-run pipeline when data changes | Incremental — only new rows are processed |
Side by Side: Image Processing Pipeline
The task: process images by generating thumbnails, captioning them with an LLM, embedding them for search, and versioning everything.
- Custom Scripts
- Pixeltable
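The Pixeltable side of this pipeline might look roughly like the sketch below. It is an illustration under stated assumptions, not the page's original example: the table name `images`, the 256x256 thumbnail size, the CLIP model ID, and the `build_image_pipeline` wrapper are all hypothetical choices, and the LLM captioning step is left out. Imports are deferred into the function so the snippet loads without Pixeltable installed; actually running it requires `pip install pixeltable` plus the Hugging Face dependencies for CLIP.

```python
from typing import List


def build_image_pipeline(image_paths: List[str], table_name: str = 'images'):
    """Create an image table with a thumbnail column and an embedding index.

    Assumes `table_name` does not exist yet; all names here are illustrative.
    """
    # Deferred imports: only needed when the function is actually called.
    import pixeltable as pxt
    from pixeltable.functions.huggingface import clip

    # One media column; paths may be local files or URLs.
    images = pxt.create_table(table_name, {'image': pxt.Image})

    # Computed column: evaluated automatically for every inserted row.
    images.add_computed_column(thumbnail=images.image.resize((256, 256)))

    # Embedding index for similarity search (model ID is an assumption).
    images.add_embedding_index(
        'image',
        embedding=clip.using(model_id='openai/clip-vit-base-patch32'),
    )

    # Inserting rows triggers the thumbnail and the index update.
    images.insert([{'image': path} for path in image_paths])
    return images
```

Later calls to `images.insert([...])` on the returned table would process only the new rows.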
What Changes
| | Custom Scripts | Pixeltable |
|---|---|---|
| New images | Re-run the entire pipeline | images.insert([...]) — everything downstream runs |
| Change model | Re-run everything; DVC tracks snapshots, not transforms | Drop and re-add the column — only that column recomputes |
| Versioning | dvc add + git commit ceremony | Automatic — images.history(), pxt.create_snapshot() |
| Scheduling | Airflow, cron, or manual re-runs | Not needed — computed columns run on insert |
| Retries | try/except with backoff in every function | Built-in; successful results are cached |
| Search | Brute-force numpy, or set up a vector DB | add_embedding_index() with HNSW |
| PyTorch export | Custom Dataset class | images.to_pytorch_dataset() |
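To make the "New images" and "Retries" rows concrete, here is a minimal sketch of what the custom-script side tends to hand-roll: a content-hash cache so that only unseen inputs are processed, wrapped in retry with exponential backoff. The function name `process_incrementally`, its parameters, and the in-memory `dict` cache are hypothetical; a real script would persist the cache to disk or a database.

```python
import hashlib
import time
from typing import Callable, Dict, Iterable


def process_incrementally(
    items: Iterable[bytes],
    transform: Callable[[bytes], object],
    cache: Dict[str, object],
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> int:
    """Run `transform` on items not yet in `cache`; return how many were newly processed."""
    processed = 0
    for item in items:
        key = hashlib.sha256(item).hexdigest()  # content hash identifies the input
        if key in cache:
            continue  # result already stored: skip the work
        for attempt in range(1, max_attempts + 1):
            try:
                cache[key] = transform(item)  # successful results are kept
                processed += 1
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the final attempt
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return processed
```

Every transform in a DIY pipeline needs some version of this bookkeeping, which is the work the right-hand column describes as built in.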
Common Patterns
Video frame extraction
- OpenCV
- Pixeltable
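The OpenCV approach usually comes down to looping over every frame and keeping only those that match the target rate. Below is a minimal sketch of that index arithmetic in plain Python, so it runs without OpenCV. `sampled_frame_indices` is a hypothetical helper; in a real script the frame count and source FPS would typically come from `cv2.VideoCapture.get()` with `cv2.CAP_PROP_FRAME_COUNT` and `cv2.CAP_PROP_FPS`.

```python
from typing import List


def sampled_frame_indices(total_frames: int, source_fps: float, target_fps: float) -> List[int]:
    """Return the frame indices to keep when downsampling to `target_fps`."""
    if total_frames < 0 or source_fps <= 0 or target_fps <= 0:
        raise ValueError('frame count must be >= 0 and frame rates must be > 0')

    # Keep one frame every `step` source frames; never upsample (step >= 1).
    step = max(source_fps / target_fps, 1.0)

    indices = []
    k = 0
    while True:
        idx = int(k * step)  # multiply rather than accumulate to limit float drift
        if idx >= total_frames:
            break
        indices.append(idx)
        k += 1
    return indices
```

For a 10-frame clip at 30 FPS sampled at 10 FPS, this keeps frames 0, 3, 6, and 9; the frame loop then decodes and saves only those indices.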
Data versioning
- DVC
- Pixeltable
PyTorch export
- Custom Dataset
- Pixeltable
Next Steps
- Data Wrangling for ML: Full use case walkthrough
- Extract Video Frames: Frame extraction with FPS control
- Export to PyTorch: Convert tables to DataLoaders
- Cloud Storage: S3, GCS, Azure, R2, Tigris