Related use case: Data Wrangling for ML
## Concept Mapping
| Your DIY Stack | Pixeltable Equivalent |
|---|---|
| S3 buckets for media files | pxt.Image, pxt.Video, pxt.Audio columns — can still read from S3 |
| DVC for data versioning | Built-in history(), revert(), create_snapshot() |
| Airflow / cron for scheduling | Computed columns — run automatically on insert |
| Custom scripts with OpenCV / PIL | @pxt.udf functions as computed columns |
| cv2.VideoCapture() + frame loops | FrameIterator via create_view() |
| Manual retry logic (tenacity) | Automatic retries with result caching |
| Embeddings as numpy / Parquet | add_embedding_index() with HNSW search |
| torch.utils.data.Dataset boilerplate | to_pytorch_dataset() — one line |
| Re-run pipeline when data changes | Incremental — only new rows are processed |
## Side by Side: Image Processing Pipeline

Process images: generate thumbnails, caption with an LLM, embed for search, version everything.

- Custom Scripts
- Pixeltable
### What Changes

| Scenario | Custom Scripts | Pixeltable |
|---|---|---|
| New images | Re-run the entire pipeline | images.insert([...]) — everything downstream runs |
| Change model | Re-run everything; DVC tracks snapshots, not transforms | Drop and re-add the column — only that column recomputes |
| Versioning | dvc add + git commit ceremony | Automatic — images.history(), pxt.create_snapshot() |
| Scheduling | Airflow, cron, or manual re-runs | Not needed — computed columns run on insert |
| Retries | try/except with backoff in every function | Built-in; successful results are cached |
| Search | Brute-force numpy, or set up a vector DB | add_embedding_index() with HNSW |
| PyTorch export | Custom Dataset class | images.to_pytorch_dataset() |
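What the "Retries" row means on the custom-scripts side: hand-rolled exponential backoff around every flaky call. This sketch uses only the standard library; `flaky_call` is a stand-in for any network or API function.

```python
import functools
import time

def with_backoff(max_tries=3, base_delay=0.01):
    """Retry a function with exponential backoff on any exception."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_tries):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_tries - 1:
                        raise  # out of retries, propagate
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return deco

calls = {'n': 0}

@with_backoff()
def flaky_call():
    # Fails twice, then succeeds, simulating a transient API error.
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('transient failure')
    return 'ok'
```

This decorator has to be repeated (or imported) in every pipeline script; Pixeltable's computed columns get retry-and-cache behavior without any of it.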
## Common Patterns

### Video frame extraction

- OpenCV
- Pixeltable
### Data versioning

- DVC
- Pixeltable
### PyTorch export

- Custom Dataset
- Pixeltable
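The "Custom Dataset" tab boils down to boilerplate like this: a map-style dataset only needs `__len__` and `__getitem__`, which is what `to_pytorch_dataset()` generates for you. A pure-Python sketch (no torch import is needed to show the shape); the record fields are illustrative.

```python
class ImageCaptionDataset:
    """Hand-written map-style dataset: the two methods a
    torch.utils.data.DataLoader actually requires."""

    def __init__(self, records):
        self.records = records  # e.g. rows exported from your pipeline

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        row = self.records[idx]
        return row['img_path'], row['caption']

ds = ImageCaptionDataset([
    {'img_path': 'a.jpg', 'caption': 'a cat'},
    {'img_path': 'b.jpg', 'caption': 'a dog'},
])
```

Every project re-writes a variant of this class (plus decoding, transforms, and error handling); the one-line export replaces all of it.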