Convert Pixeltable data to PyTorch DataLoader format for model training.
Problem
You have prepared training data—images with labels, text with
embeddings, or multimodal data—and need to export it for PyTorch model
training.
Solution
What’s in this recipe:
- Convert query results to PyTorch Dataset
- Use with DataLoader for batch training
- Export to Parquet for external tools
Use query.to_pytorch_dataset() to create an iterable dataset that is
compatible with PyTorch's DataLoader.
Setup
%pip install -qU pixeltable torch torchvision
import pixeltable as pxt
import torch
from torch.utils.data import DataLoader
# Create a fresh directory
pxt.drop_dir('pytorch_demo', force=True)
pxt.create_dir('pytorch_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘pytorch_demo’.
<pixeltable.catalog.dir.Dir at 0x16c534ad0>
Create sample training data
# Create table with images and labels
training_data = pxt.create_table(
    'pytorch_demo.training_data',
    {'image': pxt.Image, 'label': pxt.Int}
)
Created table ‘training_data’.
# Insert sample images with labels
base_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images'
samples = [
    {'image': f'{base_url}/000000000036.jpg', 'label': 0},  # cat
    {'image': f'{base_url}/000000000090.jpg', 'label': 1},  # other
    {'image': f'{base_url}/000000000139.jpg', 'label': 1},  # other
]
training_data.insert(samples)
Inserting rows into `training_data`: 3 rows [00:00, 659.03 rows/s]
Inserted 3 rows with 0 errors.
3 rows inserted, 6 values computed.
Export to PyTorch dataset
# Add a resize step to ensure all images have the same size
training_data.add_computed_column(
    image_resized=training_data.image.resize((224, 224))
)
# Convert to PyTorch dataset
# 'pt' format returns images as CxHxW tensors with values in [0,1]
pytorch_dataset = training_data.select(
    training_data.image_resized,
    training_data.label
).to_pytorch_dataset(image_format='pt')
Added 3 column values with 0 errors.
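To make the 'pt' layout concrete, here is a minimal sketch that converts a CxHxW float tensor in [0,1] to an HxWxC uint8 image and back, which is handy when a plotting or augmentation library expects the channels-last layout. The tensor is random stand-in data rather than a row from the table, and the helper names to_hwc_uint8 and to_chw_float are hypothetical, not part of Pixeltable or PyTorch.

```python
import torch

# Stand-in for one exported image: CxHxW float32 tensor with values in [0, 1].
# (Hypothetical data; a real sample would come from the Pixeltable dataset.)
image_pt = torch.rand(3, 224, 224)


def to_hwc_uint8(image: torch.Tensor) -> torch.Tensor:
    """Convert a CxHxW float image in [0, 1] to an HxWxC uint8 image."""
    return (image.permute(1, 2, 0) * 255).round().clamp(0, 255).to(torch.uint8)


def to_chw_float(image: torch.Tensor) -> torch.Tensor:
    """Convert an HxWxC uint8 image back to a CxHxW float image in [0, 1]."""
    return image.permute(2, 0, 1).to(torch.float32) / 255


image_hwc = to_hwc_uint8(image_pt)   # shape (224, 224, 3), dtype uint8
roundtrip = to_chw_float(image_hwc)  # shape (3, 224, 224), dtype float32
```

The round trip loses at most half a quantization step (0.5/255) per value, because the uint8 representation has only 256 levels.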
# Use with PyTorch DataLoader
dataloader = DataLoader(pytorch_dataset, batch_size=2)
# Get first batch to verify the shape
batch = next(iter(dataloader))
batch['image_resized'].shape # Should be (2, 3, 224, 224) - batch_size x channels x height x width
torch.Size([2, 3, 224, 224])
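Once batches have the expected shape, they can feed a training loop. The sketch below is one possible way to do that, not the recipe's own code: FakeImageRows is a hypothetical stand-in that yields dict rows with the same keys as the export above (image_resized, label), so the example runs without a Pixeltable database. The linear model, learning rate, and the train_one_epoch helper are likewise illustrative assumptions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, IterableDataset


class FakeImageRows(IterableDataset):
    """Hypothetical stand-in yielding dict rows shaped like the recipe's export."""

    def __init__(self, num_rows: int = 3):
        self.num_rows = num_rows

    def __iter__(self):
        for i in range(self.num_rows):
            yield {
                'image_resized': torch.rand(3, 224, 224),  # CxHxW, values in [0, 1]
                'label': i % 2,                            # integer class id
            }


def train_one_epoch(loader, model, optimizer, loss_fn):
    """Run one pass over the loader and return the loss of each batch."""
    losses = []
    model.train()
    for batch in loader:
        images = batch['image_resized']  # (batch, 3, 224, 224)
        labels = batch['label'].long()   # (batch,)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


torch.manual_seed(0)
# Deliberately tiny classifier: flatten the image and map it to two classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = DataLoader(FakeImageRows(), batch_size=2)
losses = train_one_epoch(loader, model, optimizer, nn.CrossEntropyLoss())
```

With three rows and batch_size=2, the loader yields two batches (of sizes 2 and 1), so losses has two entries. Replacing FakeImageRows() with the dataset returned by to_pytorch_dataset() should work the same way, provided the label column collates to an integer tensor.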
import tempfile
from pathlib import Path
# Export to Parquet for use with other ML tools
export_path = Path(tempfile.mkdtemp()) / 'training_data'
pxt.io.export_parquet(
    training_data.select(training_data.label),  # Non-image columns
    parquet_path=export_path
)
Explanation
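Export methods:
- query.to_pytorch_dataset(): iterable dataset for use with PyTorch DataLoader
- pxt.io.export_parquet(): Parquet files for use with other ML tools
Image format options:
- 'pt': images as CxHxW tensors with values in [0,1]
- 'np': images as HxWxC uint8 NumPy arrays
DataLoader tips: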
- Data is cached to disk for efficient repeated loading
- Use num_workers > 0 for parallel data loading
- Filter/transform data before export to reduce size