Export data for ML training

Convert Pixeltable data to PyTorch DataLoader format for model training.

Problem

You have prepared training data (images with labels, text with embeddings, or multimodal data) and need to export it for PyTorch model training.

Solution

What’s in this recipe:

Convert query results to PyTorch Dataset
Use with DataLoader for batch training
Export to Parquet for external tools

Call to_pytorch_dataset() on a query, such as the result of table.select(...), to create an iterable dataset compatible with PyTorch DataLoader.

Setup

%pip install -qU pixeltable torch torchvision

import pixeltable as pxt
import torch
from torch.utils.data import DataLoader

# Create a fresh directory
pxt.drop_dir('pytorch_demo', force=True)
pxt.create_dir('pytorch_demo')

Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘pytorch_demo’.
<pixeltable.catalog.dir.Dir at 0x16c534ad0>

Create sample training data

# Create table with images and labels
training_data = pxt.create_table(
    'pytorch_demo.training_data',
    {'image': pxt.Image, 'label': pxt.Int}
)

Created table ‘training_data’.

# Insert sample images with labels
base_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images'
samples = [
    {'image': f'{base_url}/000000000036.jpg', 'label': 0},  # cat
    {'image': f'{base_url}/000000000090.jpg', 'label': 1},  # other
    {'image': f'{base_url}/000000000139.jpg', 'label': 1},  # other
]
training_data.insert(samples)

Inserting rows into `training_data`: 3 rows [00:00, 659.03 rows/s]
Inserted 3 rows with 0 errors.
3 rows inserted, 6 values computed.

Export to PyTorch dataset

# Add a resize step to ensure all images have the same size
training_data.add_computed_column(
    image_resized=training_data.image.resize((224, 224))
)

# Convert to PyTorch dataset
# 'pt' format returns images as CxHxW tensors with values in [0,1]
pytorch_dataset = training_data.select(
    training_data.image_resized,
    training_data.label
).to_pytorch_dataset(image_format='pt')

Added 3 column values with 0 errors.
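
To make the 'pt' layout concrete, the sketch below converts a stand-in image from height x width x channels uint8 into a CxHxW float tensor with values in [0, 1]. It only illustrates the layout described in the comment above; the random tensor and the variable names are assumptions, and this is not claimed to be how Pixeltable performs the conversion internally.

```python
import torch

# Stand-in for a decoded RGB image: height x width x channels, uint8 in [0, 255].
# (Random data, used only to illustrate the layout.)
hwc_uint8 = torch.randint(0, 256, (224, 224, 3), dtype=torch.uint8)

# Move channels first (CxHxW) and rescale to floats in [0, 1].
chw_float = hwc_uint8.permute(2, 0, 1).to(torch.float32) / 255.0

print(chw_float.shape, chw_float.dtype)
```

Channels-first float tensors are the layout that torchvision models and most PyTorch image layers expect, so no further reshaping should be needed before batching.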

# Use with PyTorch DataLoader
dataloader = DataLoader(pytorch_dataset, batch_size=2)

# Get first batch to verify the shape
batch = next(iter(dataloader))
batch['image_resized'].shape  # Should be (2, 3, 224, 224) - batch_size x channels x height x width

torch.Size([2, 3, 224, 224])
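
Once batches arrive as dictionaries keyed by column name, a training loop can consume them directly. The following is a minimal sketch under stated assumptions: FakeRows is a hypothetical stand-in that yields rows shaped like the ones above (so the example runs without Pixeltable), and the linear model, optimizer, and learning rate are placeholders rather than recommendations.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, IterableDataset


class FakeRows(IterableDataset):
    """Hypothetical stand-in yielding dict rows shaped like the recipe's dataset."""

    def __init__(self, n_rows: int = 4):
        self.n_rows = n_rows

    def __iter__(self):
        gen = torch.Generator().manual_seed(0)
        for i in range(self.n_rows):
            yield {
                'image_resized': torch.rand(3, 224, 224, generator=gen),
                'label': i % 2,
            }


loader = DataLoader(FakeRows(), batch_size=2)

# Placeholder model: flatten each image and map it to two class logits.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

losses = []
for batch in loader:
    images = batch['image_resized']  # (batch, 3, 224, 224) floats
    labels = torch.as_tensor(batch['label'], dtype=torch.long)

    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

To try this with the real data, replace FakeRows() with pytorch_dataset from the earlier cell; the batch keys match the selected column names.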

Export to Parquet for external tools

import tempfile
from pathlib import Path

# Export to Parquet for use with other ML tools
export_path = Path(tempfile.mkdtemp()) / 'training_data'

pxt.io.export_parquet(
    training_data.select(training_data.label),  # Non-image columns
    parquet_path=export_path
)
