| Data type | Use case |
| --- | --- |
| Images + labels | Image classification |
| Text + embeddings | Fine-tuning embeddings |
| Audio + transcripts | Speech model training |

| Method | Output | Use case |
| --- | --- | --- |
| `to_pytorch_dataset()` | PyTorch IterableDataset | Direct training |
| `export_csv()` | CSV files | Tabular data, spreadsheets |
| `export_json()` | JSON files | Structured data, APIs |
| `export_parquet()` | Parquet files | External tools, sharing |
| `export_lancedb()` | LanceDB | Vector search apps |
| `to_coco_dataset()` | COCO JSON | Object detection |

| Format | Shape | Values | Use |
| --- | --- | --- | --- |
| `'pt'` | CxHxW | [0, 1] float32 | PyTorch models |
| `'np'` | HxWxC | [0, 255] uint8 | NumPy processing |
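To make the two layouts concrete, here is a minimal NumPy-only sketch of how an HxWxC `uint8` array relates to the CxHxW `float32` layout. The helper name `hwc_uint8_to_chw_float` is hypothetical and the input is synthetic. The sketch illustrates the relationship in the table and is not Pixeltable's own conversion code.

```python
import numpy as np


def hwc_uint8_to_chw_float(image: np.ndarray) -> np.ndarray:
    """Convert an HxWxC uint8 image to CxHxW float32 scaled to [0, 1].

    Hypothetical helper for illustration only.
    """
    if image.dtype != np.uint8 or image.ndim != 3:
        raise ValueError('expected an HxWxC uint8 array')
    # Move the channel axis to the front, then rescale 0..255 to 0..1.
    return image.transpose(2, 0, 1).astype(np.float32) / 255.0


# A synthetic 224x224 RGB image stands in for real data.
rng = np.random.default_rng(0)
np_image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

pt_like = hwc_uint8_to_chw_float(np_image)
print(pt_like.shape, pt_like.dtype)
```

The `'np'` layout is what most image-processing libraries expect, while the `'pt'` layout matches the channels-first convention of PyTorch convolution layers.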

## Solution

**What’s in this recipe:**

* Convert query results to PyTorch Dataset
* Use with DataLoader for batch training
* Export to Parquet for external tools

Use `query.to_pytorch_dataset()` to create an iterable dataset that is compatible with the PyTorch DataLoader.

### Setup

```python
%pip install -qU pixeltable torch torchvision
```

```python
import pixeltable as pxt
import torch
from torch.utils.data import DataLoader

# Create a fresh directory
pxt.drop_dir('pytorch_demo', force=True)
pxt.create_dir('pytorch_demo')
```

  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'pytorch\_demo'.

### Create sample training data

```python
# Create table with images and labels
training_data = pxt.create_table(
    'pytorch_demo/training_data',
    {'image': pxt.Image, 'label': pxt.Int}
)
```

  Created table 'training\_data'.

```python
# Insert sample images with labels
base_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images'

samples = [
    {'image': f'{base_url}/000000000036.jpg', 'label': 0},  # cat
    {'image': f'{base_url}/000000000090.jpg', 'label': 1},  # other
    {'image': f'{base_url}/000000000139.jpg', 'label': 1},  # other
]
training_data.insert(samples)
```

  Inserting rows into \`training\_data\`: 3 rows \[00:00, 659.03 rows/s]
  Inserted 3 rows with 0 errors.
  3 rows inserted, 6 values computed.

### Export to PyTorch dataset

```python
# Add a resize step to ensure all images have the same size
training_data.add_computed_column(
    image_resized=training_data.image.resize((224, 224))
)

# Convert to PyTorch dataset
# 'pt' format returns images as CxHxW tensors with values in [0,1]
pytorch_dataset = training_data.select(
    training_data.image_resized,
    training_data.label
).to_pytorch_dataset(image_format='pt')
```

  Added 3 column values with 0 errors.

```python
# Use with PyTorch DataLoader
dataloader = DataLoader(pytorch_dataset, batch_size=2)

# Get first batch to verify the shape
batch = next(iter(dataloader))

# Should be (2, 3, 224, 224) - batch_size x channels x height x width
batch['image_resized'].shape
```

  torch.Size(\[2, 3, 224, 224])
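Once batches have the expected shape, they can feed an ordinary training loop. The sketch below is a minimal illustration, not part of the recipe: `FakeRows` is a hypothetical stand-in that yields dict rows with the same keys (`image_resized`, `label`) so the example runs without a Pixeltable database, and the tiny model is an arbitrary placeholder. It assumes labels collate to integer tensors, as PyTorch's default collate function does for Python ints.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, IterableDataset


class FakeRows(IterableDataset):
    """Hypothetical stand-in that yields dict rows keyed by column name."""

    def __init__(self, n_rows: int = 6) -> None:
        self.n_rows = n_rows

    def __iter__(self):
        gen = torch.Generator().manual_seed(0)
        for i in range(self.n_rows):
            yield {
                'image_resized': torch.rand(3, 224, 224, generator=gen),
                'label': i % 2,
            }


# Swap FakeRows() for the pytorch_dataset created in the recipe.
loader = DataLoader(FakeRows(), batch_size=2)

# A deliberately tiny classifier; any nn.Module that accepts
# (N, 3, 224, 224) inputs could take its place.
model = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

losses = []
for batch in loader:
    images = batch['image_resized']  # float32, shape (N, 3, 224, 224)
    labels = batch['label'].long()   # class indices, shape (N,)
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f'{len(losses)} batches, last loss {losses[-1]:.4f}')
```

Because the dataset is iterable rather than map-style, the loop makes a single pass in whatever order the rows are produced; to train for several epochs, iterate over the loader again.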

### Export to Parquet for external tools

```python
import tempfile
from pathlib import Path

# Export to Parquet for use with other ML tools
export_path = Path(tempfile.mkdtemp()) / 'training_data'

pxt.io.export_parquet(
    training_data.select(training_data.label),  # Non-image columns
    parquet_path=export_path,
)
```

## Explanation

**Export methods:**

**Image format options:**

**DataLoader tips:**

* Data is cached to disk for efficient repeated loading
* Use `num_workers > 0` for parallel data loading
* Filter/transform data before export to reduce size

## See also

* [Sample data for training](/howto/cookbooks/data/data-sampling) - Stratified sampling
* [Import Parquet files](/howto/cookbooks/data/data-import-parquet) - Parquet import/export