Open in Kaggle  Open in Colab  Download Notebook
This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.
Create training, validation, and test splits with random or stratified sampling.

Problem

You have a large dataset and need to create subsets for ML training—random samples for quick experiments, stratified samples for balanced classes, or reproducible splits for benchmarking.

Solution

What’s in this recipe:
  • Random sampling with sample(n=...)
  • Percentage-based sampling with sample(fraction=...)
  • Stratified sampling with stratify_by=
You use query.sample() to create random subsets, with optional stratification for balanced class distribution.

Setup

%pip install -qU pixeltable
import pixeltable as pxt
# Create a fresh directory
pxt.drop_dir('sampling_demo', force=True)
pxt.create_dir('sampling_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory 'sampling_demo'.
<pixeltable.catalog.dir.Dir at 0x1471b08e0>

Create sample dataset

# Create a dataset with labels
data = pxt.create_table(
    'sampling_demo.data',
    {'text': pxt.String, 'label': pxt.String, 'score': pxt.Float}
)

# Insert sample data with imbalanced classes
samples = [
    {'text': 'Great product!', 'label': 'positive', 'score': 0.9},
    {'text': 'Love it', 'label': 'positive', 'score': 0.85},
    {'text': 'Amazing quality', 'label': 'positive', 'score': 0.95},
    {'text': 'Best purchase ever', 'label': 'positive', 'score': 0.88},
    {'text': 'Highly recommend', 'label': 'positive', 'score': 0.92},
    {'text': 'Fantastic!', 'label': 'positive', 'score': 0.91},
    {'text': 'Terrible', 'label': 'negative', 'score': 0.1},
    {'text': 'Waste of money', 'label': 'negative', 'score': 0.15},
    {'text': 'It is okay', 'label': 'neutral', 'score': 0.5},
    {'text': 'Average product', 'label': 'neutral', 'score': 0.55},
]
data.insert(samples)
Created table 'data'.
Inserting rows into `data`: 10 rows [00:00, 857.13 rows/s]
Inserted 10 rows with 0 errors.
10 rows inserted, 20 values computed.
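
Before sampling, it helps to see how imbalanced the classes actually are. The check below is a minimal sketch; it assumes the result set returned by collect() iterates as row dictionaries.

from collections import Counter

# Class distribution of the full dataset: 6 positive, 2 negative, 2 neutral
# (assumes collect() yields row dictionaries)
print(Counter(row['label'] for row in data.collect()))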

Random sampling

# Sample exactly N rows
data.sample(n=5, seed=42).collect()
# Sample a percentage of rows
sample_50pct = data.sample(fraction=0.5, seed=42).collect()
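
Because the samples above are seeded, re-running them returns the same rows. A minimal reproducibility check, assuming the collected result supports len() and iterates as row dictionaries:

# Draw the same seeded sample twice and compare
run_1 = data.sample(n=5, seed=42).collect()
run_2 = data.sample(n=5, seed=42).collect()
print(len(run_1), len(run_2))  # both 5
print(sorted(r['text'] for r in run_1) == sorted(r['text'] for r in run_2))  # True: same seed, same rows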

Stratified sampling

# Stratified sampling: 50% from each class
data.sample(fraction=0.5, stratify_by=data.label, seed=42).collect()
# Equal allocation: N rows from each class
data.sample(n_per_stratum=1, stratify_by=data.label, seed=42).collect()
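
To confirm that stratification preserved the class balance, count the labels in the sample. A sketch, again assuming the result iterates as row dictionaries:

from collections import Counter

# Label counts in a 50% stratified sample; expect roughly half of each class
stratified = data.sample(fraction=0.5, stratify_by=data.label, seed=42).collect()
print(Counter(row['label'] for row in stratified))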

Sampling from filtered data

# Sample from filtered query (high-confidence predictions only)
data.where(data.score > 0.8).sample(n=3, seed=42).collect()
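
Filtering and stratification can also be combined. The sketch below assumes .where() composes with stratify_by= the same way it does with plain sampling:

# One row per remaining class, drawn only from rows with score > 0.4
data.where(data.score > 0.4).sample(n_per_stratum=1, stratify_by=data.label, seed=42).collect()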

Persist samples as tables

# Draw train/test samples to persist as tables
train_sample = data.sample(fraction=0.8, seed=42)
test_sample = data.sample(fraction=0.2, seed=43)

# Persist as new tables
train_table = pxt.create_table('sampling_demo.train', source=train_sample)
test_table = pxt.create_table('sampling_demo.test', source=test_sample)
Created table 'train'.
Inserting rows into `train`: 9 rows [00:00, 3080.27 rows/s]
Created table 'test'.
Inserting rows into `test`: 3 rows [00:00, 1333.92 rows/s]
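
The persisted samples are ordinary Pixeltable tables, so a later session can reopen them by path. Note that the train and test samples are drawn independently with different seeds, so they are not guaranteed to be disjoint. A minimal sketch:

# Reopen the persisted samples and check their sizes
train_table = pxt.get_table('sampling_demo.train')
test_table = pxt.get_table('sampling_demo.test')
print(train_table.count(), test_table.count())  # e.g. 9 and 3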

Explanation

Sampling methods:
  • sample(n=...): returns exactly N randomly chosen rows
  • sample(fraction=...): returns a percentage of the rows
  • sample(n_per_stratum=...): returns a fixed number of rows per class (requires stratify_by=)
Stratification options:
  • stratify_by= with fraction=: samples the same fraction from each class
  • stratify_by= with n_per_stratum=: samples an equal number of rows from each class
Tips:
  • Always set seed= for reproducible experiments
  • Use stratified sampling for imbalanced datasets
  • Combine with .where() to sample from subsets (see the sketch after this list)
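
Putting the tips together, here is a sketch of a reproducible, class-balanced sample drawn from a filtered subset (assuming .where(), stratify_by=, and seed= compose as shown above):

# Reproducible, class-balanced sample from a filtered subset
balanced = (
    data
    .where(data.score > 0.4)         # restrict to rows above a threshold
    .sample(fraction=0.5,
            stratify_by=data.label,  # keep per-class proportions
            seed=42)                 # fixed seed for reproducibility
    .collect()
)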

See also