Open in Kaggle  Open in Colab  Download Notebook
This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.
Create training, validation, and test splits with random or stratified sampling.

Problem

You have a large dataset and need to create subsets for ML training—random samples for quick experiments, stratified samples for balanced classes, or reproducible splits for benchmarking.

Solution

What’s in this recipe:
  • Random sampling with sample(n=...)
  • Percentage-based sampling with sample(fraction=...)
  • Stratified sampling with stratify_by=
You use query.sample() to create random subsets, with optional stratification for balanced class distribution.

Setup

%pip install -qU pixeltable
import pixeltable as pxt
# Create a fresh directory
pxt.drop_dir('sampling_demo', force=True)
pxt.create_dir('sampling_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory 'sampling_demo'.
<pixeltable.catalog.dir.Dir at 0x1471b08e0>

Create sample dataset

# Create a dataset with labels
data = pxt.create_table(
    'sampling_demo.data',
    {'text': pxt.String, 'label': pxt.String, 'score': pxt.Float}
)

# Insert sample data with imbalanced classes
samples = [
    {'text': 'Great product!', 'label': 'positive', 'score': 0.9},
    {'text': 'Love it', 'label': 'positive', 'score': 0.85},
    {'text': 'Amazing quality', 'label': 'positive', 'score': 0.95},
    {'text': 'Best purchase ever', 'label': 'positive', 'score': 0.88},
    {'text': 'Highly recommend', 'label': 'positive', 'score': 0.92},
    {'text': 'Fantastic!', 'label': 'positive', 'score': 0.91},
    {'text': 'Terrible', 'label': 'negative', 'score': 0.1},
    {'text': 'Waste of money', 'label': 'negative', 'score': 0.15},
    {'text': 'It is okay', 'label': 'neutral', 'score': 0.5},
    {'text': 'Average product', 'label': 'neutral', 'score': 0.55},
]
data.insert(samples)
Created table 'data'.
Inserting rows into `data`: 10 rows [00:00, 857.13 rows/s]
Inserted 10 rows with 0 errors.
10 rows inserted, 20 values computed.
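
Before sampling, it helps to see how imbalanced the classes actually are. The check below is a minimal sketch; it assumes the result set returned by collect() iterates as row dictionaries.

from collections import Counter

# Class distribution of the full dataset: 6 positive, 2 negative, 2 neutral
# (assumes collect() yields row dictionaries)
print(Counter(row['label'] for row in data.collect()))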

Random sampling

# Sample exactly N rows
data.sample(n=5, seed=42).collect()
# Sample a percentage of rows
sample_50pct = data.sample(fraction=0.5, seed=42).collect()
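
Because the samples above are seeded, re-running them returns the same rows. A minimal reproducibility check, assuming the collected result supports len() and iterates as row dictionaries:

# Draw the same seeded sample twice and compare
run_1 = data.sample(n=5, seed=42).collect()
run_2 = data.sample(n=5, seed=42).collect()
print(len(run_1), len(run_2))  # both 5
print(sorted(r['text'] for r in run_1) == sorted(r['text'] for r in run_2))  # True: same seed, same rows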

Stratified sampling

# Stratified sampling: 50% from each class
data.sample(fraction=0.5, stratify_by=data.label, seed=42).collect()
# Equal allocation: N rows from each class
data.sample(n_per_stratum=1, stratify_by=data.label, seed=42).collect()
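
To confirm that stratification preserved the class balance, count the labels in the sample. A sketch, again assuming the result iterates as row dictionaries:

from collections import Counter

# Label counts in a 50% stratified sample; expect roughly half of each class
stratified = data.sample(fraction=0.5, stratify_by=data.label, seed=42).collect()
print(Counter(row['label'] for row in stratified))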

Sampling from filtered data

# Sample from filtered query (high-confidence predictions only)
data.where(data.score > 0.8).sample(n=3, seed=42).collect()
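
Filtering and stratification can also be combined. The sketch below assumes .where() composes with stratify_by= the same way it does with plain sampling:

# One row per remaining class, drawn only from rows with score > 0.4
data.where(data.score > 0.4).sample(n_per_stratum=1, stratify_by=data.label, seed=42).collect()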

Persist samples as tables

# Draw train/test samples to persist as tables
train_sample = data.sample(fraction=0.8, seed=42)
test_sample = data.sample(fraction=0.2, seed=43)

# Persist as new tables
train_table = pxt.create_table('sampling_demo.train', source=train_sample)
test_table = pxt.create_table('sampling_demo.test', source=test_sample)
Created table 'train'.
Inserting rows into `train`: 9 rows [00:00, 3080.27 rows/s]
Created table 'test'.
Inserting rows into `test`: 3 rows [00:00, 1333.92 rows/s]
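
The persisted samples are ordinary Pixeltable tables, so a later session can reopen them by path. Note that the train and test samples are drawn independently with different seeds, so they are not guaranteed to be disjoint. A minimal sketch:

# Reopen the persisted samples and check their sizes
train_table = pxt.get_table('sampling_demo.train')
test_table = pxt.get_table('sampling_demo.test')
print(train_table.count(), test_table.count())  # e.g. 9 and 3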

Explanation

Sampling methods:
  • sample(n=...): returns exactly N randomly chosen rows
  • sample(fraction=...): returns a percentage of the rows
  • sample(n_per_stratum=...): returns a fixed number of rows per class (requires stratify_by=)
Stratification options:
  • stratify_by= with fraction=: samples the same fraction from each class
  • stratify_by= with n_per_stratum=: samples an equal number of rows from each class
Tips:
  • Always set seed= for reproducible experiments
  • Use stratified sampling for imbalanced datasets
  • Combine with .where() to sample from subsets (see the sketch after this list)
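
Putting the tips together, here is a sketch of a reproducible, class-balanced sample drawn from a filtered subset (assuming .where(), stratify_by=, and seed= compose as shown above):

# Reproducible, class-balanced sample from a filtered subset
balanced = (
    data
    .where(data.score > 0.4)         # restrict to rows above a threshold
    .sample(fraction=0.5,
            stratify_by=data.label,  # keep per-class proportions
            seed=42)                 # fixed seed for reproducibility
    .collect()
)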

See also