Create training, validation, and test splits with random or stratified
sampling.
Problem
You have a large dataset and need to create subsets for ML
training—random samples for quick experiments, stratified samples for
balanced classes, or reproducible splits for benchmarking.
Solution
What’s in this recipe:
- Random sampling with sample(n=...)
- Percentage-based sampling with sample(fraction=...)
- Stratified sampling with stratify_by=
Call sample() on a table or query to create random subsets, and pass
stratify_by= to control how each class is represented in the sample.
Setup
%pip install -qU pixeltable
import pixeltable as pxt

# Create a fresh directory
pxt.drop_dir('sampling_demo', force=True)
pxt.create_dir('sampling_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘sampling_demo’.
<pixeltable.catalog.dir.Dir at 0x1471b08e0>
Create sample dataset
# Create a dataset with labels
data = pxt.create_table(
'sampling_demo.data',
{'text': pxt.String, 'label': pxt.String, 'score': pxt.Float}
)
# Insert sample data with imbalanced classes
samples = [
{'text': 'Great product!', 'label': 'positive', 'score': 0.9},
{'text': 'Love it', 'label': 'positive', 'score': 0.85},
{'text': 'Amazing quality', 'label': 'positive', 'score': 0.95},
{'text': 'Best purchase ever', 'label': 'positive', 'score': 0.88},
{'text': 'Highly recommend', 'label': 'positive', 'score': 0.92},
{'text': 'Fantastic!', 'label': 'positive', 'score': 0.91},
{'text': 'Terrible', 'label': 'negative', 'score': 0.1},
{'text': 'Waste of money', 'label': 'negative', 'score': 0.15},
{'text': 'It is okay', 'label': 'neutral', 'score': 0.5},
{'text': 'Average product', 'label': 'neutral', 'score': 0.55},
]
data.insert(samples)
Created table ‘data’.
Inserting rows into `data`: 0 rows [00:00, ? rows/s]
Inserting rows into `data`: 10 rows [00:00, 857.13 rows/s]
Inserted 10 rows with 0 errors.
10 rows inserted, 20 values computed.
Random sampling
# Sample exactly N rows
data.sample(n=5, seed=42).collect()
# Sample a percentage of rows
sample_50pct = data.sample(fraction=0.5, seed=42).collect()
Stratified sampling
# Stratified sampling: 50% from each class
data.sample(fraction=0.5, stratify_by=data.label, seed=42).collect()
# Equal allocation: N rows from each class
data.sample(n_per_stratum=1, stratify_by=data.label, seed=42).collect()
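To make the idea behind stratify_by= concrete, here is a minimal plain-Python sketch of fraction-based stratified sampling. It is not Pixeltable's implementation: the helper name stratified_sample, the rounding rule, and the "at least one row per class" choice are all assumptions made for illustration. The rows mirror the 6/2/2 label imbalance of the example table.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample roughly `fraction` of the rows from each group defined by `key`.

    Hypothetical helper for illustration only; not part of Pixeltable.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    sampled = []
    for label in sorted(groups):
        members = groups[label]
        # Assumption: keep at least one row per class so small classes survive.
        k = max(1, round(len(members) * fraction))
        sampled.extend(rng.sample(members, k))
    return sampled


# Same class imbalance as the example table: 6 positive, 2 negative, 2 neutral.
labels = ['positive'] * 6 + ['negative'] * 2 + ['neutral'] * 2
rows = [{'id': i, 'label': label} for i, label in enumerate(labels)]

picked = stratified_sample(rows, 'label', fraction=0.5, seed=42)
counts = {}
for row in picked:
    counts[row['label']] = counts.get(row['label'], 0) + 1
print(counts)  # {'negative': 1, 'neutral': 1, 'positive': 3}
```

Sampling within each group keeps every class in the result, whereas a plain random sample of 5 rows from this data could miss the negative or neutral class entirely.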
Sampling from filtered data
# Sample from filtered query (high-confidence predictions only)
data.where(data.score > 0.8).sample(n=3, seed=42).collect()
Persist samples as tables
# Draw two samples for dev/test. Each sample() call draws independently,
# so the two samples can share rows; they are not a disjoint partition.
train_sample = data.sample(fraction=0.8, seed=42)
test_sample = data.sample(fraction=0.2, seed=43)
# Persist as new tables
train_table = pxt.create_table('sampling_demo.train', source=train_sample)
test_table = pxt.create_table('sampling_demo.test', source=test_sample)
Created table ‘train’.
Inserting rows into `train`: 0 rows [00:00, ? rows/s]
Inserting rows into `train`: 9 rows [00:00, 3080.27 rows/s]
Created table ‘test’.
Inserting rows into `test`: 0 rows [00:00, ? rows/s]
Inserting rows into `test`: 3 rows [00:00, 1333.92 rows/s]
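The output above shows 9 train rows and 3 test rows drawn from a 10-row table, so the two samples share at least two rows. When a benchmark needs train and test sets that do not overlap, one common approach is to shuffle once with a fixed seed and cut the shuffled list. The sketch below shows that idea in plain Python on row ids. The helper split_ids is hypothetical, and the sketch does not use or describe any Pixeltable API.

```python
import random


def split_ids(ids, train_fraction, seed=0):
    """Shuffle ids with a fixed seed, then cut once so the two parts cannot overlap.

    Hypothetical helper for illustration only; not part of Pixeltable.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_ids, test_ids = split_ids(range(10), train_fraction=0.8, seed=42)
print(len(train_ids), len(test_ids))         # 8 2
print(set(train_ids).isdisjoint(test_ids))   # True
```

Because both parts come from a single shuffle, every id lands in exactly one of them, and the same seed reproduces the same split.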
Explanation
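Sampling methods:
- sample(n=...): sample exactly N rows
- sample(fraction=...): sample a percentage of rows
Stratification options:
- stratify_by= with fraction=...: sample the same fraction from each class
- stratify_by= with n_per_stratum=...: sample N rows from each class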
Tips:
- Always set seed for reproducible experiments
- Use stratified sampling for imbalanced datasets
- Combine with .where() to sample from subsets
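The first tip can be illustrated with Python's standard random module: a generator created from the same seed always produces the same draw. This is a generic sketch of seeded sampling, with a hypothetical helper named draw, and makes no claim about how Pixeltable applies its seed internally.

```python
import random


def draw(rows, n, seed):
    """Draw n rows using a private generator seeded with `seed`.

    Hypothetical helper for illustration only; not part of Pixeltable.
    """
    return random.Random(seed).sample(rows, n)


rows = list(range(100))
first = draw(rows, 5, seed=42)
second = draw(rows, 5, seed=42)
print(first == second)  # True
```

Using a private random.Random instance rather than the module-level functions keeps the draw independent of any other code that consumes random numbers.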
See also