| Need | Method |
| --- | --- |
| Quick experiment | Random sample of N rows |
| Balanced classes | Stratified by label |
| Reproducible | Fixed seed |

| text | label | score |
| --- | --- | --- |
| Fantastic! | positive | 0.91 |
| It is okay | neutral | 0.5 |
| Average product | neutral | 0.55 |
| Highly recommend | positive | 0.92 |
| Great product! | positive | 0.9 |

| text | label | score |
| --- | --- | --- |
| Terrible | negative | 0.1 |
| It is okay | neutral | 0.5 |
| Fantastic! | positive | 0.91 |
| Highly recommend | positive | 0.92 |
| Great product! | positive | 0.9 |

| text | label | score |
| --- | --- | --- |
| Terrible | negative | 0.1 |
| It is okay | neutral | 0.5 |
| Fantastic! | positive | 0.91 |

| text | label | score |
| --- | --- | --- |
| Fantastic! | positive | 0.91 |
| Highly recommend | positive | 0.92 |
| Great product! | positive | 0.9 |

| Method | Parameter | Behavior |
| --- | --- | --- |
| Fixed count | `n=100` | Exactly 100 rows |
| Percentage | `fraction=0.1` | Roughly 10% of rows (the count is not exact) |
| Per-class | `n_per_stratum=10` | 10 from each class |

| Use case | Parameters |
| --- | --- |
| Proportional | `fraction=0.1, stratify_by=col` |
| Equal allocation | `n_per_stratum=10, stratify_by=col` |
| Reproducible | Add `seed=42` |

## Solution

**What’s in this recipe:**

* Random sampling with `sample(n=...)`
* Percentage-based sampling with `sample(fraction=...)`
* Stratified sampling with `stratify_by=`

You use `query.sample()` to create random subsets, with optional stratification for balanced class distribution.

### Setup

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
%pip install -qU pixeltable
```

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
import pixeltable as pxt

# Create a fresh directory
pxt.drop_dir('sampling_demo', force=True)
pxt.create_dir('sampling_demo')
```

  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'sampling\_demo'.

### Create sample dataset

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Create a dataset with labels
data = pxt.create_table(
    'sampling_demo/data',
    {'text': pxt.String, 'label': pxt.String, 'score': pxt.Float},
)

# Insert sample data with imbalanced classes
samples = [
    {'text': 'Great product!', 'label': 'positive', 'score': 0.9},
    {'text': 'Love it', 'label': 'positive', 'score': 0.85},
    {'text': 'Amazing quality', 'label': 'positive', 'score': 0.95},
    {'text': 'Best purchase ever', 'label': 'positive', 'score': 0.88},
    {'text': 'Highly recommend', 'label': 'positive', 'score': 0.92},
    {'text': 'Fantastic!', 'label': 'positive', 'score': 0.91},
    {'text': 'Terrible', 'label': 'negative', 'score': 0.1},
    {'text': 'Waste of money', 'label': 'negative', 'score': 0.15},
    {'text': 'It is okay', 'label': 'neutral', 'score': 0.5},
    {'text': 'Average product', 'label': 'neutral', 'score': 0.55},
]
data.insert(samples)
```

  Created table 'data'.

  Inserting rows into \`data\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`data\`: 10 rows \[00:00, 857.13 rows/s]
  Inserted 10 rows with 0 errors.
  10 rows inserted, 20 values computed.

### Random sampling

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Sample exactly N rows
data.sample(n=5, seed=42).collect()
```
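The `seed=42` argument in the call above is what makes the sample repeatable. The idea can be illustrated without Pixeltable, using only Python's standard library. This is a minimal sketch: `sample_rows` is a hypothetical helper, and it makes no claim about how Pixeltable selects rows internally.

```python
import random

rows = [
    'Great product!', 'Love it', 'Amazing quality', 'Best purchase ever',
    'Highly recommend', 'Fantastic!', 'Terrible', 'Waste of money',
    'It is okay', 'Average product',
]


def sample_rows(rows, n, seed):
    """Hypothetical helper: draw n distinct rows with a seeded generator."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return rng.sample(rows, n)


first = sample_rows(rows, n=5, seed=42)
second = sample_rows(rows, n=5, seed=42)

# The same seed and the same input always give the same sample
print(first == second)  # True
```

Without a fixed seed, each run may pick a different subset, which makes experiment results hard to compare.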

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Sample a percentage of rows
sample_50pct = data.sample(fraction=0.5, seed=42).collect()
```

### Stratified sampling

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Stratified sampling: 50% from each class
data.sample(fraction=0.5, stratify_by=data.label, seed=42).collect()
```

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Equal allocation: N rows from each class
data.sample(n_per_stratum=1, stratify_by=data.label, seed=42).collect()
```
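To make the difference between proportional and equal allocation concrete, here is a plain-Python sketch on a dataset with the same class counts as the demo table (6 positive, 2 negative, 2 neutral). The helpers `stratified_fraction` and `stratified_equal` are hypothetical and only illustrate the idea. Pixeltable's own selection logic and rounding may differ.

```python
import random
from collections import Counter, defaultdict

# Same class balance as the demo table: 6 positive, 2 negative, 2 neutral
labels = ['positive'] * 6 + ['negative'] * 2 + ['neutral'] * 2
rows = [{'id': i, 'label': label} for i, label in enumerate(labels)]


def group_by_label(rows):
    """Bucket rows by their label (one bucket per stratum)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row['label']].append(row)
    return groups


def stratified_fraction(rows, fraction, seed):
    """Proportional allocation: take the same fraction of every stratum."""
    rng = random.Random(seed)
    picked = []
    for _, members in sorted(group_by_label(rows).items()):
        k = round(len(members) * fraction)  # rounding rule is an assumption
        picked.extend(rng.sample(members, k))
    return picked


def stratified_equal(rows, n_per_stratum, seed):
    """Equal allocation: take the same number of rows from every stratum."""
    rng = random.Random(seed)
    picked = []
    for _, members in sorted(group_by_label(rows).items()):
        k = min(n_per_stratum, len(members))  # cannot take more than exist
        picked.extend(rng.sample(members, k))
    return picked


proportional = Counter(r['label'] for r in stratified_fraction(rows, 0.5, seed=42))
equal = Counter(r['label'] for r in stratified_equal(rows, 1, seed=42))

print(dict(sorted(proportional.items())))  # {'negative': 1, 'neutral': 1, 'positive': 3}
print(dict(sorted(equal.items())))         # {'negative': 1, 'neutral': 1, 'positive': 1}
```

Proportional allocation preserves the original class ratio in the sample, while equal allocation flattens it, which is usually what you want when rebalancing a skewed dataset.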

### Sampling from filtered data

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Sample from filtered query (high-confidence predictions only)
data.where(data.score > 0.8).sample(n=3, seed=42).collect()
```

### Persist samples as tables

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Define two samples for dev/test.
# They use different seeds and are drawn independently, so they may share rows.
train_sample = data.sample(fraction=0.8, seed=42)
test_sample = data.sample(fraction=0.2, seed=43)

# Persist as new tables
train_table = pxt.create_table('sampling_demo/train', source=train_sample)
test_table = pxt.create_table('sampling_demo/test', source=test_sample)
```

  Created table 'train'.

  Inserting rows into \`train\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`train\`: 9 rows \[00:00, 3080.27 rows/s]
  Created table 'test'.

  Inserting rows into \`test\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`test\`: 3 rows \[00:00, 1333.92 rows/s]
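The log above reports 9 rows in `train` and 3 in `test`, which adds up to 12 rows drawn from a 10-row table, so at least two rows appear in both. When a train/test split must be disjoint, one common approach is to shuffle the row identifiers once with a fixed seed and cut the shuffled list in two. The sketch below shows this in plain Python. `split_ids` is a hypothetical helper, not a Pixeltable function, and the integer ids stand in for whatever key identifies your rows.

```python
import random


def split_ids(ids, train_fraction, seed):
    """Hypothetical helper: shuffle once, then cut into disjoint train/test lists."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_ids, test_ids = split_ids(range(10), train_fraction=0.8, seed=42)

print(len(train_ids), len(test_ids))      # 8 2
print(set(train_ids) & set(test_ids))     # set()
```

Because both halves come from a single shuffle, every row lands in exactly one of them, and the same seed reproduces the same split.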

## Explanation

**Sampling methods:**

**Stratification options:**

**Tips:**

* Always set `seed` for reproducible experiments
* Use stratified sampling for imbalanced datasets
* Combine with `.where()` to sample from subsets

## See also

* [Export for ML training](/howto/cookbooks/data/data-export-pytorch) - PyTorch DataLoader export
* [Import Hugging Face datasets](/howto/cookbooks/data/data-import-huggingface) - Load pre-split datasets