# Sample data for training and testing

> Create train, validation, and test splits in Pixeltable using reproducible row sampling, stratification, and seeded random shuffles.

<a href="https://kaggle.com/kernels/welcome?src=https://github.com/pixeltable/pixeltable/blob/release/docs/release/howto/cookbooks/data/data-sampling.ipynb" id="openKaggle" target="_blank" rel="noopener noreferrer"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" alt="Open in Kaggle" style={{ display: 'inline', margin: '0px' }} noZoom /></a>  <a href="https://colab.research.google.com/github/pixeltable/pixeltable/blob/release/docs/release/howto/cookbooks/data/data-sampling.ipynb" id="openColab" target="_blank" rel="noopener noreferrer"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" style={{ display: 'inline', margin: '0px' }} noZoom /></a>  <a href="https://raw.githubusercontent.com/pixeltable/pixeltable/refs/tags/release/docs/release/howto/cookbooks/data/data-sampling.ipynb" id="downloadNotebook" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/%E2%AC%87-Download%20Notebook-blue" alt="Download Notebook" style={{ display: 'inline', margin: '0px' }} noZoom /></a>

<Tip>This documentation page is also available as an interactive notebook. You can launch the notebook in
Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the
above links.</Tip>

export const quartoRawHtml = [`
<table>
<thead>
<tr>
<th>Need</th>
<th>Method</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">Quick experiment</td>
<td style="vertical-align: middle;">Random sample of N rows</td>
</tr>
<tr>
<td style="vertical-align: middle;">Balanced classes</td>
<td style="vertical-align: middle;">Stratified by label</td>
</tr>
<tr>
<td style="vertical-align: middle;">Reproducible</td>
<td style="vertical-align: middle;">Fixed seed</td>
</tr>
</tbody>
</table>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">text</th>
<th data-quarto-table-cell-role="th">label</th>
<th data-quarto-table-cell-role="th">score</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">Fantastic!</td>
<td style="vertical-align: middle;">positive</td>
<td style="vertical-align: middle;">0.91</td>
</tr>
<tr>
<td style="vertical-align: middle;">It is okay</td>
<td style="vertical-align: middle;">neutral</td>
<td style="vertical-align: middle;">0.5</td>
</tr>
<tr>
<td style="vertical-align: middle;">Average product</td>
<td style="vertical-align: middle;">neutral</td>
<td style="vertical-align: middle;">0.55</td>
</tr>
<tr>
<td style="vertical-align: middle;">Highly recommend</td>
<td style="vertical-align: middle;">positive</td>
<td style="vertical-align: middle;">0.92</td>
</tr>
<tr>
<td style="vertical-align: middle;">Great product!</td>
<td style="vertical-align: middle;">positive</td>
<td style="vertical-align: middle;">0.9</td>
</tr>
</tbody>
</table>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">text</th>
<th data-quarto-table-cell-role="th">label</th>
<th data-quarto-table-cell-role="th">score</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">Terrible</td>
<td style="vertical-align: middle;">negative</td>
<td style="vertical-align: middle;">0.1</td>
</tr>
<tr>
<td style="vertical-align: middle;">It is okay</td>
<td style="vertical-align: middle;">neutral</td>
<td style="vertical-align: middle;">0.5</td>
</tr>
<tr>
<td style="vertical-align: middle;">Fantastic!</td>
<td style="vertical-align: middle;">positive</td>
<td style="vertical-align: middle;">0.91</td>
</tr>
<tr>
<td style="vertical-align: middle;">Highly recommend</td>
<td style="vertical-align: middle;">positive</td>
<td style="vertical-align: middle;">0.92</td>
</tr>
<tr>
<td style="vertical-align: middle;">Great product!</td>
<td style="vertical-align: middle;">positive</td>
<td style="vertical-align: middle;">0.9</td>
</tr>
</tbody>
</table>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">text</th>
<th data-quarto-table-cell-role="th">label</th>
<th data-quarto-table-cell-role="th">score</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">Terrible</td>
<td style="vertical-align: middle;">negative</td>
<td style="vertical-align: middle;">0.1</td>
</tr>
<tr>
<td style="vertical-align: middle;">It is okay</td>
<td style="vertical-align: middle;">neutral</td>
<td style="vertical-align: middle;">0.5</td>
</tr>
<tr>
<td style="vertical-align: middle;">Fantastic!</td>
<td style="vertical-align: middle;">positive</td>
<td style="vertical-align: middle;">0.91</td>
</tr>
</tbody>
</table>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">text</th>
<th data-quarto-table-cell-role="th">label</th>
<th data-quarto-table-cell-role="th">score</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">Fantastic!</td>
<td style="vertical-align: middle;">positive</td>
<td style="vertical-align: middle;">0.91</td>
</tr>
<tr>
<td style="vertical-align: middle;">Highly recommend</td>
<td style="vertical-align: middle;">positive</td>
<td style="vertical-align: middle;">0.92</td>
</tr>
<tr>
<td style="vertical-align: middle;">Great product!</td>
<td style="vertical-align: middle;">positive</td>
<td style="vertical-align: middle;">0.9</td>
</tr>
</tbody>
</table>
`, `
<table>
<thead>
<tr>
<th>Method</th>
<th>Parameter</th>
<th>Behavior</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">Fixed count</td>
<td style="vertical-align: middle;"><code>n=100</code></td>
<td style="vertical-align: middle;">Exactly 100 rows</td>
</tr>
<tr>
<td style="vertical-align: middle;">Percentage</td>
<td style="vertical-align: middle;"><code>fraction=0.1</code></td>
<td style="vertical-align: middle;">About 10% of rows</td>
</tr>
<tr>
<td style="vertical-align: middle;">Per-class</td>
<td style="vertical-align: middle;"><code>n_per_stratum=10</code></td>
<td style="vertical-align: middle;">10 from each class</td>
</tr>
</tbody>
</table>
`, `
<table>
<thead>
<tr>
<th>Use case</th>
<th>Parameters</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">Proportional</td>
<td style="vertical-align: middle;"><code>fraction=0.1, stratify_by=col</code></td>
</tr>
<tr>
<td style="vertical-align: middle;">Equal allocation</td>
<td style="vertical-align: middle;"><code>n_per_stratum=10, stratify_by=col</code></td>
</tr>
<tr>
<td style="vertical-align: middle;">Reproducible</td>
<td style="vertical-align: middle;">Add <code>seed=42</code></td>
</tr>
</tbody>
</table>
`];

Create training, validation, and test splits with random or stratified
sampling.

## Problem

You have a large dataset and need to create subsets for ML training:
random samples for quick experiments, stratified samples for balanced
classes, or reproducible splits for benchmarking.

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[0] }} />

## Solution

**What’s in this recipe:**

* Random sampling with `sample(n=...)`
* Percentage-based sampling with `sample(fraction=...)`
* Stratified sampling with `stratify_by=`

Call `sample()` on a table or a filtered query to draw a random subset.
Pass `stratify_by=` to sample within each class, and `seed=` to make the
draw reproducible.

### Setup

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
%pip install -qU pixeltable
```

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
import pixeltable as pxt

# Create a fresh directory
pxt.drop_dir('sampling_demo', force=True)
pxt.create_dir('sampling_demo')
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'sampling\_demo'.
  \<pixeltable.catalog.dir.Dir at 0x1471b08e0>
</pre>

### Create sample dataset

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Create a dataset with labels
data = pxt.create_table(
    'sampling_demo/data',
    {'text': pxt.String, 'label': pxt.String, 'score': pxt.Float},
)

# Insert sample data with imbalanced classes
samples = [
    {'text': 'Great product!', 'label': 'positive', 'score': 0.9},
    {'text': 'Love it', 'label': 'positive', 'score': 0.85},
    {'text': 'Amazing quality', 'label': 'positive', 'score': 0.95},
    {'text': 'Best purchase ever', 'label': 'positive', 'score': 0.88},
    {'text': 'Highly recommend', 'label': 'positive', 'score': 0.92},
    {'text': 'Fantastic!', 'label': 'positive', 'score': 0.91},
    {'text': 'Terrible', 'label': 'negative', 'score': 0.1},
    {'text': 'Waste of money', 'label': 'negative', 'score': 0.15},
    {'text': 'It is okay', 'label': 'neutral', 'score': 0.5},
    {'text': 'Average product', 'label': 'neutral', 'score': 0.55},
]
data.insert(samples)
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Created table 'data'.

  Inserting rows into \`data\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`data\`: 10 rows \[00:00, 857.13 rows/s]
  Inserted 10 rows with 0 errors.
  10 rows inserted, 20 values computed.
</pre>

### Random sampling

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Sample exactly N rows
data.sample(n=5, seed=42).collect()
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[1] }} />

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Sample a fraction of the rows (about 50% here)
sample_50pct = data.sample(fraction=0.5, seed=42).collect()
```
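
As a plain-Python analogy for what a fixed seed buys you, the sketch below draws the same sample twice from a seeded generator. It uses the standard library's `random` module rather than Pixeltable's sampler, and the helper name `seeded_sample` is hypothetical.

```python
import random

# Stand-in rows; in the recipe these live in the `data` table.
texts = [
    'Great product!', 'Love it', 'Amazing quality', 'Best purchase ever',
    'Highly recommend', 'Fantastic!', 'Terrible', 'Waste of money',
    'It is okay', 'Average product',
]


def seeded_sample(items, n, seed):
    """Draw n items without replacement from a private, seeded generator."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return rng.sample(items, n)


first = seeded_sample(texts, n=5, seed=42)
second = seeded_sample(texts, n=5, seed=42)
other = seeded_sample(texts, n=5, seed=7)

# Same seed and same input order give the identical draw.
print(first == second)
# A different seed is free to pick a different subset.
print(other)
```

The same idea applies to `sample(..., seed=42)` in the recipe: keeping the seed and the input fixed lets you rerun an experiment on the same rows.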

### Stratified sampling

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Stratified sampling: 50% from each class
data.sample(fraction=0.5, stratify_by=data.label, seed=42).collect()
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[2] }} />

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Equal allocation: N rows from each class
data.sample(n_per_stratum=1, stratify_by=data.label, seed=42).collect()
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[3] }} />
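
To make the two stratification modes concrete, here is a minimal plain-Python sketch of the idea: group rows by label, then sample inside each group. It is not Pixeltable's implementation. The helper `stratified_sample` is hypothetical, and rounding the per-class count with `round()` is an assumption made for illustration.

```python
import random
from collections import Counter, defaultdict

# (text, label) pairs mirroring the imbalanced demo data: 6 / 2 / 2.
rows = [
    ('Great product!', 'positive'), ('Love it', 'positive'),
    ('Amazing quality', 'positive'), ('Best purchase ever', 'positive'),
    ('Highly recommend', 'positive'), ('Fantastic!', 'positive'),
    ('Terrible', 'negative'), ('Waste of money', 'negative'),
    ('It is okay', 'neutral'), ('Average product', 'neutral'),
]


def stratified_sample(rows, fraction=None, n_per_stratum=None, seed=None):
    """Sample within each label group, proportionally or with a fixed count."""
    if (fraction is None) == (n_per_stratum is None):
        raise ValueError('pass exactly one of fraction or n_per_stratum')
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    picked = []
    for label in sorted(groups):  # fixed group order keeps the draw reproducible
        members = groups[label]
        if fraction is not None:
            k = round(len(members) * fraction)  # assumption: simple rounding
        else:
            k = min(n_per_stratum, len(members))
        picked.extend(rng.sample(members, k))
    return picked


proportional = stratified_sample(rows, fraction=0.5, seed=42)
equal = stratified_sample(rows, n_per_stratum=1, seed=42)

# Proportional keeps the 6 : 2 : 2 class ratio; equal allocation flattens it.
print(Counter(label for _, label in proportional))
print(Counter(label for _, label in equal))
```

Proportional sampling preserves the class imbalance at a smaller scale, while equal allocation gives every class the same weight regardless of its size.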

### Sampling from filtered data

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Sample from filtered query (high-confidence predictions only)
data.where(data.score > 0.8).sample(n=3, seed=42).collect()
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[4] }} />

### Persist samples as tables

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
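# Draw two samples for dev/test use.
# Note: the samples are drawn independently, so they are not a disjoint
# split. The same row can land in both (the output below shows 9 + 3 rows
# taken from a 10-row table).
train_sample = data.sample(fraction=0.8, seed=42)
test_sample = data.sample(fraction=0.2, seed=43)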

# Persist as new tables
train_table = pxt.create_table('sampling_demo/train', source=train_sample)
test_table = pxt.create_table('sampling_demo/test', source=test_sample)
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Created table 'train'.

  Inserting rows into \`train\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`train\`: 9 rows \[00:00, 3080.27 rows/s]
  Created table 'test'.

  Inserting rows into \`test\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`test\`: 3 rows \[00:00, 1333.92 rows/s]
</pre>
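
If you need train and test sets that share no rows, one common approach is a seeded shuffle followed by a cut. The sketch below shows that technique in plain Python. It is an illustration rather than a Pixeltable feature, and `split_rows` is a hypothetical helper.

```python
import random

# Stand-in rows; in the recipe these live in the `data` table.
texts = [
    'Great product!', 'Love it', 'Amazing quality', 'Best purchase ever',
    'Highly recommend', 'Fantastic!', 'Terrible', 'Waste of money',
    'It is okay', 'Average product',
]


def split_rows(rows, train_fraction, seed):
    """Shuffle a copy of rows with a seeded generator, then cut it in two."""
    shuffled = list(rows)  # copy so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_rows(texts, train_fraction=0.8, seed=42)

# Every row lands in exactly one of the two lists.
print(len(train), len(test))
print(set(train) & set(test))
```

Because both parts come from a single shuffled list, no row can appear in both. Each list could then be loaded into its own table with `pxt.create_table(...)` and `insert(...)`, as in the dataset setup step.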

## Explanation

**Sampling methods:**

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[5] }} />

**Stratification options:**

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[6] }} />

**Tips:**

* Always set `seed` for reproducible experiments
* Use stratified sampling for imbalanced datasets
* Combine with `.where()` to sample from subsets

## See also

* [Export for ML training](/howto/cookbooks/data/data-export-pytorch): PyTorch DataLoader export
* [Import Hugging Face datasets](/howto/cookbooks/data/data-import-huggingface): Load pre-split datasets
