What is Sampling?

Sampling in Pixeltable allows you to select a subset of rows from a table or view. This is a crucial technique in data analysis and machine learning for creating smaller, manageable datasets that are representative of the whole.

You would use sampling to:

  • Explore Data: Quickly get a feel for a large dataset without processing all of it.
  • Develop and Test: Create smaller datasets for faster development cycles and testing of data pipelines.
  • Train ML Models: Generate balanced or representative training sets, especially with large and imbalanced data.
  • Reduce Computational Cost: Perform expensive computations on a smaller subset of data.

Pixeltable provides several methods for sampling, including random sampling of a fixed number of rows, sampling a fraction of the data, and stratified sampling to ensure representation across different subgroups.

import pixeltable as pxt
import random

# Load a sample dataset
pop_data = pxt.create_table(
    'population_data',
    {
        'country': pxt.String,
        'continent': pxt.String,
        'pop_2023': pxt.Int,
    },
    if_exists='ignore',
)

country = random.choices(['USA', 'Canada', 'Mexico'], k=10)
continent = random.choices(['North America', 'South America'], k=10)
pop_2023 = random.choices(range(0, 1000000), k=10)
data = []

for i in range(10):
    data.append({
        'country': country[i],
        'continent': continent[i],
        'pop_2023': pop_2023[i],
    })

pop_data.insert(data)

# Get a random sample of 10 countries
sample_df = pop_data.sample(n=10, seed=42)
sample_df.collect()

Basic Sampling Methods

Pixeltable offers simple ways to draw random samples from your data.

Use n to get a fixed number of randomly selected rows. This is useful when you need a dataset of a specific size.

# Get a random sample of 5 rows
random_sample = pop_data.sample(n=5, seed=123)
random_sample.collect()

Stratified Sampling

Stratified sampling is an advanced technique that ensures subgroups within your data are represented proportionally in the sample. You can stratify your data based on one or more columns.

Usage and Limitations

The sample() operation has specific rules about how it can be used in a query chain.

Chaining with `where()`

You can apply a where() clause before sample() to filter the data before sampling. This is the most common way to chain operations with sample().

# Sample 5 countries from Asia
asian_countries_sample = pop_data.where(
    pop_data.continent == 'Asia'
).sample(n=5, seed=42)

asian_countries_sample.collect()

Creating Snapshots and Tables

A common use case for sample() is to create a smaller, persistent snapshot or a new table for development, testing, or analysis.

# Create a sample DataFrame
sample_df = pop_data.sample(fraction=0.1, seed=42)

# Create a persistent snapshot from the sample
pxt.create_snapshot('population_sample_snapshot', sample_df, if_exists='replace')

# Create a new table from the sample
pxt.create_table('population_sample_table', source=sample_df, if_exists='replace')

Limitations

The sample() operation cannot be chained with most other DataFrame operations like join(), group_by(), order_by(), or limit(). It also cannot be used to create a view. These limitations exist to ensure the statistical properties of the sample are well-defined.

# This will raise an error
try:
    pop_data.order_by(pop_data.pop_2023).sample(n=10)
except Exception as e:
    print(e)

Key Concepts

Random Sampling

Selects rows randomly from the entire dataset, either a fixed number or a fraction.

Stratified Sampling

Divides the data into subgroups (strata) and samples from each, ensuring representation.

Reproducibility

Using a seed ensures that you get the same sample every time, which is crucial for experiments.

Performance

Sampling is a highly optimized operation that can be performed efficiently on very large datasets.