Sampling Data
Learn how to create representative samples from your data for analysis, testing, and machine learning.
What is Sampling?
Sampling in Pixeltable allows you to select a subset of rows from a table or view. This is a crucial technique in data analysis and machine learning for creating smaller, manageable datasets that are representative of the whole.
You would use sampling to:
- Explore Data: Quickly get a feel for a large dataset without processing all of it.
- Develop and Test: Create smaller datasets for faster development cycles and testing of data pipelines.
- Train ML Models: Generate balanced or representative training sets, especially with large and imbalanced data.
- Reduce Computational Cost: Perform expensive computations on a smaller subset of data.
Pixeltable provides several methods for sampling, including random sampling of a fixed number of rows, sampling a fraction of the data, and stratified sampling to ensure representation across different subgroups.
Basic Sampling Methods
Pixeltable offers simple ways to draw random samples from your data.
Use `n` to get a fixed number of randomly selected rows. This is useful when you need a dataset of a specific size.
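As a rough illustration of fixed-size sampling, the sketch below draws `n` rows from a plain Python list using only the standard library. It is not Pixeltable's implementation: `sample_n` and the toy `rows` list are hypothetical names, and sampling without replacement is an assumption. In Pixeltable the corresponding call would presumably look like `t.sample(n=10)`, where `t` stands for a table handle.

```python
import random

def sample_n(rows, n, seed=None):
    """Return up to n rows chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return rng.sample(rows, min(n, len(rows)))  # cap n at the number of rows

# Toy stand-in for a table: 100 rows with a single id column.
rows = [{"id": i} for i in range(100)]
subset = sample_n(rows, n=10, seed=42)
print(len(subset))  # 10
```

Capping `n` at the number of rows is a design choice of this sketch; a real system might raise an error instead.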
Use `fraction` to get a proportion of the total rows. This is useful for scaling down a dataset proportionally. The value must be between 0.0 and 1.0; for example, 0.1 corresponds to 10% of the rows.
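The idea of fractional sampling can be sketched in plain Python as follows. `sample_fraction` is a hypothetical helper, not a Pixeltable function. Whether Pixeltable returns an exact or an approximate row count for a given fraction is not stated here, so this sketch simply rounds to a fixed count.

```python
import random

def sample_fraction(rows, fraction, seed=None):
    """Return about fraction * len(rows) rows, chosen without replacement."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    rng = random.Random(seed)
    k = round(len(rows) * fraction)  # target sample size
    return rng.sample(rows, k)

rows = list(range(200))
quarter = sample_fraction(rows, 0.25, seed=7)
print(len(quarter))  # 50
```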
The `seed` parameter ensures that your sampling is deterministic. Using the same seed will always produce the same sample, which is critical for reproducible experiments and tests.
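The effect of a seed can be demonstrated with Python's standard `random` module. This is only an analogy for how seeded sampling behaves, and `seeded_sample` is a hypothetical helper. Two draws with the same seed over the same data return identical results, while an unseeded draw may change from run to run.

```python
import random

def seeded_sample(rows, n, seed):
    """Draw n rows using a generator initialised from the given seed."""
    return random.Random(seed).sample(rows, n)

rows = list(range(1000))
first = seeded_sample(rows, 5, seed=123)
second = seeded_sample(rows, 5, seed=123)
unseeded = random.sample(rows, 5)  # uses global state; may differ on every run

print(first == second)  # True
```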
Stratified Sampling
Stratified sampling is an advanced technique that ensures subgroups within your data are represented proportionally in the sample. You can stratify your data based on one or more columns.
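A minimal sketch of proportional stratified sampling, written in plain Python rather than with Pixeltable: rows are grouped by the values of one or more columns, and the same fraction is drawn from every group. The function name `stratified_sample`, its `stratify_by` argument, and the choice to keep at least one row per stratum are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, stratify_by, fraction, seed=None):
    """Sample the same fraction from each stratum so group proportions are kept."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        # A stratum is identified by the tuple of values in the chosen columns.
        key = tuple(row[col] for col in stratify_by)
        strata[key].append(row)
    sample = []
    for key in sorted(strata):  # fixed order keeps seeded runs reproducible
        group = strata[key]
        k = max(1, round(len(group) * fraction))  # at least one row per stratum
        sample.extend(rng.sample(group, k))
    return sample

# Imbalanced toy data: 20 "cat" rows and 80 "dog" rows.
rows = [{"id": i, "label": "cat" if i < 20 else "dog"} for i in range(100)]
picked = stratified_sample(rows, stratify_by=["label"], fraction=0.1, seed=1)
counts = Counter(r["label"] for r in picked)
print(counts["cat"], counts["dog"])  # 2 8
```

The 20/80 split of the source data is preserved in the sample, which is what a plain random draw of 10 rows would not guarantee.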
Usage and Limitations
The `sample()` operation has specific rules about how it can be used in a query chain.
Chaining with `where()`
You can apply a `where()` clause before `sample()` to filter the data before sampling. This is the most common way to chain operations with `sample()`.
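The filter-then-sample order can be illustrated with a small plain-Python analogy. The `where` and `sample` functions below are hypothetical stand-ins that operate on lists, not Pixeltable's methods. In Pixeltable the chain would presumably read something like `t.where(...).sample(n=5)`, with the exact filter expression syntax left as an assumption.

```python
import random

def where(rows, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]

def sample(rows, n, seed=None):
    """Draw up to n rows at random from the rows it is given."""
    return random.Random(seed).sample(rows, min(n, len(rows)))

rows = [{"id": i, "score": i % 10} for i in range(100)]

# Filter first, then sample: every sampled row satisfies the filter.
high_scores = where(rows, lambda r: r["score"] >= 8)  # 20 rows remain
picked = sample(high_scores, n=5, seed=0)
print(all(r["score"] >= 8 for r in picked))  # True
```

Because the filter runs first, the sample size applies to the filtered rows rather than to the whole table.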
Creating Snapshots and Tables
A common use case for `sample()` is to create a smaller, persistent snapshot or a new table for development, testing, or analysis.
Limitations
The `sample()` operation cannot be chained with most other DataFrame operations, such as `join()`, `group_by()`, `order_by()`, or `limit()`. It also cannot be used to create a view. These limitations exist to ensure the statistical properties of the sample are well-defined.
Key Concepts
Random Sampling
Selects rows randomly from the entire dataset, either a fixed number or a fraction.
Stratified Sampling
Divides the data into subgroups (strata) and samples from each, ensuring representation.
Reproducibility
Using a `seed` ensures that you get the same sample every time, which is crucial for experiments.
Performance
Sampling is a highly optimized operation that can be performed efficiently on very large datasets.