Learn how to create representative samples from your data for analysis, testing, and machine learning.
Sampling in Pixeltable allows you to select a subset of rows from a table or view. This is a crucial technique in data analysis and machine learning for creating smaller, manageable datasets that are representative of the whole.
You would use sampling to:
Pixeltable provides several methods for sampling, including random sampling of a fixed number of rows, sampling a fraction of the data, and stratified sampling to ensure representation across different subgroups.
Pixeltable offers simple ways to draw random samples from your data.
Use `n` to get a fixed number of randomly selected rows. This is useful when you need a dataset of a specific size.
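As a rough illustration of what drawing a fixed number of rows means, the sketch below uses plain Python rather than Pixeltable itself. The `rows` list and the `sample_n` helper are hypothetical stand-ins for a table and are not part of the Pixeltable API.

```python
import random

# Hypothetical stand-in for a table: a list of row dicts.
rows = [{"id": i, "score": i % 7} for i in range(1000)]

def sample_n(rows, n, seed=None):
    """Return up to n rows chosen uniformly at random, without replacement."""
    rng = random.Random(seed)
    # Cap n at the population size so a small table does not raise ValueError.
    return rng.sample(rows, min(n, len(rows)))

subset = sample_n(rows, n=100, seed=42)
print(len(subset))  # 100
```

Every row has the same chance of being selected, and no row appears twice in the result.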
Stratified sampling is an advanced technique that ensures each subgroup within your data is represented in the sample, either in proportion to its size or with a fixed number of rows per subgroup. You can stratify your data based on one or more columns.
Sample `n_per_stratum`
This method samples a fixed number of rows from each subgroup (stratum). This is useful for ensuring that even small subgroups are represented in your sample.
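The idea behind a fixed count per stratum can be sketched in plain Python. This is not Pixeltable code: the `label` column, the row data, and the `sample_n_per_stratum` helper are assumptions made up for the example.

```python
import random
from collections import Counter, defaultdict

# Hypothetical, deliberately imbalanced data: 900 cats, 90 dogs, 10 owls.
rows = [
    {"id": i, "label": "cat" if i < 900 else "dog" if i < 990 else "owl"}
    for i in range(1000)
]

def sample_n_per_stratum(rows, key, n_per_stratum, seed=None):
    """Draw up to n_per_stratum rows from each distinct value of `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for label in sorted(strata):
        group = strata[label]
        # A stratum smaller than n_per_stratum contributes all of its rows.
        sample.extend(rng.sample(group, min(n_per_stratum, len(group))))
    return sample

sample = sample_n_per_stratum(rows, key="label", n_per_stratum=20, seed=0)
counts = Counter(row["label"] for row in sample)
print(counts["cat"], counts["dog"], counts["owl"])  # 20 20 10
```

The rare `owl` group is fully retained, while the dominant `cat` group is cut down to the same cap as the others.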
Sample `n` with stratification
This method samples a total of `n` rows, with the number of rows from each stratum proportional to its size in the original dataset.
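Proportional allocation can be worked through with a small plain-Python sketch. The text above does not say how fractional shares are rounded, so the largest-remainder rule used here is an assumption of the example, as are the helper names and stratum sizes.

```python
import random
from collections import defaultdict

def proportional_allocation(sizes, n):
    """Split n across strata in proportion to size (largest-remainder rounding)."""
    total = sum(sizes.values())
    exact = {k: n * size / total for k, size in sizes.items()}
    alloc = {k: int(share) for k, share in exact.items()}  # round down first
    leftover = n - sum(alloc.values())
    # Assumption: leftover rows go to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc

def sample_n_stratified(rows, key, n, seed=None):
    """Draw n rows in total (n must not exceed len(rows)), split proportionally."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    alloc = proportional_allocation({k: len(v) for k, v in strata.items()}, n)
    return [row for k in sorted(strata) for row in rng.sample(strata[k], alloc[k])]

sizes = {"cat": 900, "dog": 90, "owl": 10}
print(proportional_allocation(sizes, 100))  # {'cat': 90, 'dog': 9, 'owl': 1}
```

With a small `n`, proportional allocation can leave a tiny stratum with zero rows, which is the situation where a fixed count per stratum is the better fit.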
Sample `fraction` with stratification
This method samples a fraction of rows from each stratum.
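A plain-Python sketch of the same idea, for illustration only: each stratum contributes its own share of rows. The helper name, the `label` column, and the choice to round each stratum's share to the nearest integer are assumptions of this example, not documented Pixeltable behavior.

```python
import random
from collections import Counter, defaultdict

# Hypothetical data: 900 cats, 90 dogs, 10 owls.
rows = [
    {"id": i, "label": "cat" if i < 900 else "dog" if i < 990 else "owl"}
    for i in range(1000)
]

def sample_fraction_per_stratum(rows, key, fraction, seed=None):
    """Draw roughly `fraction` (between 0 and 1) of the rows of every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for label in sorted(strata):
        group = strata[label]
        # Assumption: round each stratum's share to the nearest whole row.
        k = round(fraction * len(group))
        sample.extend(rng.sample(group, k))
    return sample

sample = sample_fraction_per_stratum(rows, key="label", fraction=0.1, seed=0)
counts = Counter(row["label"] for row in sample)
print(counts["cat"], counts["dog"], counts["owl"])  # 90 9 1
```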
The `sample()` operation has specific rules about how it can be used in a query chain.
You can apply a `where()` clause before `sample()` to filter the data before sampling. This is the most common way to chain operations with `sample()`.
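The order of operations matters: the filter defines the population, and the sample is drawn only from rows that pass it. The sketch below shows this in plain Python; the `split` column and the `filter_then_sample` helper are hypothetical and stand in for a `where()` followed by a `sample()`.

```python
import random

# Hypothetical rows: 400 tagged "train" and 100 tagged "test".
rows = [{"id": i, "split": "train" if i % 5 else "test"} for i in range(500)]

def filter_then_sample(rows, predicate, n, seed=None):
    """Keep rows matching `predicate`, then draw up to n of them at random."""
    rng = random.Random(seed)
    candidates = [row for row in rows if predicate(row)]
    return rng.sample(candidates, min(n, len(candidates)))

train_subset = filter_then_sample(
    rows, lambda row: row["split"] == "train", n=50, seed=0
)
print(len(train_subset))  # 50
```

Every row in `train_subset` satisfies the filter, because rows that fail it never enter the pool being sampled.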
A common use case for `sample()` is to create a smaller, persistent snapshot or a new table for development, testing, or analysis.
The `sample()` operation cannot be chained with most other DataFrame operations like `join()`, `group_by()`, `order_by()`, or `limit()`. It also cannot be used to create a view. These limitations exist to ensure the statistical properties of the sample are well-defined.
Selects rows randomly from the entire dataset, either a fixed number or a fraction.
Divides the data into subgroups (strata) and samples from each, ensuring representation.
Using a `seed` ensures that you get the same sample every time, which is crucial for experiments.
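The effect of a seed can be demonstrated with Python's standard `random` module. This illustrates the general principle of seeded pseudo-random sampling, not Pixeltable's internals; `sample_ids` is a hypothetical helper.

```python
import random

def sample_ids(ids, n, seed):
    """Draw n ids using a private generator initialized with `seed`."""
    return random.Random(seed).sample(ids, n)

ids = list(range(10_000))
first = sample_ids(ids, 20, seed=123)
second = sample_ids(ids, 20, seed=123)

# The same seed over the same data reproduces the same rows in the same order.
print(first == second)  # True
```

Reproducibility holds only while the underlying data and its order stay the same; if rows are added or removed, the same seed can select a different sample.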
Sampling is a highly optimized operation that can be performed efficiently on very large datasets.