Learn how to create representative samples from your data for analysis, testing, and machine learning.
n
to get a fixed number of randomly selected rows. This is useful when you need a dataset of a specific size.Sample `n_per_stratum`
Sample `n` with stratification
n
rows, with the number of rows from each stratum proportional to its size in the original dataset.Sample `fraction` with stratification
sample()
operation has specific rules about how it can be used in a query chain.
where()
clause before sample()
to filter the data before sampling. This is the most common way to chain operations with sample()
.sample()
is to create a smaller, persistent snapshot or a new table for development, testing, or analysis.sample()
operation cannot be chained with most other DataFrame operations like join()
, group_by()
, order_by()
, or limit()
. It also cannot be used to create a view
. These limitations exist to ensure the statistical properties of the sample are well-defined.seed
ensures that you get the same sample every time, which is crucial for experiments.