Methods
collect()
Signature:
distinct()
Remove duplicate rows from this DataFrame.
Note that rows are grouped based on the select clause of this DataFrame. In the absence of a select clause, all columns are included in the grouping by default.
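The effect of the select clause on de-duplication can be illustrated with a small plain-Python model. This is only a sketch of the semantics described above, not the library's implementation: `distinct_rows`, the list-of-dicts row representation, and the sample data are all hypothetical, and the output order shown here is an artifact of the sketch.

```python
def distinct_rows(rows, selected=None):
    """Model of de-duplication over the selected columns.

    `rows` is a list of dicts. `selected` is the list of column names in the
    select clause, or None to stand for "no select clause" (all columns).
    """
    seen = set()
    result = []
    for row in rows:
        columns = selected if selected is not None else list(row)
        key = tuple(row[c] for c in columns)  # grouping key built from the selected columns
        if key not in seen:
            seen.add(key)
            result.append({c: row[c] for c in columns})
    return result


# Hypothetical sample data.
rows = [
    {'city': 'Oakland', 'state': 'CA'},
    {'city': 'Oakland', 'state': 'CA'},
    {'city': 'Fresno', 'state': 'CA'},
]
all_columns = distinct_rows(rows)           # duplicates judged on every column
state_only = distinct_rows(rows, ['state'])  # duplicates judged on 'state' alone
print(len(all_columns), len(state_only))
```

With no select clause the duplicate Oakland row is dropped; selecting only `state` collapses all three rows into one.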
Signature:
group_by()
Add a group-by clause to this DataFrame.
Variants:
- group_by(base table): group a component view by their respective base table rows
- group_by(expr, …): group by the given expressions
grouping_items (Any): expressions to group by
- DataFrame: A new DataFrame with the specified group-by clause.
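As a rough illustration of grouping by expressions, the following plain-Python sketch buckets rows by the values of one or more key functions. It is a model of the idea only: `group_rows`, the key functions, and the sample rows are hypothetical and are not part of the documented API.

```python
from collections import defaultdict


def group_rows(rows, *key_funcs):
    """Bucket rows by the tuple of values produced by the key functions."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(f(row) for f in key_funcs)
        groups[key].append(row)
    return dict(groups)


# Hypothetical sample data.
people = [
    {'name': 'a', 'age': 17},
    {'name': 'b', 'age': 30},
    {'name': 'c', 'age': 42},
]
# Group by the expression "age >= 18" and count the members of each group.
groups = group_rows(people, lambda r: r['age'] >= 18)
counts = {key: len(members) for key, members in groups.items()}
print(counts)
```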
head()
Return the first n rows of the DataFrame, in insertion order of the underlying Table.
head() is not supported for joins.
Signature:
n (int) = 10: Number of rows to select. Default is 10.
- DataFrameResultSet: A DataFrameResultSet with the first n rows of the DataFrame.
join()
Join this DataFrame with a table.
Signature:
- other (catalog.Table): the table to join with
- on (exprs.Expr | Sequence[exprs.ColumnRef] | None): the join condition, which can be either a) references to one or more columns or b) a boolean expression.
  - column references: implies an equality predicate that matches columns in both this DataFrame and other by name.
    - column in other: A column with that same name must be present in this DataFrame, and it must be unique (otherwise the join is ambiguous).
    - column in this DataFrame: A column with that same name must be present in other.
  - boolean expression: The expressions must be valid in the context of the joined tables.
- how (plan.JoinType.LiteralType) = 'inner': the type of join to perform.
  - 'inner': only keep rows that have a match in both
  - 'left': keep all rows from this DataFrame and only matching rows from the other table
  - 'right': keep all rows from the other table and only matching rows from this DataFrame
  - 'full_outer': keep all rows from both this DataFrame and the other table
  - 'cross': Cartesian product; no on condition allowed
- DataFrame: A new DataFrame.
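The join types listed above can be modeled in plain Python to show which rows each one keeps. This is a sketch under stated assumptions, not the library's implementation: `join_rows` is hypothetical, rows are dicts, the `on` condition is modeled as a Python predicate over a pair of rows, and a missing side is represented as `None`.

```python
from itertools import product


def join_rows(left, right, on=None, how='inner'):
    """Return (left_row, right_row) pairs; None stands for a missing side."""
    if how == 'cross':
        if on is not None:
            raise ValueError("'cross' does not accept an on condition")
        return list(product(left, right))
    if how not in ('inner', 'left', 'right', 'full_outer'):
        raise ValueError(f'unknown join type: {how!r}')
    if on is None:
        # Assumption of this sketch: every non-cross join needs a condition.
        raise ValueError('an on condition is required in this sketch')

    pairs = []
    matched_right = set()
    for l in left:
        matches = [i for i, r in enumerate(right) if on(l, r)]
        matched_right.update(matches)
        pairs.extend((l, right[i]) for i in matches)
        if not matches and how in ('left', 'full_outer'):
            pairs.append((l, None))  # keep the unmatched left row
    if how in ('right', 'full_outer'):
        # keep right rows that never matched
        pairs.extend((None, r) for i, r in enumerate(right) if i not in matched_right)
    return pairs


# Hypothetical sample data: only id == 2 appears on both sides.
t1 = [{'id': 1}, {'id': 2}]
t2 = [{'id': 2}, {'id': 3}]
same_id = lambda l, r: l['id'] == r['id']

for how in ('inner', 'left', 'right', 'full_outer'):
    print(how, len(join_rows(t1, t2, on=same_id, how=how)))
print('cross', len(join_rows(t1, t2, how='cross')))
```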
Note that on=t3.id cannot be specified when this DataFrame already joins t1 and t2, because that would be ambiguous, since both t1 and t2 have a column named id.
limit()
Limit the number of rows in the DataFrame.
Signature:
n (int): Number of rows to select.
- DataFrame: A new DataFrame limited to the specified number of rows.
order_by()
Add an order-by clause to this DataFrame.
Signature:
- expr_list (exprs.Expr): expressions to order by
- asc (bool) = True: whether to order in ascending order (True) or descending order (False). Default is True.
- DataFrame: A new DataFrame with the specified order-by clause.
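The role of the asc flag can be shown with a short plain-Python model. `order_rows` and the sample rows are hypothetical, and the sketch assumes a single asc value applies to every ordering expression in the call.

```python
def order_rows(rows, *key_funcs, asc=True):
    """Sort rows by the tuple of key values; asc=False reverses the order."""
    return sorted(
        rows,
        key=lambda row: tuple(f(row) for f in key_funcs),
        reverse=not asc,
    )


# Hypothetical sample data.
people = [
    {'name': 'a', 'age': 30},
    {'name': 'b', 'age': 17},
    {'name': 'c', 'age': 42},
]
ascending = order_rows(people, lambda r: r['age'])
descending = order_rows(people, lambda r: r['age'], asc=False)
print([r['age'] for r in ascending], [r['age'] for r in descending])
```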
sample()
Return a new DataFrame specifying a sample of rows from the DataFrame, considered in a shuffled order.
The size of the sample can be specified in three ways:
- n: the total number of rows to produce as a sample
- n_per_stratum: the number of rows to produce per stratum as a sample
- fraction: the fraction of available rows to produce as a sample
- n (Optional[int]): Total number of rows to produce as a sample.
- n_per_stratum (Optional[int]): Number of rows to produce per stratum as a sample. This parameter is only valid if stratify_by is specified. Only one of n or n_per_stratum can be specified.
- fraction (Optional[float]): Fraction of available rows to produce as a sample. This parameter is not usable with n or n_per_stratum. The fraction must be between 0.0 and 1.0.
- seed (Optional[int]): Random seed for reproducible shuffling
- stratify_by (Any): If specified, the sample will be stratified by these values.
- DataFrame: A new DataFrame which specifies the sampled rows
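The parameter rules above can be made concrete with a plain-Python model of shuffled sampling. This is a sketch under stated assumptions rather than the library's algorithm: `sample_rows` is hypothetical, stratify_by is modeled as a function of a row, a fractional row count is truncated, and when stratify_by is given the sketch only models n_per_stratum, because the text above does not say how n or fraction are distributed across strata.

```python
import random
from collections import defaultdict


def sample_rows(rows, n=None, n_per_stratum=None, fraction=None,
                seed=None, stratify_by=None):
    # Parameter combinations, as described in the parameter list.
    if n_per_stratum is not None and stratify_by is None:
        raise ValueError('n_per_stratum is only valid if stratify_by is specified')
    if n is not None and n_per_stratum is not None:
        raise ValueError('only one of n or n_per_stratum can be specified')
    if fraction is not None:
        if n is not None or n_per_stratum is not None:
            raise ValueError('fraction is not usable with n or n_per_stratum')
        if not 0.0 <= fraction <= 1.0:
            raise ValueError('fraction must be between 0.0 and 1.0')
    if n is None and n_per_stratum is None and fraction is None:
        raise ValueError('a sample size is required')  # assumption of this sketch

    # Consider the rows in a shuffled order; the seed makes it reproducible.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    if stratify_by is None:
        if fraction is not None:
            n = int(len(shuffled) * fraction)  # truncation is an assumption
        return shuffled[:n]

    if n_per_stratum is None:
        raise NotImplementedError('sketch only models n_per_stratum with stratify_by')
    strata = defaultdict(list)
    for row in shuffled:
        strata[stratify_by(row)].append(row)
    return [row for members in strata.values() for row in members[:n_per_stratum]]


# Hypothetical sample data: 12 rows, 4 rows for each of three ages.
rows = [{'id': i, 'age': 20 + i % 3} for i in range(12)]
by_count = sample_rows(rows, n=5, seed=0)
by_fraction = sample_rows(rows, fraction=0.5, seed=1)
by_stratum = sample_rows(rows, n_per_stratum=2, stratify_by=lambda r: r['age'], seed=1)
print(len(by_count), len(by_fraction), len(by_stratum))
```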
Given the Table person containing the field 'age', we can create samples of the table in various ways:
Sample 100 rows from the above Table:
select()
Select columns or expressions from the DataFrame.
Signature:
- items (Any): expressions to be selected
- named_items (Any): named expressions to be selected
- DataFrame: A new DataFrame with the specified select list.
age >= 18, where 'age' is another column in table t:
show()
Signature:
tail()
Return the last n rows of the DataFrame, in insertion order of the underlying Table.
tail() is not supported for joins.
Signature:
n (int) = 10: Number of rows to select. Default is 10.
- DataFrameResultSet: A DataFrameResultSet with the last n rows of the DataFrame.
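The relationship between head() and tail() can be pictured with list slicing over rows kept in insertion order. This is a plain-Python model only: the `head` and `tail` functions below are hypothetical stand-ins that operate on a list, not the documented methods.

```python
def head(rows, n=10):
    """First n rows, in insertion order."""
    return rows[:n]


def tail(rows, n=10):
    """Last n rows, in insertion order (fewer if the list is shorter)."""
    return rows[max(len(rows) - n, 0):]


# Hypothetical sample data: 25 rows numbered in insertion order.
rows = list(range(25))
print(head(rows), tail(rows))
```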
to_coco_dataset()
Convert the dataframe to a COCO dataset.
This dataframe must return a single json-typed output column in the following format:
{
    'image': PIL.Image.Image,
    'annotations': [
        {
            'bbox': [x: int, y: int, w: int, h: int],
            'category': str | int,
        },
        …
    ],
}
Signature:
- Path: Path to the COCO dataset file.
to_pytorch_dataset()
Convert the dataframe to a pytorch IterableDataset suitable for parallel loading with torch.utils.data.DataLoader.
This method requires pyarrow >= 13, torch and torchvision to work.
This method serializes data so it can be read from disk efficiently and repeatedly without re-executing the query. This data is cached to disk for future re-use.
Signature:
image_format (str) = 'pt': format of the images. Can be 'pt' (pytorch tensor) or 'np' (numpy array). 'np' means image columns are returned as an RGB uint8 array of shape HxWxC. 'pt' means image columns are returned as a CxHxW tensor with values in [0,1] and type torch.float32 (the format output by torchvision.transforms.ToTensor()).
- torch.utils.data.IterableDataset: A pytorch IterableDataset. Columns become fields of the dataset, and rows are returned as dictionaries compatible with the default collation of torch.utils.data.DataLoader.
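The difference between the two image layouts can be shown without torch or numpy, using nested lists as a stand-in for arrays. This is an illustrative sketch only: `hwc_uint8_to_chw_float` is a hypothetical helper, and it mirrors the scaling by 255 that torchvision.transforms.ToTensor() applies to uint8 input.

```python
def hwc_uint8_to_chw_float(image):
    """Convert an HxWxC nested list of 0..255 ints to CxHxW floats in [0, 1]."""
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    return [
        [[image[y][x][c] / 255.0 for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# Hypothetical 1x2 RGB image: one red pixel and one green pixel.
np_style = [[[255, 0, 0], [0, 255, 0]]]
pt_style = hwc_uint8_to_chw_float(np_style)
# Shape goes from H=1, W=2, C=3 to C=3, H=1, W=2.
print(len(pt_style), len(pt_style[0]), len(pt_style[0][0]))
```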
where()
Filter rows based on a predicate.
Signature:
pred (exprs.Expr): the predicate to filter rows
- DataFrame: A new DataFrame with the specified predicates replacing the where-clause.