module pixeltable.io

Functions for importing and exporting Pixeltable data.

func create_label_studio_project()

Signature

create_label_studio_project(
    t: Table,
    label_config: str,
    name: str | None = None,
    title: str | None = None,
    media_import_method: Literal['post', 'file', 'url'] = 'post',
    col_mapping: dict[str, str] | None = None,
    sync_immediately: bool = True,
    s3_configuration: dict[str, Any] | None = None,
    **kwargs: Any
) -> UpdateStatus

Create a new Label Studio project and link it to the specified Table.

A tutorial notebook with fully worked examples can be found here: Using Label Studio for Annotations with Pixeltable

The required parameter label_config specifies the Label Studio project configuration, in XML format, as described in the Label Studio documentation. The linked project will have one column for each data field in the configuration; for example, if the configuration has an entry

<Image name="image_obj" value="$image"/>

then the linked project will have a column named image. In addition, the linked project will always have a JSON-typed column annotations representing the output.

By default, Pixeltable will link each of these columns to a column of the specified Table with the same name. If any of the data fields are missing, an exception will be raised. If the annotations column is missing, it will be created. The default names can be overridden by specifying an optional col_mapping, with Pixeltable column names as keys and Label Studio field names as values. In all cases, the Pixeltable columns must have types that are consistent with their corresponding Label Studio fields; otherwise, an exception will be raised.

The API key and URL for a valid Label Studio server must be specified in Pixeltable config. Either:

Set the LABEL_STUDIO_API_KEY and LABEL_STUDIO_URL environment variables; or
Specify api_key and url fields in the label-studio section of $PIXELTABLE_HOME/config.toml.

Requirements:

pip install label-studio-sdk
pip install boto3 (if using S3 import storage)

Parameters:

t (Table): The table to link to.
label_config (str): The Label Studio project configuration, in XML format.
name (str | None): An optional name for the new project in Pixeltable. If specified, must be a valid Pixeltable identifier and must not be the name of any other external data store linked to t. If not specified, a default name will be used of the form ls_project_0, ls_project_1, etc.
title (str | None): An optional title for the Label Studio project. This is the title that annotators will see inside Label Studio. Unlike name, it does not need to be an identifier and does not need to be unique. If not specified, the table name t.name will be used.
media_import_method (Literal['post', 'file', 'url'], default: 'post'): The method to use when transferring media files to Label Studio:
- post: Media will be sent to Label Studio via HTTP POST. This should generally only be used for prototyping; due to restrictions in Label Studio, it can only be used with projects that have just one data field, and it does not scale well.
- file: Media will be sent to Label Studio as a file on the local filesystem. This method can be used if Pixeltable and Label Studio are running on the same host.
- url: Media will be sent to Label Studio as externally accessible URLs. This method cannot be used with local media files or with media generated by computed columns.
col_mapping (dict[str, str] | None): An optional mapping of local column names to Label Studio fields.
sync_immediately (bool, default: True): If True, immediately perform an initial synchronization by exporting all rows of the table as Label Studio tasks.
s3_configuration (dict[str, Any] | None): If specified, S3 import storage will be configured for the new project. This can only be used with media_import_method='url', and if media_import_method='url' and any of the media data is referenced by s3:// URLs, then it must be specified in order for such media to display correctly in the Label Studio interface. The items in the s3_configuration dictionary correspond to kwarg parameters of the Label Studio connect_s3_import_storage method, as described in the Label Studio connect_s3_import_storage docs. bucket must be specified; all other parameters are optional. If credentials are not specified explicitly, Pixeltable will attempt to retrieve them from the environment (such as from ~/.aws/credentials). If a title is not specified, Pixeltable will use the default 'Pixeltable-S3-Import-Storage'. All other parameters use their Label Studio defaults.
kwargs (Any): Additional keyword arguments are passed to the start_project method in the Label Studio SDK, as described in the Label Studio start_project docs.

Returns:

UpdateStatus: An UpdateStatus representing the status of any synchronization operations that occurred.

Examples:

Create a Label Studio project whose tasks correspond to videos stored in the video_col column of the table tbl:

config = """
<View>
    <Video name="video_obj" value="$video_col"/>
    <Choices name="video-category" toName="video_obj" showInLine="true">
        <Choice value="city"/>
        <Choice value="food"/>
        <Choice value="sports"/>
    </Choices>
</View>
"""
create_label_studio_project(tbl, config)

Create a Label Studio project with the same configuration, using media_import_method='url', whose media are stored in an S3 bucket:

create_label_studio_project(
    tbl,
    config,
    media_import_method='url',
    s3_configuration={'bucket': 'my-bucket', 'region_name': 'us-east-2'},
)
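
As a rough illustration of how data fields relate to column names, the sketch below lists the fields referenced as value="$..." in a label configuration and checks them against a table's column names. The helper data_fields and the table_columns list are hypothetical and exist only for this example; this is not how Pixeltable itself parses the configuration.

```python
import xml.etree.ElementTree as ET


def data_fields(label_config: str) -> list[str]:
    """Return the field names referenced as value="$name" in a label config."""
    root = ET.fromstring(label_config.strip())
    fields: list[str] = []
    for element in root.iter():
        value = element.get('value', '')
        # Data fields are written as "$field"; plain values such as
        # <Choice value="city"/> are labels, not data fields.
        if value.startswith('$') and value[1:] not in fields:
            fields.append(value[1:])
    return fields


config = """
<View>
    <Image name="image_obj" value="$image"/>
    <Choices name="category" toName="image_obj">
        <Choice value="city"/>
        <Choice value="food"/>
    </Choices>
</View>
"""

fields = data_fields(config)
print(fields)  # ['image']

# Hypothetical column names of the table being linked.
table_columns = ['id', 'photo']
missing = [f for f in fields if f not in table_columns]
print(missing)  # ['image']
```

A field reported as missing here would need either a table column with that name or a col_mapping entry (Pixeltable column name as key, Label Studio field name as value).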

func export_images_as_fo_dataset()

Signature

export_images_as_fo_dataset(
    tbl: pxt.Table,
    images: exprs.Expr,
    image_format: str = 'webp',
    classifications: exprs.Expr | list[exprs.Expr] | dict[str, exprs.Expr] | None = None,
    detections: exprs.Expr | list[exprs.Expr] | dict[str, exprs.Expr] | None = None
) -> fo.Dataset

Export images from a Pixeltable table as a Voxel51 dataset. The data must consist of a single column (or expression) containing image data, along with optional additional columns containing labels. Currently, only classification and detection labels are supported.

The Working with Voxel51 in Pixeltable tutorial contains a fully worked example showing how to export data from a Pixeltable table and load it into Voxel51.

Images in the dataset that already exist on disk will be exported directly, in whatever format they are stored in. Images that are not already on disk (such as frames extracted using a frame_iterator) will first be written to disk in the specified image_format.

The label parameters accept one or more sets of labels of each type. If a single Expr is provided, then it will be exported as a single set of labels with a default name such as classifications. (The single set of labels may still contain multiple individual labels; see below.) If a list of Exprs is provided, then each one will be exported as a separate set of labels with a default name such as classifications, classifications_1, etc. If a dictionary of Exprs is provided, then each entry will be exported as a set of labels with the specified name.

Requirements:

pip install fiftyone

Parameters:

tbl (pxt.Table): The table from which to export data.
images (exprs.Expr): A column or expression that contains the images to export.
image_format (str, default: 'webp'): The format to use when writing out images for export.
classifications (exprs.Expr | list[exprs.Expr] | dict[str, exprs.Expr] | None): Optional image classification labels. If a single Expr is provided, it must be a table column or an expression that evaluates to a list of dictionaries. Each dictionary in the list corresponds to an image class and must have the following structure:
```
{'label': 'zebra', 'confidence': 0.325}
```
If multiple Exprs are provided, each one must evaluate to a list of such dictionaries.
detections (exprs.Expr | list[exprs.Expr] | dict[str, exprs.Expr] | None): Optional image detection labels. If a single Expr is provided, it must be a table column or an expression that evaluates to a list of dictionaries. Each dictionary in the list corresponds to an image detection, and must have the following structure:
```
{
    'label': 'giraffe',
    'confidence': 0.99,
    'bounding_box': [0.081, 0.836, 0.202, 0.136]  # [x, y, w, h], fractional coordinates
}
```
If multiple Exprs are provided, each one must evaluate to a list of such dictionaries.

Returns:

fo.Dataset: A Voxel51 dataset.

Examples:

Export the images in the image column of the table tbl as a Voxel51 dataset, using classification labels from tbl.classifications:

export_images_as_fo_dataset(
    tbl, tbl.image, classifications=tbl.classifications
)
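
Detection boxes are often produced in pixel units, while the structure above expects fractional [x, y, w, h] values. The following is a minimal sketch of that conversion. The helper to_detection is hypothetical, and it assumes the pixel box is given as the top-left corner plus width and height.

```python
def to_detection(
    label: str,
    confidence: float,
    box_px: tuple[float, float, float, float],
    image_size: tuple[int, int],
) -> dict:
    """Build one detection dict from a pixel-space (x, y, w, h) box."""
    x, y, w, h = box_px
    width, height = image_size
    return {
        'label': label,
        'confidence': confidence,
        # Divide by the image dimensions to obtain fractional coordinates.
        'bounding_box': [x / width, y / height, w / width, h / height],
    }


det = to_detection('giraffe', 0.99, (64, 240, 128, 120), (640, 480))
print(det['bounding_box'])  # [0.1, 0.5, 0.2, 0.25]
```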

func export_lancedb()

Signature

export_lancedb(
    table_or_query: pxt.Table | pxt.Query,
    db_uri: Path,
    table_name: str,
    batch_size_bytes: int = 134217728,
    if_exists: Literal['error', 'overwrite', 'append'] = 'error'
) -> None

Exports the data of a Table or Query to a LanceDB table. This uses LanceDB’s streaming interface for efficient table creation, via a sequence of in-memory pyarrow RecordBatches whose size can be controlled with the batch_size_bytes parameter.

Requirements:

pip install lancedb

Parameters:

table_or_query (pxt.Table | pxt.Query): Table or Query to export.
db_uri (Path): Local path to the LanceDB database.
table_name (str): Name of the table in the LanceDB database.
batch_size_bytes (int, default: 134217728): Maximum size in bytes for each batch.
if_exists (Literal['error', 'overwrite', 'append'], default: 'error'): Determines the behavior if the table already exists. Must be one of the following:
- 'error': raise an error
- 'overwrite': overwrite the existing table
- 'append': append to the existing table
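
For a sense of scale, the default batch_size_bytes of 134217728 is 128 MiB. The sketch below works through that arithmetic and gives a rough lower bound on the number of batches for a given amount of data. The helper estimated_batches is hypothetical; actual batch counts depend on how rows fill each batch.

```python
import math

DEFAULT_BATCH_SIZE_BYTES = 134_217_728
assert DEFAULT_BATCH_SIZE_BYTES == 128 * 1024 * 1024  # 128 MiB


def estimated_batches(total_bytes: int, batch_size_bytes: int = DEFAULT_BATCH_SIZE_BYTES) -> int:
    """Lower bound on the number of batches needed for total_bytes of data."""
    return math.ceil(total_bytes / batch_size_bytes)


print(estimated_batches(1_000_000_000))  # 8
```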

func export_parquet()

Signature

export_parquet(
    table_or_query: pxt.Table | pxt.Query,
    parquet_path: Path,
    partition_size_bytes: int = 100000000,
    inline_images: bool = False
) -> None

Exports a query result or table to one or more Parquet files. Requires pyarrow to be installed.

It additionally writes the Pixeltable metadata to a JSON file, since that metadata would otherwise not be available in the Parquet format.

Pixeltable column types are mapped to Parquet types as follows:

String: string
Int: int64
Float: float32
Bool: bool
Timestamp: timestamp[us, tz=UTC]
Date: date32
UUID: uuid
Binary: binary
Image: binary (when inline_images=True)
Audio, Video, Document: string (file paths)
Array: fixed_shape_tensor
Json: struct
- Schema is inferred from data via pyarrow.infer_type()
- Fields that contain empty dicts cannot be mapped to a Parquet type and will result in an exception

Parameters:

table_or_query (pxt.Table | pxt.Query): Table or Query to export.
parquet_path (Path): Path to the directory to write the Parquet files to.
partition_size_bytes (int, default: 100000000): The maximum target size for each chunk.
inline_images (bool, default: False): If True, images are stored inline in the Parquet file. This is useful for small images that will be imported as a PyTorch dataset, but it can be inefficient for large images, and the resulting files cannot be imported into Pixeltable. If False, an error is raised if the Query has any image column.
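
Because Json fields that contain empty dicts cannot be mapped to a Parquet type, it can help to look for them before exporting. The following is a small sketch of such a check in plain Python. The helper empty_dict_paths is hypothetical and is not part of Pixeltable.

```python
from typing import Any


def empty_dict_paths(value: Any, path: str = '$') -> list[str]:
    """Return path-like locations of every empty dict nested inside value."""
    if isinstance(value, dict):
        if not value:
            return [path]
        found: list[str] = []
        for key, item in value.items():
            found.extend(empty_dict_paths(item, f'{path}.{key}'))
        return found
    if isinstance(value, list):
        found = []
        for i, item in enumerate(value):
            found.extend(empty_dict_paths(item, f'{path}[{i}]'))
        return found
    return []


record = {
    'meta': {'tags': ['a', 'b'], 'extra': {}},
    'items': [{'attrs': {}}, {'attrs': {'k': 1}}],
}
print(empty_dict_paths(record))  # ['$.meta.extra', '$.items[0].attrs']
```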

func import_csv()

Signature

import_csv(
    tbl_name: str,
    filepath_or_buffer: str | os.PathLike,
    schema_overrides: dict[str, typing.Any] | None = None,
    primary_key: str | list[str] | None = None,
    num_retained_versions: int = 10,
    comment: str = '',
    **kwargs: Any
) -> pixeltable.catalog.table.Table

Creates a new base table from a CSV file. This is a convenience method and is equivalent to calling import_pandas(tbl_name, pd.read_csv(filepath_or_buffer, **kwargs), schema_overrides=schema_overrides). See the Pandas documentation for read_csv for more details.

Returns:

pixeltable.catalog.table.Table: A handle to the newly created Table.

func import_excel()

Signature

import_excel(
    tbl_name: str,
    io: str | os.PathLike,
    *,
    schema_overrides: dict[str, typing.Any] | None = None,
    primary_key: str | list[str] | None = None,
    num_retained_versions: int = 10,
    comment: str = '',
    **kwargs: Any
) -> pixeltable.catalog.table.Table

Creates a new base table from an Excel (.xlsx) file. This is a convenience method and is equivalent to calling import_pandas(tbl_name, pd.read_excel(io, **kwargs), schema_overrides=schema_overrides). See the Pandas documentation for read_excel for more details.

Returns:

pixeltable.catalog.table.Table: A handle to the newly created Table.

func import_huggingface_dataset()

Signature

import_huggingface_dataset(
    table_path: str,
    dataset: datasets.Dataset | datasets.DatasetDict | datasets.IterableDataset | datasets.IterableDatasetDict,
    *,
    schema_overrides: dict[str, Any] | None = None,
    primary_key: str | list[str] | None = None,
    **kwargs: Any
) -> pxt.Table

Creates a new base table from a Hugging Face dataset, or from a dataset dict with multiple splits. Requires the datasets library to be installed.

Hugging Face feature types are mapped to Pixeltable column types as follows:

Value(bool): Bool
Value(int*/uint*): Int
Value(float*): Float
Value(string/large_string): String
Value(timestamp*): Timestamp
Value(date*): Date
ClassLabel: String (converted to label names)
Sequence/LargeList of numeric types: Array
Sequence/LargeList of string: Json
Sequence/LargeList of dicts: Json
Array2D-Array5D: Array (preserves shape)
Image: Image
Audio: Audio
Video: Video
Translation/TranslationVariableLanguages: Json

Parameters:

table_path (str): Path to the table.
dataset (datasets.Dataset | datasets.DatasetDict | datasets.IterableDataset | datasets.IterableDatasetDict): An instance of any of the Hugging Face dataset classes: datasets.Dataset, datasets.DatasetDict, datasets.IterableDataset, datasets.IterableDatasetDict.
schema_overrides (dict[str, Any] | None): If specified, then for each (name, type) pair in schema_overrides, the column with name name will be given type type, instead of being inferred from the Dataset or DatasetDict. The keys in schema_overrides should be the column names of the Dataset or DatasetDict (whether or not they are valid Pixeltable identifiers).
primary_key (str | list[str] | None): The primary key of the table (see create_table()).
kwargs (Any): Additional arguments to pass to create_table. An argument of column_name_for_split must be provided if the source is a DatasetDict. The column with this name will contain the split information. If None, no split information will be stored.

Returns:

pxt.Table: A handle to the newly created Table.

func import_json()

Signature

import_json(
    tbl_path: str,
    filepath_or_url: str,
    *,
    schema_overrides: dict[str, Any] | None = None,
    primary_key: str | list[str] | None = None,
    num_retained_versions: int = 10,
    comment: str = '',
    **kwargs: Any
) -> pxt.Table

Creates a new base table from a JSON file. This is a convenience method and is equivalent to calling import_rows(tbl_path, json.loads(file_contents, **kwargs), ...), where file_contents is the contents of the specified filepath_or_url.

Parameters:

tbl_path (str): The name of the table to create.
filepath_or_url (str): The path or URL of the JSON file.
schema_overrides (dict[str, Any] | None): If specified, then columns in schema_overrides will be given the specified types (see import_rows()).
primary_key (str | list[str] | None): The primary key of the table (see create_table()).
num_retained_versions (int, default: 10): The number of retained versions of the table (see create_table()).
comment (str, default: ''): A comment to attach to the table (see create_table()).
kwargs (Any): Additional keyword arguments to pass to json.loads.

Returns:

pxt.Table: A handle to the newly created Table.
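
To make the json.loads equivalence concrete, the sketch below parses a JSON file into rows and shows how extra keyword arguments change the parsed values. The helper load_rows and the sample file are hypothetical. The check that the file holds an array of objects is an assumption based on the list-of-dictionaries form that import_rows() accepts.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def load_rows(path: str, **kwargs: Any) -> list[dict[str, Any]]:
    """Parse a JSON file and check that it is a list of row dictionaries."""
    data = json.loads(Path(path).read_text(encoding='utf-8'), **kwargs)
    if not isinstance(data, list) or not all(isinstance(row, dict) for row in data):
        raise ValueError('expected a JSON array of objects')
    return data


with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / 'films.json'
    json_path.write_text('[{"title": "Alien", "year": 1979}]', encoding='utf-8')
    rows = load_rows(str(json_path))
    # parse_int is a standard json.loads keyword; here it turns integers into floats.
    float_rows = load_rows(str(json_path), parse_int=float)

print(rows[0])        # {'title': 'Alien', 'year': 1979}
print(float_rows[0])  # {'title': 'Alien', 'year': 1979.0}
```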

func import_pandas()

Signature

import_pandas(
    tbl_name: str,
    df: pandas.core.frame.DataFrame,
    *,
    schema_overrides: dict[str, typing.Any] | None = None,
    primary_key: str | list[str] | None = None,
    num_retained_versions: int = 10,
    comment: str = ''
) -> pixeltable.catalog.table.Table

Creates a new base table from a Pandas DataFrame, with the specified name. The schema of the table will be inferred from the DataFrame. The column names of the new table will be identical to those in the DataFrame, as long as they are valid Pixeltable identifiers. If a column name is not a valid Pixeltable identifier, it will be normalized according to the following procedure:

first replace any non-alphanumeric characters with underscores;
then, preface the result with the letter ‘c’ if it begins with a number or an underscore;
then, if there are any duplicate column names, suffix the duplicates with ‘_2’, ‘_3’, etc., in column order.

Parameters:

tbl_name (str): The name of the table to create.
df (pandas.core.frame.DataFrame): The Pandas DataFrame.
schema_overrides (dict[str, typing.Any] | None): If specified, then for each (name, type) pair in schema_overrides, the column with name name will be given type type, instead of being inferred from the DataFrame. The keys in schema_overrides should be the column names of the DataFrame (whether or not they are valid Pixeltable identifiers).

Returns:

pixeltable.catalog.table.Table: A handle to the newly created Table.
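
The three normalization steps described above can be sketched in plain Python as follows. This is an illustration of the documented procedure, not Pixeltable's actual implementation. The function normalize_column_names is hypothetical, and it does not handle edge cases such as a generated suffix colliding with an existing name.

```python
import re


def normalize_column_names(names: list[str]) -> list[str]:
    """Apply the three documented normalization steps to a list of column names."""
    seen: dict[str, int] = {}
    result: list[str] = []
    for name in names:
        # Step 1: replace non-alphanumeric characters with underscores.
        normalized = re.sub(r'[^a-zA-Z0-9]', '_', name)
        # Step 2: prefix 'c' if the result begins with a digit or an underscore.
        if normalized[:1].isdigit() or normalized.startswith('_'):
            normalized = 'c' + normalized
        # Step 3: suffix the second, third, ... occurrence with _2, _3, ...
        seen[normalized] = seen.get(normalized, 0) + 1
        if seen[normalized] > 1:
            normalized = f'{normalized}_{seen[normalized]}'
        result.append(normalized)
    return result


print(normalize_column_names(['Unit Price', '2023', '_id', 'Unit-Price']))
# ['Unit_Price', 'c2023', 'c_id', 'Unit_Price_2']
```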

func import_parquet()

Signature

import_parquet(
    table: str,
    *,
    parquet_path: str,
    schema_overrides: dict[str, Any] | None = None,
    primary_key: str | list[str] | None = None,
    **kwargs: Any
) -> pxt.Table

Creates a new base table from a Parquet file or set of files. Requires pyarrow to be installed.

Parameters:

table (str): Fully qualified name of the table to import the data into.
parquet_path (str): Path to an individual Parquet file or directory of Parquet files.
schema_overrides (dict[str, Any] | None): If specified, then for each (name, type) pair in schema_overrides, the column with name name will be given type type, instead of being inferred from the Parquet dataset. The keys in schema_overrides should be the column names of the Parquet dataset (whether or not they are valid Pixeltable identifiers).
primary_key (str | list[str] | None): The primary key of the table (see create_table()).
kwargs (Any): Additional arguments to pass to create_table.

Returns:

pxt.Table: A handle to the newly created table.

func import_rows()

Signature

import_rows(
    tbl_path: str,
    rows: list[dict[str, Any]],
    *,
    schema_overrides: dict[str, Any] | None = None,
    primary_key: str | list[str] | None = None,
    num_retained_versions: int = 10,
    comment: str = ''
) -> pxt.Table

Creates a new base table from a list of dictionaries. The dictionaries must be of the form {column_name: value, ...}. Pixeltable will attempt to infer the schema of the table from the supplied data, using the most specific type that can represent all the values in a column.

If schema_overrides is specified, then for each entry (column_name, type) in schema_overrides, Pixeltable will force the specified column to the specified type (and will not attempt any type inference for that column).

All column types of the new table will be nullable unless explicitly specified as non-nullable in schema_overrides.

Parameters:

tbl_path (str): The qualified name of the table to create.
rows (list[dict[str, Any]]): The list of dictionaries to import.
schema_overrides (dict[str, Any] | None): If specified, then columns in schema_overrides will be given the specified types as described above.
primary_key (str | list[str] | None): The primary key of the table (see create_table()).
num_retained_versions (int, default: 10): The number of retained versions of the table (see create_table()).
comment (str, default: ''): A comment to attach to the table (see create_table()).

Returns:

pxt.Table: A handle to the newly created Table.
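
Since types are inferred per column from the supplied values, it can be useful to inspect the rows before importing them. The sketch below collects the Python types observed in each column and flags columns with more than one non-null type, which may be candidates for schema_overrides. The helper observed_types is hypothetical and does not reproduce Pixeltable's inference rules.

```python
from typing import Any


def observed_types(rows: list[dict[str, Any]]) -> dict[str, set[str]]:
    """Map each column name to the set of Python type names seen in rows."""
    summary: dict[str, set[str]] = {}
    for row in rows:
        for column, value in row.items():
            summary.setdefault(column, set()).add(type(value).__name__)
    return summary


rows = [
    {'id': 1, 'score': 0.5, 'note': 'first'},
    {'id': 2, 'score': 3, 'note': None},
]
summary = observed_types(rows)

# Columns that mix several non-null Python types.
mixed = sorted(col for col, types in summary.items() if len(types - {'NoneType'}) > 1)
print(mixed)  # ['score']
```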
