Export data for analytics - Pixeltable Documentation

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

Export Pixeltable tables and queries to CSV, JSON, and Apache Iceberg for spreadsheets, scripts, and the lakehouse.

Problem

You’ve used Pixeltable to turn unstructured files (images, video, audio, documents) into structured columns: labels, metadata, embeddings, scores. Now an analytics tool or another team needs that data. Pixeltable columns aren’t always plain scalars, and formats differ in what they preserve. Three things about your data decide which format fits:

Nested structures: are you exporting embedding arrays or metadata objects?
Rich types: do you have columns whose type a reader couldn’t reliably infer from the values (timestamps, floats vs. ints, zero-padded IDs)?
Media references: do the exported media paths need to point to remote storage (S3, GCS, R2) so others can fetch the files? An export carries paths, not the files, and a path to a file on one person’s machine is oftentimes not useful to anyone else.

Solution

What’s in this recipe:

Export a table or filtered query to CSV, JSON, and Apache Iceberg
Keep media references portable instead of writing local file paths
Use advanced Iceberg options for existing tables, embeddings, and type overrides

Pixeltable’s export helpers live in the pxt.io module, one function per destination. Here’s how each handles the three factors above:

export_iceberg() streams into an Apache Iceberg catalog you supply; the other two write a single file. All three take a table or query as the first argument, so you pick exactly which columns to export, typically the structured columns your destination needs, such as labels, metadata, and embeddings. Exports mirror how Pixeltable already stores your data: media files live in a storage layer (Pixeltable’s media store, or your own S3/R2 bucket), while your tables hold the structured columns plus paths to those files. So an export sends the structured columns to your analytics format and writes the path to each media file, never the bytes, which is why others can fetch a file only when its path points to shared storage (covered below).

Setup

%pip install -qU pixeltable pyiceberg

import pixeltable as pxt
import pixeltable.functions as pxtf

# Create a fresh directory
pxt.drop_dir('export_demo', force=True, if_not_exists='ignore')
pxt.create_dir('export_demo')

Create sample data

# A table of images, standing in for a multimodal pipeline
images = pxt.create_table(
    'export_demo/images', {'image': pxt.Image, 'label': pxt.String}
)

base = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images'
images.insert(
    [
        {'image': f'{base}/000000000036.jpg', 'label': 'cat'},
        {'image': f'{base}/000000000090.jpg', 'label': 'bicycle'},
        {'image': f'{base}/000000000139.jpg', 'label': 'living room'},
    ]
)

Created table ‘images’.
Inserted 3 rows with 0 errors in 0.59 s (5.04 rows/s)
3 rows inserted.

We derive columns with Pixeltable’s built-in image functions, which run locally and need no API key. The pattern is the same for model-backed work: swap pxtf.image.width for pxtf.openai.vision, a Hugging Face object detector, or your own @pxt.udf, and everything downstream, including the export, is unchanged.

images.add_computed_column(width=pxtf.image.width(images.image))
images.add_computed_column(height=pxtf.image.height(images.image))
images.add_computed_column(mode=pxtf.image.mode(images.image))

# The derived columns we'll export
features = images.select(
    images.label, images.width, images.height, images.mode
)
features.collect()

Added 3 column values with 0 errors in 0.05 s (61.38 rows/s)
Added 3 column values with 0 errors in 0.03 s (116.68 rows/s)
Added 3 column values with 0 errors in 0.02 s (120.96 rows/s)

Export to CSV

export_csv() writes a query or table to a single file. It encodes nested Json and Array values as JSON strings and skips binary columns.

import pandas as pd
import tempfile
from pathlib import Path

out_dir = Path(tempfile.mkdtemp())

pxt.io.export_csv(features, out_dir / 'image_features.csv')

# Read it back the way an external tool would
pd.read_csv(out_dir / 'image_features.csv')

Export to JSON

export_json() writes JSONL (one JSON object per line), which streams well and is easy to read line by line. Nested Json and Array columns keep their native structure rather than collapsing to a string.

pxt.io.export_json(features, out_dir / 'image_features.jsonl')
pd.read_json(out_dir / 'image_features.jsonl', lines=True)

Export to Apache Iceberg

export_iceberg() streams rows into a catalog you provide. For a self-contained demo we use PyIceberg’s SqlCatalog with a local SQLite metadata store and a directory warehouse. Point this at your real catalog (REST, Glue, Hive) in production and nothing else changes.

from pyiceberg.catalog.sql import SqlCatalog

warehouse = out_dir / 'warehouse'
warehouse.mkdir(parents=True, exist_ok=True)

catalog = SqlCatalog(
    'demo',
    uri=f'sqlite:///{warehouse}/catalog.db',
    warehouse=f'file://{warehouse}',
)

Pass a table or query, the catalog, and a table name. Iceberg names are namespace-qualified, written as namespace.table: in analytics.image_features, analytics is the namespace (a grouping of tables, like a schema in a SQL database) and image_features is the table. The namespace is created automatically if it’s new. Reading the table back returns the typed schema Iceberg stored, not just the values:

pxt.io.export_iceberg(features, catalog, 'analytics.image_features')

# The table carries a typed schema, not just a header row of values
table = catalog.load_table('analytics.image_features')
table.schema()

Schema(NestedField(field_id=1, name=‘label’, field_type=StringType(), required=False), NestedField(field_id=2, name=‘width’, field_type=LongType(), required=False), NestedField(field_id=3, name=‘height’, field_type=LongType(), required=False), NestedField(field_id=4, name=‘mode’, field_type=StringType(), required=False), schema_id=0, identifier_field_ids=[])

# And the rows themselves (Spark, DuckDB, or Snowflake would query the same table)
table.scan().to_arrow().to_pandas()

Advanced Iceberg options

export_iceberg() has a few extra controls for writing to a real lakehouse. Handle existing tables. You control what happens when the target table already exists with the if_exists parameter:

replace writes to a temporary table and swaps it in only on success, so a failed export leaves the existing table alone. (export_csv and export_json simply overwrite the file.)

# image_features already holds the 3 rows we exported above;
# appending the same 3 rows makes 6
pxt.io.export_iceberg(
    features, catalog, 'analytics.image_features', if_exists='append'
)

# Reload to see the new snapshot, then count the rows
catalog.load_table('analytics.image_features').scan().to_arrow().num_rows

Export embeddings. Embeddings and feature vectors are pxt.Array columns. Iceberg has no fixed-shape tensor type, so project a fixed-shape array to a list with .to_list() first; variable-shape arrays already map to Iceberg lists. (CSV and JSON serialize arrays as JSON automatically.)

import numpy as np

vectors = pxt.create_table(
    'export_demo/vectors',
    {'doc_id': pxt.Int, 'embedding': pxt.Array[(4,), pxt.Float]},  # type: ignore[misc]
)
vectors.insert(
    [
        {
            'doc_id': 0,
            'embedding': np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32),
        },
        {
            'doc_id': 1,
            'embedding': np.array([0.5, 0.6, 0.7, 0.8], dtype=np.float32),
        },
    ]
)

pxt.io.export_iceberg(
    vectors.select(vectors.doc_id, embedding=vectors.embedding.to_list()),
    catalog,
    'analytics.embeddings',
)
catalog.load_table('analytics.embeddings').scan().to_arrow().to_pandas()

Created table ‘vectors’.
Inserted 2 rows with 0 errors in 0.00 s (421.55 rows/s)

Override column types. Iceberg infers types from the Pixeltable schema. Use schema_overrides to pin a specific PyArrow type, for example to store a 32-bit pxt.Float as a 64-bit double.

import pyarrow as pa

images.add_computed_column(aspect_ratio=images.width / images.height)

pxt.io.export_iceberg(
    images.select(images.label, images.aspect_ratio),
    catalog,
    'analytics.image_ratios',
    schema_overrides={'aspect_ratio': pa.float64()},
)
catalog.load_table('analytics.image_ratios').schema().as_arrow()

Added 3 column values with 0 errors in 0.03 s (112.61 rows/s)
label: large_string
  — field metadata —
  PARQUET:field_id: ‘1’
aspect_ratio: double
  — field metadata —
  PARQUET:field_id: ‘2’

Export a filtered query

Every export function takes a query, so you can filter rows and project columns before writing, in any format. Here we keep only wide images and two columns.

wide = images.where(images.width > 500).select(images.label, images.width)

pxt.io.export_csv(wide, out_dir / 'wide_images.csv')
pd.read_csv(out_dir / 'wide_images.csv')

Keep media references portable

An export carries a reference to each media file, not the bytes. A media column gives you two references to choose from:

Its value (e.g. videos.video) is the local cache path on the machine that ran the export. It resolves nowhere else.
Its .fileurl (e.g. videos.video.fileurl) is where the file is stored: the https://, s3://, or gs:// address it was inserted from, or a file:// path if you inserted a local file.

The export writes a reference, not the bytes, so it can’t make a file reachable on its own. The recipient can get to your media only if it already lives in storage they can access too, which in practice means shared cloud storage (S3, GCS, R2). Media inserted from a cloud bucket or URL exports an address anyone with access can open; media inserted from a local file exports a file:// path that resolves only on the machine that ran the export. To share local media, upload it to cloud storage first (see Upload media to S3 and cloud storage). How each format handles media:

export_csv() and export_json() export .fileurl for you.
export_iceberg() inlines Image columns as bytes, but writes the local cache path for Video, Audio, and Document. Select .fileurl yourself when the table needs to travel.

The video below is inserted from a URL, so its .fileurl is that URL, and CSV exports it directly:

videos = pxt.create_table(
    'export_demo/videos', {'video': pxt.Video, 'title': pxt.String}
)
videos.insert(
    [
        {
            'video': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/tests/data/videos/bangkok_half_res.mp4',
            'title': 'bangkok',
        }
    ]
)

Created table ‘videos’.
Inserted 1 row with 0 errors in 0.65 s (1.53 rows/s)
1 row inserted.

pxt.io.export_csv(
    videos.select(videos.title, videos.video), out_dir / 'videos.csv'
)
# to_dict keeps the full URL visible (a DataFrame would truncate it)
pd.read_csv(out_dir / 'videos.csv').to_dict('records')

[{‘title’: ‘bangkok’,
  ‘video’: ‘https://raw.githubusercontent.com/pixeltable/pixeltable/main/tests/data/videos/bangkok_half_res.mp4&#x27;\}\]

Exported directly to Iceberg, the same column becomes a machine-local path:

pxt.io.export_iceberg(
    videos.select(videos.video), catalog, 'analytics.video_local'
)
catalog.load_table('analytics.video_local').scan().to_arrow().to_pylist()

[{‘video’: ‘/Users/username/.pixeltable/file_cache/0eabea0a34ca471ea07a153f6485d552_0_b2eb75710e902d63b0b8ff19dc432062618731a0ec7df6f68beee7ad5fac98be.mp4’}]

Export .fileurl instead and Iceberg gets the portable URL too. With media in S3 or R2, this is the s3:///gs:// URL a lakehouse engine can open:

videos.add_computed_column(duration=pxtf.video.get_duration(videos.video))

pxt.io.export_iceberg(
    videos.select(
        videos.title, videos.duration, video_url=videos.video.fileurl
    ),
    catalog,
    'analytics.videos',
)
catalog.load_table('analytics.videos').scan().to_arrow().to_pylist()

Added 1 column value with 0 errors in 0.04 s (28.16 rows/s)
[{‘title’: ‘bangkok’,
  ‘duration’: 18.479999542236328,
  ‘video_url’: ‘https://raw.githubusercontent.com/pixeltable/pixeltable/main/tests/data/videos/bangkok_half_res.mp4&#x27;\}\]

Explanation

Every column has a Pixeltable type, from scalars like Int and Float to the multimodal types (Image, Video, Audio, Document, Array, Json) that make it possible to store images, video, and embeddings in the same table. On export, each type maps to the closest thing the destination format supports:

Iceberg details:

Rows are streamed as PyArrow RecordBatches; tune the in-memory batch size with batch_size_bytes (default 128 MB) for large exports.
Pass namespace.table; the namespace is created automatically, and a bare name is rejected.
A fixed-shape pxt.Array is rejected (project it with .to_list()), as is a pxt.Json column that can’t reduce to one concrete type. Pin those with schema_overrides.

​Problem

​Solution

​Setup

​Create sample data

​Export to CSV

​Export to JSON

​Export to Apache Iceberg

​Advanced Iceberg options

​Export a filtered query

​Keep media references portable

​Explanation

​See also

Problem

Solution

Setup

Create sample data

Export to CSV

Export to JSON

Export to Apache Iceberg

Advanced Iceberg options

Export a filtered query

Keep media references portable

Explanation

See also