This documentation page is also available as an interactive notebook. You can launch the notebook in
Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the
above links.
Export Pixeltable tables and queries to CSV, JSON, and Apache Iceberg
for spreadsheets, scripts, and the lakehouse.
Problem
You’ve used Pixeltable to turn unstructured files (images, video, audio,
documents) into structured columns: labels, metadata, embeddings,
scores. Now an analytics tool or another team needs that data.
Pixeltable columns aren’t always plain scalars, and formats differ in
what they preserve. Three things about your data decide which format
fits:
- Nested structures: are you exporting embedding arrays or metadata
objects?
- Rich types: do you have columns whose type a reader couldn’t
reliably infer from the values (timestamps, floats vs. ints,
zero-padded IDs)?
- Media references: do the exported media paths need to point to
remote storage (S3, GCS, R2) so others can fetch the files? An export
carries paths, not the files, and a path to a file on one person’s
machine is oftentimes not useful to anyone else.
Solution
What’s in this recipe:
- Export a table or filtered query to CSV, JSON, and Apache Iceberg
- Keep media references portable instead of writing local file paths
- Use advanced Iceberg options for existing tables, embeddings, and type
overrides
Pixeltable’s export helpers live in the pxt.io module, one function
per destination. Here’s how each handles the three factors above:
export_iceberg() streams into an Apache
Iceberg catalog you supply; the other two
write a single file.
All three take a table or query as the first argument, so you pick
exactly which columns to export, typically the structured columns your
destination needs, such as labels, metadata, and embeddings.
Exports mirror how Pixeltable already stores your data: media files live
in a storage layer (Pixeltable’s media store, or your own S3/R2 bucket),
while your tables hold the structured columns plus paths to those files.
So an export sends the structured columns to your analytics format and
writes the path to each media file, never the bytes, which is why others
can fetch a file only when its path points to shared storage (covered
below).
Setup
%pip install -qU pixeltable pyiceberg
import pixeltable as pxt
import pixeltable.functions as pxtf
# Create a fresh directory
pxt.drop_dir('export_demo', force=True, if_not_exists='ignore')
pxt.create_dir('export_demo')
Create sample data
# A table of images, standing in for a multimodal pipeline
images = pxt.create_table(
'export_demo/images', {'image': pxt.Image, 'label': pxt.String}
)
base = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images'
images.insert(
[
{'image': f'{base}/000000000036.jpg', 'label': 'cat'},
{'image': f'{base}/000000000090.jpg', 'label': 'bicycle'},
{'image': f'{base}/000000000139.jpg', 'label': 'living room'},
]
)
Created table ‘images’.
Inserted 3 rows with 0 errors in 0.59 s (5.04 rows/s)
3 rows inserted.
We derive columns with Pixeltable’s built-in image functions, which run
locally and need no API key. The pattern is the same for model-backed
work: swap pxtf.image.width for pxtf.openai.vision, a Hugging Face
object detector, or your own @pxt.udf, and everything downstream,
including the export, is unchanged.
images.add_computed_column(width=pxtf.image.width(images.image))
images.add_computed_column(height=pxtf.image.height(images.image))
images.add_computed_column(mode=pxtf.image.mode(images.image))
# The derived columns we'll export
features = images.select(
images.label, images.width, images.height, images.mode
)
features.collect()
Added 3 column values with 0 errors in 0.05 s (61.38 rows/s)
Added 3 column values with 0 errors in 0.03 s (116.68 rows/s)
Added 3 column values with 0 errors in 0.02 s (120.96 rows/s)
Export to CSV
export_csv() writes a query or table to a single file. It encodes
nested Json and Array values as JSON strings and skips binary
columns.
import pandas as pd
import tempfile
from pathlib import Path
out_dir = Path(tempfile.mkdtemp())
pxt.io.export_csv(features, out_dir / 'image_features.csv')
# Read it back the way an external tool would
pd.read_csv(out_dir / 'image_features.csv')
Export to JSON
export_json() writes JSONL (one JSON object
per line), which streams well and is easy to read line by line. Nested
Json and Array columns keep their native structure rather than
collapsing to a string.
pxt.io.export_json(features, out_dir / 'image_features.jsonl')
pd.read_json(out_dir / 'image_features.jsonl', lines=True)
Export to Apache Iceberg
export_iceberg() streams rows into a catalog you provide. For a
self-contained demo we use PyIceberg’s
SqlCatalog with a local SQLite metadata store and a directory
warehouse. Point this at your real catalog (REST, Glue, Hive) in
production and nothing else changes.
from pyiceberg.catalog.sql import SqlCatalog
warehouse = out_dir / 'warehouse'
warehouse.mkdir(parents=True, exist_ok=True)
catalog = SqlCatalog(
'demo',
uri=f'sqlite:///{warehouse}/catalog.db',
warehouse=f'file://{warehouse}',
)
Pass a table or query, the catalog, and a table name. Iceberg names are
namespace-qualified, written as namespace.table: in
analytics.image_features, analytics is the namespace (a grouping of
tables, like a schema in a SQL database) and image_features is the
table. The namespace is created automatically if it’s new.
Reading the table back returns the typed schema Iceberg stored, not just
the values:
pxt.io.export_iceberg(features, catalog, 'analytics.image_features')
# The table carries a typed schema, not just a header row of values
table = catalog.load_table('analytics.image_features')
table.schema()
Schema(NestedField(field_id=1, name=‘label’, field_type=StringType(), required=False), NestedField(field_id=2, name=‘width’, field_type=LongType(), required=False), NestedField(field_id=3, name=‘height’, field_type=LongType(), required=False), NestedField(field_id=4, name=‘mode’, field_type=StringType(), required=False), schema_id=0, identifier_field_ids=[])
# And the rows themselves (Spark, DuckDB, or Snowflake would query the same table)
table.scan().to_arrow().to_pandas()
Advanced Iceberg options
export_iceberg() has a few extra controls for writing to a real
lakehouse.
Handle existing tables. You control what happens when the target
table already exists with the if_exists parameter:
replace writes to a temporary table and swaps it in only on success,
so a failed export leaves the existing table alone. (export_csv and
export_json simply overwrite the file.)
# image_features already holds the 3 rows we exported above;
# appending the same 3 rows makes 6
pxt.io.export_iceberg(
features, catalog, 'analytics.image_features', if_exists='append'
)
# Reload to see the new snapshot, then count the rows
catalog.load_table('analytics.image_features').scan().to_arrow().num_rows
6
Export embeddings. Embeddings and feature vectors are pxt.Array
columns. Iceberg has no fixed-shape tensor type, so project a
fixed-shape array to a list with .to_list() first; variable-shape
arrays already map to Iceberg lists. (CSV and JSON serialize arrays as
JSON automatically.)
import numpy as np
vectors = pxt.create_table(
'export_demo/vectors',
{'doc_id': pxt.Int, 'embedding': pxt.Array[(4,), pxt.Float]}, # type: ignore[misc]
)
vectors.insert(
[
{
'doc_id': 0,
'embedding': np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32),
},
{
'doc_id': 1,
'embedding': np.array([0.5, 0.6, 0.7, 0.8], dtype=np.float32),
},
]
)
pxt.io.export_iceberg(
vectors.select(vectors.doc_id, embedding=vectors.embedding.to_list()),
catalog,
'analytics.embeddings',
)
catalog.load_table('analytics.embeddings').scan().to_arrow().to_pandas()
Created table ‘vectors’.
Inserted 2 rows with 0 errors in 0.00 s (421.55 rows/s)
Override column types. Iceberg infers types from the Pixeltable
schema. Use schema_overrides to pin a specific PyArrow type, for
example to store a 32-bit pxt.Float as a 64-bit double.
import pyarrow as pa
images.add_computed_column(aspect_ratio=images.width / images.height)
pxt.io.export_iceberg(
images.select(images.label, images.aspect_ratio),
catalog,
'analytics.image_ratios',
schema_overrides={'aspect_ratio': pa.float64()},
)
catalog.load_table('analytics.image_ratios').schema().as_arrow()
Added 3 column values with 0 errors in 0.03 s (112.61 rows/s)
label: large_string
— field metadata —
PARQUET:field_id: ‘1’
aspect_ratio: double
— field metadata —
PARQUET:field_id: ‘2’
Export a filtered query
Every export function takes a query, so you can filter rows and project
columns before writing, in any format. Here we keep only wide images and
two columns.
wide = images.where(images.width > 500).select(images.label, images.width)
pxt.io.export_csv(wide, out_dir / 'wide_images.csv')
pd.read_csv(out_dir / 'wide_images.csv')
An export carries a reference to each media file, not the bytes. A
media column gives you two references to choose from:
- Its value (e.g.
videos.video) is the local cache path on the
machine that ran the export. It resolves nowhere else.
- Its
.fileurl (e.g. videos.video.fileurl) is where the file is
stored: the https://, s3://, or gs:// address it was inserted
from, or a file:// path if you inserted a local file.
The export writes a reference, not the bytes, so it can’t make a file
reachable on its own. The recipient can get to your media only if it
already lives in storage they can access too, which in practice means
shared cloud storage (S3, GCS, R2). Media inserted from a cloud bucket
or URL exports an address anyone with access can open; media inserted
from a local file exports a file:// path that resolves only on the
machine that ran the export. To share local media, upload it to cloud
storage first (see Upload media to S3 and cloud
storage).
How each format handles media:
export_csv() and export_json() export .fileurl for you.
export_iceberg() inlines Image columns as bytes, but writes the
local cache path for Video, Audio, and Document. Select
.fileurl yourself when the table needs to travel.
The video below is inserted from a URL, so its .fileurl is that URL,
and CSV exports it directly:
videos = pxt.create_table(
'export_demo/videos', {'video': pxt.Video, 'title': pxt.String}
)
videos.insert(
[
{
'video': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/tests/data/videos/bangkok_half_res.mp4',
'title': 'bangkok',
}
]
)
Created table ‘videos’.
Inserted 1 row with 0 errors in 0.65 s (1.53 rows/s)
1 row inserted.
pxt.io.export_csv(
videos.select(videos.title, videos.video), out_dir / 'videos.csv'
)
# to_dict keeps the full URL visible (a DataFrame would truncate it)
pd.read_csv(out_dir / 'videos.csv').to_dict('records')
[{‘title’: ‘bangkok’,
‘video’: ‘https://raw.githubusercontent.com/pixeltable/pixeltable/main/tests/data/videos/bangkok_half_res.mp4'\}\]
Exported directly to Iceberg, the same column becomes a machine-local
path:
pxt.io.export_iceberg(
videos.select(videos.video), catalog, 'analytics.video_local'
)
catalog.load_table('analytics.video_local').scan().to_arrow().to_pylist()
[{‘video’: ‘/Users/username/.pixeltable/file_cache/0eabea0a34ca471ea07a153f6485d552_0_b2eb75710e902d63b0b8ff19dc432062618731a0ec7df6f68beee7ad5fac98be.mp4’}]
Export .fileurl instead and Iceberg gets the portable URL too. With
media in S3 or R2, this is the s3:///gs:// URL a lakehouse engine
can open:
videos.add_computed_column(duration=pxtf.video.get_duration(videos.video))
pxt.io.export_iceberg(
videos.select(
videos.title, videos.duration, video_url=videos.video.fileurl
),
catalog,
'analytics.videos',
)
catalog.load_table('analytics.videos').scan().to_arrow().to_pylist()
Added 1 column value with 0 errors in 0.04 s (28.16 rows/s)
[{‘title’: ‘bangkok’,
‘duration’: 18.479999542236328,
‘video_url’: ‘https://raw.githubusercontent.com/pixeltable/pixeltable/main/tests/data/videos/bangkok_half_res.mp4'\}\]
Explanation
Every column has a Pixeltable
type, from scalars
like Int and Float to the multimodal types (Image, Video,
Audio, Document, Array, Json) that make it possible to store
images, video, and embeddings in the same table. On export, each type
maps to the closest thing the destination format supports:
Iceberg details:
- Rows are streamed as PyArrow
RecordBatches; tune the in-memory batch
size with batch_size_bytes (default 128 MB) for large exports.
- Pass
namespace.table; the namespace is created automatically, and a
bare name is rejected.
- A fixed-shape
pxt.Array is rejected (project it with .to_list()),
as is a pxt.Json column that can’t reduce to one concrete type. Pin
those with schema_overrides.
See also