Problem
You need to iterate on transformation logic before running it on your entire dataset, especially for expensive operations like API calls or model inference.
Solution
What’s in this recipe:
- Test transformations on sample rows before applying to your full dataset
- Save expressions as variables to guarantee consistent logic
- Apply the iterate-then-add workflow with built-in functions, expressions, and custom UDFs
- Annotate columns with comments and custom metadata using ColumnSpec
Use .select() with .collect() to preview transformations; nothing
is stored in your table. If you want to collect only the first few rows,
use .head(n) instead of .collect(). Once you’re satisfied with the
results, use .add_computed_column() with the same expression to
persist the transformation across your full table.
This workflow applies to any data type in Pixeltable: images, videos,
audio files, documents, and structured tabular data. This recipe uses
text data and shows four examples:
- Testing built-in functions on sample data
- Saving expressions as variables to ensure consistency
- Iterating with custom user-defined functions (UDFs)
- Annotating columns with comments and custom metadata
Setup
Create sample data
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘demo_project’.
<pixeltable.catalog.dir.Dir at 0x15315e1d0>
Created table ‘lyrics’.
Inserted 6 rows with 0 errors in 0.01 s (916.65 rows/s)
6 rows inserted.
Example 1: built-in functions
Iterate with built-in functions, then add to the table.
Added 6 column values with 0 errors in 0.04 s (158.08 rows/s)
6 rows updated.
Example 2: save and reuse expressions
Save an expression as a variable to guarantee the same logic in both iterate and add steps.
Added 6 column values with 0 errors in 0.02 s (348.64 rows/s)
6 rows updated.
- Built-in functions: resize_expr = t.image.resize((224, 224))
- UDFs: watermark_expr = add_watermark(t.image, '© 2024')
- Chained operations: processed_expr = t.image.resize((224, 224)).rotate(90)
- Write the expression once, use it twice
- No copy-paste—reuse the same logic
- Easy to iterate: change in one place, test again
Example 3: custom UDF
Iterate with a user-defined function, then add to the table.
Added 6 column values with 0 errors in 0.02 s (312.11 rows/s)
6 rows updated.
Example 4: annotate columns with metadata
Use ColumnSpec to attach a comment or custom metadata when adding
columns. Comments appear in describe() output, while custom_metadata
stores arbitrary data (tags, version info, config) that you can retrieve
with get_metadata().
Explanation
How the iterate-then-add workflow works: Queries and computed columns serve different purposes. Queries let you test transformations on sample rows without storing anything. Once you’re satisfied with the results, you use the exact same expression with .add_computed_column() to persist it across your entire table.
This workflow is especially valuable for expensive operations—API calls,
model inference, complex image processing—where you want to validate
logic before processing your full dataset. Test on 2-3 rows to catch
errors early, then commit once.
To customize this workflow:
- Sample size: Use .head(n) to collect only the first n rows: .head(1) for single-row testing, .head(10) for broader validation, or .collect() to collect all rows
- Save expressions: Store transformations as variables (Example 2) to guarantee identical logic in both iterate and add steps
- Chain transformations: Test multiple operations together. .select(t.text.upper().split()) works just like single operations
- Use with any data type: This pattern works with images, videos, audio, and documents, not just text. For multimodal data, visual inspection during iteration is especially valuable
In a traditional database, .select() just picks which columns to view.
In Pixeltable, .select() also lets you compute new transformations on
the fly—define new columns without storing them. This makes .select()
perfect for testing transformations before you commit them.
When you use .select(), you’re creating a query. Queries are temporary
operations that retrieve and transform data from tables—they don’t store
anything. Queries use lazy evaluation, meaning they don’t execute until
you call .collect(). You must use .collect() to execute the query
and return results. .head(n) is a convenience method that collects
only the first n rows instead of all rows. Use .head(n) when iterating
to get fast feedback without processing your entire dataset.
Nothing is stored in your table when you run queries. You can test
different approaches quickly without affecting your data. You can store
query results in a Python variable to work with them in your session.
.add_computed_column() persists data to your table.
Once you’re satisfied, .add_computed_column() uses the same expression
but adds it as a persistent column in your table. Now the transformation
runs on all rows and results are stored permanently.