Problem
You need to iterate on transformation logic before running it on your
entire dataset—especially for expensive operations like API calls or
model inference.
Solution
What’s in this recipe:
- Test transformations on sample rows before applying to your full dataset
- Save expressions as variables to guarantee consistent logic
- Apply the iterate-then-add workflow with built-in functions, expressions, and custom UDFs
The iterate-then-add workflow lets you validate transformation logic on
a few sample rows before committing to processing your entire table.
You use .select() with .collect() to preview transformations—nothing
is stored in your table. If you want to collect only the first few rows,
use .head(n) instead of .collect(). Once you’re satisfied with the
results, use .add_computed_column() with the same expression to
persist the transformation across your full table.
This workflow applies to any data type in Pixeltable: images, videos,
audio files, documents, and structured tabular data. This recipe uses
text data and shows three examples:
- Testing built-in functions on sample data
- Saving expressions as variables to ensure consistency
- Iterating with custom user-defined functions (UDFs)
Setup
%pip install -qU pixeltable
Create sample data
import pixeltable as pxt

# Create a fresh directory (drop existing if present)
pxt.drop_dir('demo_project', force=True)
pxt.create_dir('demo_project')
Created directory ‘demo_project’.
<pixeltable.catalog.dir.Dir at 0x157b96250>
t = pxt.create_table('demo_project.lyrics', {'text': pxt.String})
Created table ‘lyrics’.
t.insert([
{'text': 'Tumble out of bed and I stumble to the kitchen'},
{'text': 'Pour myself a cup of ambition'},
{'text': 'And yawn and stretch and try to come to life'},
{'text': "Jump in the shower and the blood starts pumpin'"},
{'text': "Out on the street, the traffic starts jumpin'"},
{'text': 'With folks like me on the job from nine to five'}
])
Inserting rows into `lyrics`: 6 rows [00:00, 2755.78 rows/s]
Inserted 6 rows with 0 errors.
6 rows inserted, 6 values computed.
Example 1: Built-in functions
Iterate with built-in functions, then add to the table.
# Test uppercase transformation on subset
t.select(
t.text,
uppercase=t.text.upper()
).head(2)
# Confirm the transformation was only in memory—table unchanged
t.head(2)
# Apply to all rows (same expression)
t.add_computed_column(uppercase=t.text.upper())
Added 6 column values with 0 errors.
6 rows updated, 12 values computed.
# View text with uppercase column
t.collect()
Example 2: Save and reuse expressions
Save an expression as a variable to guarantee the same logic in both
iterate and add steps.
# Define the expression once - no duplication
char_count_expr = t.text.len()
# Iterate: Test on subset
t.select(
t.text,
char_count=char_count_expr
).head(2)
# Confirm the transformation was only in memory—table unchanged
t.head(2)
# Add: Use the SAME expression to persist
t.add_computed_column(char_count=char_count_expr)
Added 6 column values with 0 errors.
6 rows updated, 12 values computed.
# View text with char_count column
t.collect()
This pattern works with any expression:
- Built-in functions: resize_expr = t.image.resize((224, 224))
- UDFs: watermark_expr = add_watermark(t.image, '© 2024')
- Chained operations: processed_expr = t.image.resize((224, 224)).rotate(90)
Benefits:
- Write the expression once, use it twice
- No copy-paste: reuse the same logic
- Easy to iterate: change in one place, test again
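A plain-Python analogue of the save-and-reuse pattern (a sketch for intuition only; char_count here is an ordinary callable, not a Pixeltable expression):

```python
# Define the logic once...
def char_count(s: str) -> int:
    return len(s)

sample = ['Pour myself a cup of ambition']

# ...use it in the preview step...
preview = [char_count(s) for s in sample]
assert preview == [29]

# ...and reuse the identical logic, unchanged, in the full pass.
full = [char_count(s) for s in sample * 2]
```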
Example 3: Custom UDF
Iterate with a user-defined function, then add to the table.
# Define a custom transformation
@pxt.udf
def word_count(text: str) -> int:
    return len(text.split())
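Because the body of a @pxt.udf function is ordinary Python, one quick way to iterate, before touching the table at all, is to check the logic as a plain function on literal strings (a workflow suggestion, not a Pixeltable requirement):

```python
def word_count(text: str) -> int:
    return len(text.split())

# Sanity-check typical input and an edge case before decorating with @pxt.udf.
assert word_count('With folks like me on the job from nine to five') == 11
assert word_count('') == 0
```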
# Iterate: Test UDF on subset
t.select(
t.text,
word_count=word_count(t.text)
).head(2)
# Confirm the transformation was only in memory—table unchanged
t.head(2)
# Add: Apply to all rows (same expression)
t.add_computed_column(word_count=word_count(t.text))
Added 6 column values with 0 errors.
6 rows updated, 12 values computed.
# View text with word_count column
t.collect()
Explanation
How the iterate-then-add workflow works:
Queries and computed columns serve different purposes. Queries let you
test transformations on sample rows without storing anything. Once
you’re satisfied with the results, you use the exact same expression
with .add_computed_column() to persist it across your entire table.
This workflow is especially valuable for expensive operations—API calls,
model inference, complex image processing—where you want to validate
logic before processing your full dataset. Test on 2-3 rows to catch
errors early, then commit once.
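The cost argument can be sketched in plain Python (expensive_transform is a hypothetical stand-in for an API call or model-inference step, not a Pixeltable function):

```python
def expensive_transform(text: str) -> str:
    # Stand-in for a slow API call or model-inference step.
    return text.upper()

rows = [
    'Tumble out of bed and I stumble to the kitchen',
    'Pour myself a cup of ambition',
    'And yawn and stretch and try to come to life',
]

# Iterate: validate the logic on a two-row sample first.
sample = [expensive_transform(r) for r in rows[:2]]
assert all(r.isupper() for r in sample)

# Add: only after the sample looks right, pay the cost for every row.
full = [expensive_transform(r) for r in rows]
```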
To customize this workflow:
- Sample size: Use .head(n) to collect only the first n rows: .head(1)
  for single-row testing, .head(10) for broader validation, or
  .collect() to collect all rows
- Save expressions: Store transformations as variables (Example 2)
  to guarantee identical logic in both iterate and add steps
- Chain transformations: Test multiple operations together:
  .select(t.text.upper().split()) works just like single operations
- Use with any data type: This pattern works with images, videos,
  audio, and documents, not just text. For multimodal data, visual
  inspection during iteration is especially valuable
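Chained expressions mirror ordinary Python string methods, so you can sanity-check a chain on a literal string before putting it in .select() (a quick check outside Pixeltable, assuming String columns support upper() and split() as shown above):

```python
line = 'Pour myself a cup of ambition'

# Same chain as t.text.upper().split(), applied to a plain string.
assert line.upper().split() == ['POUR', 'MYSELF', 'A', 'CUP', 'OF', 'AMBITION']
```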
The Pixeltable workflow:
In traditional databases, .select() just picks which columns to view.
In Pixeltable, .select() also lets you compute new transformations on
the fly—define new columns without storing them. This makes .select()
perfect for testing transformations before you commit them.
When you use .select(), you’re creating a query. Queries are temporary
operations that retrieve and transform data from tables—they don’t store
anything. Queries use lazy evaluation: nothing executes until you call
.collect(), which runs the query and returns results. .head(n) is a
convenience method that collects only the first n rows instead of all
rows; use it when iterating to get fast feedback without processing
your entire dataset.
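Lazy evaluation here behaves like a Python generator (an analogy for intuition, not a description of Pixeltable internals): defining the pipeline does no work, and pulling only the first n results processes only n rows:

```python
import itertools

def pipeline(rows):
    for r in rows:
        # In a real query, this is where the transformation cost is paid.
        yield r.upper()

q = pipeline(['a', 'b', 'c'])             # like .select(): nothing runs yet
first_two = list(itertools.islice(q, 2))  # like .head(2): only two rows processed
assert first_two == ['A', 'B']
```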
Nothing is stored in your table when you run queries. You can test
different approaches quickly without affecting your data. You can store
query results in a Python variable to work with them in your session.
# Store query results as a variable (in memory only)
results = t.select(
t.text,
uppercase=t.text.upper() # Label the transformed column
).head(3)
These results are stored in memory and will not persist across
sessions—only .add_computed_column() persists data to your table.
Once you’re satisfied, .add_computed_column() uses the same expression
but adds it as a persistent column in your table. Now the transformation
runs on all rows and results are stored permanently.
See also