Open in Kaggle  Open in Colab  Download Notebook
This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

Problem

You need to iterate on transformation logic before running it on your entire dataset—especially for expensive operations like API calls or model inference.

Solution

What’s in this recipe:

  • Test transformations on sample rows before applying to your full dataset
  • Save expressions as variables to guarantee consistent logic
  • Apply the iterate-then-add workflow with built-in functions, expressions, and custom UDFs

You test transformation logic on sample rows before processing your entire dataset using the iterate-then-add workflow. This lets you validate logic on a few rows before committing to your full table. You use .select() with .collect() to preview transformations—nothing is stored in your table. If you want to collect only the first few rows, use .head(n) instead of .collect(). Once you’re satisfied with the results, use .add_computed_column() with the same expression to persist the transformation across your full table.

This workflow applies to any data type in Pixeltable: images, videos, audio files, documents, and structured tabular data. This recipe uses text data and shows three examples:
  1. Testing built-in functions on sample data
  2. Saving expressions as variables to ensure consistency
  3. Iterating with custom user-defined functions (UDFs)

Setup

%pip install -qU pixeltable
import pixeltable as pxt

Create sample data

# Create a fresh directory (drop existing if present)
pxt.drop_dir('demo_project', force=True)
pxt.create_dir('demo_project')
Created directory ‘demo_project’.
<pixeltable.catalog.dir.Dir at 0x157b96250>
t = pxt.create_table('demo_project.lyrics', {'text': pxt.String})
Created table ‘lyrics’.
t.insert([
    {'text': 'Tumble out of bed and I stumble to the kitchen'},
    {'text': 'Pour myself a cup of ambition'},
    {'text': 'And yawn and stretch and try to come to life'},
    {'text': "Jump in the shower and the blood starts pumpin'"},
    {'text': "Out on the street, the traffic starts jumpin'"},
    {'text': 'With folks like me on the job from nine to five'}
])
Inserting rows into `lyrics`: 6 rows [00:00, 2755.78 rows/s]
Inserted 6 rows with 0 errors.
6 rows inserted, 6 values computed.

Example 1: Built-in functions

Iterate with built-in functions, then add to the table.
# Test uppercase transformation on subset
t.select(
    t.text,
    uppercase=t.text.upper()
).head(2)
# Confirm the transformation was only in memory—table unchanged
t.head(2)
# Apply to all rows (same expression)
t.add_computed_column(uppercase=t.text.upper())
Added 6 column values with 0 errors.
6 rows updated, 12 values computed.
# View text with uppercase column
t.collect()

Example 2: Save and reuse expressions

Save an expression as a variable to guarantee the same logic in both iterate and add steps.
# Define the expression once - no duplication
char_count_expr = t.text.len()

# Iterate: Test on subset
t.select(
    t.text,
    char_count=char_count_expr
).head(2)
# Confirm the transformation was only in memory—table unchanged
t.head(2)
# Add: Use the SAME expression to persist
t.add_computed_column(char_count=char_count_expr)
Added 6 column values with 0 errors.
6 rows updated, 12 values computed.
# View text with char_count column
t.collect()
This pattern works with any expression:

  • Built-in functions: resize_expr = t.image.resize((224, 224))
  • UDFs: watermark_expr = add_watermark(t.image, '© 2024')
  • Chained operations: processed_expr = t.image.resize((224, 224)).rotate(90)

Benefits:

  • Write the expression once, use it twice
  • No copy-paste—reuse the same logic
  • Easy to iterate: change in one place, test again
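The same single-definition principle can be sketched in plain Python (a hypothetical stand-in for Pixeltable expressions, not the Pixeltable API): define the transformation once, then reuse it in both the preview step and the full pass.

```python
# Plain-Python analogy (hypothetical helper, not a Pixeltable API):
# one transformation definition serves both the preview and the full pass.
def char_count(text: str) -> int:
    return len(text)

rows = [
    'Tumble out of bed and I stumble to the kitchen',
    'Pour myself a cup of ambition',
]

preview = [char_count(r) for r in rows[:1]]  # "iterate": sample only
full = [char_count(r) for r in rows]         # "add": same logic, all rows
print(preview, full)  # [46] [46, 29]
```

Because both steps call the same function, there is no risk of the preview and the persisted column drifting apart.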

Example 3: Custom UDF

Iterate with a user-defined function, then add to the table.
# Define a custom transformation
@pxt.udf
def word_count(text: str) -> int:
    return len(text.split())
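Because the body of a @pxt.udf-decorated function is ordinary Python, you can sanity-check the logic directly on a sample string before using it in a query. A minimal check, using one of the lyric lines inserted above:

```python
# The decorated function's logic is plain Python, so we can verify
# the word-counting behavior on a sample string first.
def word_count(text: str) -> int:
    return len(text.split())

print(word_count('Pour myself a cup of ambition'))  # 6
```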
# Iterate: Test UDF on subset
t.select(
    t.text,
    word_count=word_count(t.text)
).head(2)
# Confirm the transformation was only in memory—table unchanged
t.head(2)
# Add: Apply to all rows (same expression)
t.add_computed_column(word_count=word_count(t.text))
Added 6 column values with 0 errors.
6 rows updated, 12 values computed.
# View text with word_count column
t.collect()

Explanation

How the iterate-then-add workflow works: Queries and computed columns serve different purposes. Queries let you test transformations on sample rows without storing anything. Once you’re satisfied with the results, you use the exact same expression with .add_computed_column() to persist it across your entire table.

This workflow is especially valuable for expensive operations—API calls, model inference, complex image processing—where you want to validate logic before processing your full dataset. Test on 2-3 rows to catch errors early, then commit once.

To customize this workflow:
  • Sample size: Use .head(n) to collect only the first n rows—.head(1) for single-row testing, .head(10) for broader validation, or .collect() to collect all rows
  • Save expressions: Store transformations as variables (Example 2) to guarantee identical logic in both iterate and add steps
  • Chain transformations: Test multiple operations together—.select(t.text.upper().split()) works just like single operations
  • Use with any data type: This pattern works with images, videos, audio, documents—not just text. For multimodal data, visual inspection during iteration is especially valuable
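As a plain-Python illustration of the chaining bullet above: standard string methods compose left to right in exactly the way the t.text.upper().split() expression does, so you can reason about a chained expression one step at a time.

```python
# Chained operations apply left to right: uppercase first, then split,
# mirroring the t.text.upper().split() expression shown above.
line = 'With folks like me on the job from nine to five'
print(line.upper().split()[:4])  # ['WITH', 'FOLKS', 'LIKE', 'ME']
```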
The Pixeltable workflow: In traditional databases, .select() just picks which columns to view. In Pixeltable, .select() also lets you compute new transformations on the fly—define new columns without storing them. This makes .select() perfect for testing transformations before you commit them.

When you use .select(), you’re creating a query. Queries are temporary operations that retrieve and transform data from tables—they don’t store anything. Queries use lazy evaluation, meaning they don’t execute until you call .collect(). You must use .collect() to execute the query and return results. .head(n) is a convenience method that collects only the first n rows instead of all rows. Use .head(n) when iterating to get fast feedback without processing your entire dataset.

Nothing is stored in your table when you run queries. You can test different approaches quickly without affecting your data. You can store query results in a Python variable to work with them in your session.
# Store query results as a variable (in memory only)
results = t.select(
    t.text,
    uppercase=t.text.upper()  # Label the transformed column
).head(3)
These results are stored in memory and will not persist across sessions—only .add_computed_column() persists data to your table. Once you’re satisfied, .add_computed_column() uses the same expression but adds it as a persistent column in your table. Now the transformation runs on all rows and results are stored permanently.

See also