Tables

Learn more about Pixeltable tables and the data operations with our in-depth guide.

What are Tables?

Tables are the fundamental data storage units in Pixeltable. They function similarly to SQL database tables but with enhanced capabilities designed specifically for AI and ML workflows. Each table consists of columns with defined data types and can store both structured data and unstructured media assets. In Pixeltable, tables:

Persist across sessions, meaning your data remains available even after restarting your environment
Maintain strong typing for data consistency
Support operations like filtering, querying, and transformation
Can handle specialized data types for machine learning and media processing
Group logically into directories (namespaces) for organization

Creating a table requires defining a name and schema that describes its structure:

import pixeltable as pxt

# Create a directory to organize tables
pxt.create_dir('example')

# Create a table with a defined schema
films = pxt.create_table('example.films', {
    'title': pxt.String,
    'year': pxt.Int,
    'revenue': pxt.Float
})

Type System

# Schema definition
table = pxt.create_table('example', {
    'text': pxt.String,     # Text data
    'count': pxt.Int,       # Integer numbers
    'score': pxt.Float,     # Decimal numbers
    'active': pxt.Bool,     # Boolean values
    'created': pxt.Timestamp # Date/time values
})

Column Casting

Pixeltable allows you to explicitly cast column values to ensure they conform to the expected type. This is particularly useful when working with computed columns or transforming data from external sources.

# Cast columns to different types
table.update({
    'int_score': table.score.astype(pxt.Int),        # Cast float to integer
    'string_count': table.count.astype(pxt.String),  # Cast integer to string
})

# Using casting in computed columns
films.add_computed_column(
    budget_category=films.budget.astype(pxt.String) + ' million'
)

# Casting in expressions
films.where(films.revenue.astype(pxt.Int) > 100).collect()

Column casting helps maintain data consistency and prevents type errors when processing your data.

Data Operations

Query Operations

Filter and retrieve data:

# Basic row count
films.count()  # Returns total number of rows

# Basic filtering
films.where(films.budget >= 200.0).collect()

# Select specific columns
films.select(films.title, films.year).collect()

# Limit results
films.limit(5).collect()  # First 5 rows (no specific order)
films.head(5)  # First 5 rows by insertion order
films.tail(5)  # Last 5 rows by insertion order

# Order results
films.order_by(films.budget, asc=False).limit(5).collect()

String Operations

Manipulate text data:

# String contains
films.where(films.title.contains('Inception')).collect()

# String replacement
films.update({
    'plot': films.plot.replace('corporate secrets', 'subconscious secrets')
})

# String functions
films.update({
    'title': films.title.upper(),        # Convert to uppercase
    'length': films.title.len()          # Get string length
})

Insert Operations

Add new data:

# Insert single row
films.insert(
    title='Inside Out 2',
    year=2024,
    plot='Emotions navigate puberty',
    budget=200.0
)

# Insert multiple rows
films.insert([
    {
        'title': 'Jurassic Park', 
        'year': 1993, 
        'plot': 'Dinosaur theme park disaster',
        'budget': 63.0
    },
    {
        'title': 'Titanic', 
        'year': 1997, 
        'plot': 'Ill-fated ocean liner romance',
        'budget': 200.0
    }
])

Import from Source

Create tables or insert data directly from external sources:

import pixeltable as pxt
import pandas as pd

# Create a table from a CSV file
table = pxt.create_table('world_population', source='https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/world-population-data.csv')

# Create a table from a pandas DataFrame
df = pd.DataFrame({
    'cca3': ['FRA', 'DEU', 'ITA'],
    'country': ['France', 'Germany', 'Italy'],
    'continent': ['Europe', 'Europe', 'Europe'],
    'pop_2023': [68_000_000, 83_000_000, 59_000_000]
})
table = pxt.create_table('europe_population', source=df)

# Insert data from a pandas DataFrame into an existing table
new_df = pd.DataFrame({
    'cca3': ['ESP', 'GBR', 'POL'],
    'country': ['Spain', 'United Kingdom', 'Poland'],
    'continent': ['Europe', 'Europe', 'Europe'],
    'pop_2023': [47_000_000, 67_000_000, 38_000_000]
})
table.insert(new_df)

Pixeltable supports importing from various data sources:

CSV files (.csv)
Excel files (.xls, .xlsx)
Parquet files (.parquet, .pq, .parq)
JSON files (.json)
Pandas DataFrames
Pixeltable DataFrames
Hugging Face datasets

Update Operations

Modify existing data:

# Update all rows
films.update({
    'budget': films.budget * 1.1  # Increase all budgets by 10%
})

# Conditional updates
films.where(
    films.year < 2000
).update({
    'plot': films.plot + ' (Classic Film)'
})

# Batch updates for multiple rows
updates = [
    {'id': 1, 'budget': 175.0},
    {'id': 2, 'budget': 185.0}
]
films.batch_update(updates)

Delete Operations

Remove data with conditions:

# Delete specific rows
films.where(
    films.year < 1995
).delete()

# Delete with complex conditions
films.where(
    (films.budget < 100.0) & 
    (films.year < 2000)
).delete()

# WARNING: Delete all rows (use with caution!)
# films.delete()  # Without where clause deletes all rows

Column Operations

Manage table structure:

# Add new column
films.add_column(rating=pxt.String)

# Drop column
films.drop_column('rating')

# View schema
films.describe()

Versioning

Manage table versions:

# Revert the last operation
films.revert()  # Cannot be undone!

# Revert multiple times to go back further
films.revert()
films.revert()  # Goes back two operations

Export Operations

Extract data for analysis:

# Get results as Python objects
result = films.limit(5).collect()
first_row = result[0]  # Get first row as dict
timestamps = result['timestamp']  # Get list of values for one column

# Convert to Pandas
df = result
df['revenue'].describe()  # Get statistics for revenue column

Join Tables

Combine data from multiple tables using different join types.

import pixeltable as pxt

# Define the customers table
customers = pxt.create_table(
    "customers",
    {"customer_id": pxt.Int, "name": pxt.String, "total_spent": pxt.Float},
    if_exists="replace",
)

# Define the orders table
orders = pxt.create_table(
    "orders",
    {"order_id": pxt.Int, "customer_id": pxt.Int, "amount": pxt.Float},
    if_exists="replace",
)


# Populate the tables with sample data
customers.insert([
    {'customer_id': 1, 'name': 'Alice Johnson', 'total_spent': 250.0},
    {'customer_id': 2, 'name': 'Bob Smith', 'total_spent': 180.0},
    {'customer_id': 3, 'name': 'Carol White', 'total_spent': 320.0},
    {'customer_id': 4, 'name': 'David Brown', 'total_spent': 150.0},
    {'customer_id': 5, 'name': 'Eve Davis', 'total_spent': 90.0}
])

orders.insert([
    {'order_id': 101, 'customer_id': 1, 'amount': 75.0},
    {'order_id': 102, 'customer_id': 1, 'amount': 30.0},
    {'order_id': 103, 'customer_id': 2, 'amount': 120.0},
    {'order_id': 104, 'customer_id': 4, 'amount': 60.0}
])

Inner Join

Returns only matching records from both tables.

inner_join_result = customers.join(
    orders, 
    on=customers.customer_id == orders.customer_id, 
    how='inner'
).select(
    customers.name, 
    orders.amount
)
inner_df = inner_join_result.collect()
print(inner_df)
# Output will show only customers with matching orders (customer_id 1, 2, 4)

Left Outer Join

Returns all records from the left table and matching records from the right table.

left_join_result = customers.join(
    orders, 
    on=customers.customer_id == orders.customer_id, 
    how='left'
).select(
    customers.name, 
    orders.amount
)
left_df = left_join_result.collect()
print(left_df)
# Output will show all customers (1-5), with null for amount where no order exists

Right Outer Join

Returns all records from the right table and matching records from the left table.

right_join_result = customers.join(
    orders, 
    on=customers.customer_id == orders.customer_id, 
    how='right'
).select(
    customers.name, 
    orders.amount
)
right_df = right_join_result.collect()
print(right_df)
# Output will show all orders (order_id 101-104), with null for name where no customer exists

Cross Join

Returns all possible combinations of records from both tables.

cross_join_result = customers.join(
    orders, 
    how='cross'
).select(
    customers.name, 
    orders.amount
)
cross_df = cross_join_result.collect()
print(cross_df)
# Output will show 5 customers x 4 orders = 20 combinations

Best Practices

Schema Definition

Use clear naming for directories and tables
Document computed column dependencies

Application Code

Use get_table() to fetch existing tables
Use batch operations for multiple rows

Common Patterns

Development Workflow

Production Setup

# table.py - Run once to set up
pxt.create_table(..., if_exists="ignore")

# app.py - Production code
table = pxt.get_table("myapp.mytable")
if table is None:
    raise RuntimeError("Run table.py first!")

Additional Resources

API Documentation

Complete API reference

Examples

Sample workflows

Cheat Sheet

Quick reference

Remember that Pixeltable automatically handles versioning and lineage tracking. Every operation is recorded and can be reverted if needed.

Welcome to Pixeltable

Multimodal AI Datastore

Tutorials

Libraries

What are Tables?

Type System

Column Casting

Data Operations

Query Operations

String Operations

Insert Operations

Import from Source

Update Operations

Delete Operations

Column Operations

Versioning

Export Operations

Join Tables

Inner Join

Left Outer Join

Right Outer Join

Cross Join

Best Practices

Schema Definition

Application Code

Common Patterns

Additional Resources

API Documentation

Examples

Cheat Sheet

Welcome to Pixeltable

Multimodal AI Datastore

Tutorials

Libraries

​What are Tables?

​Type System

​Column Casting

​Data Operations

Query Operations

String Operations

Insert Operations

Import from Source

Update Operations

Delete Operations

Column Operations

Versioning

Export Operations

Join Tables

​Inner Join

​Left Outer Join

​Right Outer Join

​Cross Join

​Best Practices

Schema Definition

Application Code

​Common Patterns

​Additional Resources

API Documentation

Examples

Cheat Sheet

What are Tables?

Type System

Column Casting

Data Operations

Inner Join

Left Outer Join

Right Outer Join

Cross Join

Best Practices

Common Patterns

Additional Resources