Computed Columns

Pixeltable Fundamentals

Section 2: Computed Columns

Welcome to Section 2 of the Pixeltable Fundamentals tutorial, Computed Columns.

In the previous section, Tables and Data Operations, we learned how to create tables, populate them with data, and query and manipulate their contents. In this section, we'll introduce one of Pixeltable's most essential and powerful concepts: computed columns. We'll learn how to:

Add computed columns to a table
Use computed columns for complex operations such as image processing and model inference

If you are running this tutorial in Colab:
In order to make the tutorial run a bit snappier, let's switch to a GPU-equipped instance for this Colab session. To do that, click on the Runtime -> Change runtime type menu item at the top, then select the GPU radio button and click on Save.

Next, let's ensure the Pixeltable library is installed in your environment, along with the Huggingface transformers library, which we'll need for this tutorial section.

%pip install -qU pixeltable torch transformers

Computed Columns

Let's start with a simple example that illustrates the basic concepts behind computed columns. We'll use a table of world population data for our example. Remember that you can import datasets into a Pixeltable table by providing a URL or file path to pxt.io.import_csv().

import pixeltable as pxt

pxt.drop_dir('fundamentals', force=True)
pxt.create_dir('fundamentals')
pop_t = pxt.io.import_csv(
    'fundamentals.population',
    'https://github.com/pixeltable/pixeltable/raw/main/docs/source/data/world-population-data.csv'
)

Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `fundamentals`.
Created table `population`.
Inserting rows into `population`: 234 rows [00:00, 18059.27 rows/s]
Inserted 234 rows with 0 errors.

Also recall that pop_t.head() returns the first few rows of a table, and typing the table name pop_t by itself gives the schema.

pop_t.head(5)

cca3	country	continent	pop_2023	pop_2022	pop_2000	area__km__
IND	India	Asia	1428627663	1417173173	1059633675	3287590.
CHN	China	Asia	1425671352	1425887337	1264099069	9706961.
USA	United States	North America	339996563	338289857	282398554	9372610.
IDN	Indonesia	Asia	277534122	275501339	214072421	1904569.
PAK	Pakistan	Asia	240485658	235824862	154369924	881912.

pop_t

Column Name	Type	Computed With
cca3	string
country	string
continent	string
pop_2023	int
pop_2022	int
pop_2000	int
area__km__	float

Now let's suppose we want to add a new column for the year-over-year population change from 2022 to 2023. In the previous tutorial section, Tables and Data Operations, we saw how one might select() such a quantity into a Pixeltable DataFrame, giving it the name yoy_change (year-over-year change):

pop_t.select(pop_t.country, yoy_change=(pop_t.pop_2023 - pop_t.pop_2022)).head(5)

country	yoy_change
India	11454490
China	-215985
United States	1706706
Indonesia	2032783
Pakistan	4660796

A computed column is a way of turning such a selection into a new, permanent column of the table. Here's how it works:

pop_t.add_computed_column(yoy_change=(pop_t.pop_2023 - pop_t.pop_2022))

Added 234 column values with 0 errors.
UpdateStatus(num_rows=234, num_computed_values=234, num_excs=0, updated_cols=[], cols_with_excs=[])

As soon as the column is added, Pixeltable will (by default) automatically compute its value for all rows in the table, storing the results in the new column. If we now inspect the schema of pop_t, we see the new column and its definition.

pop_t

Column Name	Type	Computed With
cca3	string
country	string
continent	string
pop_2023	int
pop_2022	int
pop_2000	int
area__km__	float
yoy_change	int	pop_2023 - pop_2022

The new column can be queried in the usual manner.

pop_t.select(pop_t.country, pop_t.yoy_change).head(5)

country	yoy_change
India	11454490
China	-215985
United States	1706706
Indonesia	2032783
Pakistan	4660796

The output is identical to the previous example, but now we're retrieving the computed output from the database, instead of computing it on-the-fly.

Computed columns can be "chained" with other computed columns. Here's an example that expresses population change as a percentage:

pop_t.add_computed_column(
    yoy_percent_change=(100 * pop_t.yoy_change / pop_t.pop_2022)
)

Added 234 column values with 0 errors.
UpdateStatus(num_rows=234, num_computed_values=234, num_excs=0, updated_cols=[], cols_with_excs=[])

pop_t

Column Name	Type	Computed With
cca3	string
country	string
continent	string
pop_2023	int
pop_2022	int
pop_2000	int
area__km__	float
yoy_change	int	pop_2023 - pop_2022
yoy_percent_change	float	(100 * yoy_change) / pop_2022

pop_t.select(pop_t.country, pop_t.yoy_change, pop_t.yoy_percent_change).head(5)

country	yoy_change	yoy_percent_change
India	11454490	0.808
China	-215985	-0.015
United States	1706706	0.505
Indonesia	2032783	0.738
Pakistan	4660796	1.976

Although computed columns appear superficially similar to DataFrames, there is a key difference. Because computed columns are a permanent part of the table, they will be automatically updated any time new data is added to the table. These updates will propagate through any other computed columns that are "downstream" of the new data, ensuring that the state of the entire data is kept up-to-date.

📘
In traditional data workflows, it is commonplace to recompute entire pipelines when the input dataset is changed or enlarged. In Pixeltable, by contrast, all updates are applied incrementally. When new data appear in a table or existing data are altered, Pixeltable will recompute only those rows that are dependent on the changed data.

Let's see how this works in practice. For purposes of illustration, we'll add an entry for California to the table, as if it were a country.

pop_t.insert(
    country='California',
    pop_2023=39110000,
    pop_2022=39030000,
)

Computing cells:   0%|                                                    | 0/5 [00:00<?, ? cells/s]
Inserting rows into `population`: 1 rows [00:00, 253.11 rows/s]
Computing cells: 100%|███████████████████████████████████████████| 5/5 [00:00<00:00, 648.51 cells/s]
Inserted 1 row with 0 errors.
UpdateStatus(num_rows=1, num_computed_values=5, num_excs=0, updated_cols=[], cols_with_excs=[])

Observe that the computed columns yoy_growth and yoy_percent_growth have been automatically updated in response to the new data.

pop_t.tail(5)

cca3	country	continent	pop_2023	pop_2022	pop_2000	area__km__	yoy_change	yoy_percent_change
FLK	Falkland Islands	South America	3791	3780	3080.0	12173.	11	0.291
NIU	Niue	Oceania	1935	1934	2074.0	261.	1	0.052
TKL	Tokelau	Oceania	1893	1871	1666.0	12.	22	1.176
VAT	Vatican City	Europe	518	510	651.0	0.44	8	1.569
None	California	None	39110000	39030000	NaN	NaN	80000	0.205

📘
Remember that all tables in Pixeltable are persistent. This includes computed columns: when you create a computed column, its definition is stored in the database. You can think of computed columns as setting up a persistent compute workflow: if you close your notebook or restart your Python instance, computed columns (along with the relationships between them, and any data contained in them) will be preserved.

A More Complex Example: Image Processing

In the Tables and Data Operations tutorial, we saw how media data such as images can be inserted into Pixeltable tables, alongside more traditional structured data. Let's explore another example that uses computed columns for image processing operations.

In this example, we'll create the table directly by providing a schema, rather than importing it from a CSV like before.

t = pxt.create_table('fundamentals.image_ops', {'source': pxt.Image})

Created table `image_ops`.

url_prefix = 'https://github.com/pixeltable/pixeltable/raw/main/docs/source/data/images'
images = ['000000000139.jpg', '000000000632.jpg', '000000000872.jpg']
t.insert({'source': f'{url_prefix}/{image}'} for image in images)

Inserting rows into `image_ops`: 3 rows [00:00, 974.44 rows/s]
Inserted 3 rows with 0 errors.
UpdateStatus(num_rows=3, num_computed_values=0, num_excs=0, updated_cols=[], cols_with_excs=[])

t.collect()

What are some things we might want to do with these images? A fairly basic one is to extract metadata. Pixeltable provides the built-in UDF get_metadata(), which returns a dictionary with various metadata about the image. Let's go ahead and make this a computed column.

"UDF" is standard terminology in databases, meaning "User-Defined Function". Technically speaking, the get_metadata() function isn't user-defined, it's built in to the Pixeltable library. But we'll consistently refer to Pixeltable functions as "UDFs" in order to clearly distinguish them from ordinary Python functions. Later in this tutorial, we'll see how to turn (almost) any Python function into a Pixeltable UDF.

t.add_computed_column(metadata=t.source.get_metadata())
t.collect()

Computing cells: 100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 116.21 cells/s]
Added 3 column values with 0 errors.

Image operations, of course, can also return new images. In the next example, we use a different notation for adding a new computed column, which is sometimes more convenient. Its meaning is exactly identical to add_column().

t.add_computed_column(rotated=t.source.rotate(10))

Computing cells: 100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 11.94 cells/s]
Added 3 column values with 0 errors.

t.collect()

Or, perhaps we want to rotate our images and fill them in with a transparent background rather than black. We can do this by chaining image operations, adding a transparency layer before doing the rotation.

t.add_computed_column(rotated_transparent=t.source.convert('RGBA').rotate(10))
t.collect()

Computing cells: 100%|████████████████████████████████████████████| 3/3 [00:00<00:00,  8.45 cells/s]
Added 3 column values with 0 errors.

ℹ️
In addition to get_metadata(), convert(), and rotate(), Pixeltable has a sizable library of other common image operations that can be used as UDFs in computed columns. For the most part, the image UDFs are analogs of the operations provided by the Pillow library (in fact, Pixeltable is just using Pillow under the covers). You can read more about the provided image (and other) UDFs in the Pixeltable API Documentation.

Let's have a look at our table schema.

Column Name	Type	Computed With
source	image
metadata	json	source.get_metadata()
rotated	image	source.rotate(10)
rotated_transparent	image(mode=RGBA)	source.convert('RGBA').rotate(10)

Image Detection

In addition to simple operations like rotate() and convert(), the Pixeltable API includes UDFs for various off-the-shelf image models. Let's look at one example: object detection using the ResNet-50 model. Model inference is a UDF too, and it can be inserted into a computed column like any other.

This one may take a little more time to compute, since it involves first downloading the ResNet-50 model (if it isn't already cached), then running inference on the images in our table.

from pixeltable.functions.huggingface import detr_for_object_detection

t.add_computed_column(detections=detr_for_object_detection(
    t.source,
    model_id='facebook/detr-resnet-50',
    threshold=0.8
))

Computing cells: 100%|████████████████████████████████████████████| 3/3 [00:01<00:00,  2.41 cells/s]
Added 3 column values with 0 errors.

t.select(t.source, t.detections).collect()

It's great that the DETR model gave us so much information about the images, but it's not exactly in human-readable form. Those are JSON structures that encode bounding boxes, confidence scores, and categories for each detected object. Let's do something more useful with them: we'll use Pixeltable's draw_bounding_boxes() API to superimpose bounding boxes on the images, using different colors to distinguish different object categories.

from pixeltable.functions.vision import draw_bounding_boxes

t.add_computed_column(image_with_bb=draw_bounding_boxes(
    t.source, t.detections.boxes, t.detections.label_text, fill=True
))
t.select(t.source, t.image_with_bb).collect()

Computing cells: 100%|████████████████████████████████████████████| 3/3 [00:00<00:00,  9.46 cells/s]
Added 3 column values with 0 errors.

It can be a little hard to see what's going on, so let's zoom in on just one image. If you select a single image in a notebook, Pixeltable will enlarge its display:

t.select(t.image_with_bb).head(1)

Let's check in on our schema. We now have five computed columns, all derived from the single source column.

Column Name	Type	Computed With
source	image
metadata	json	source.get_metadata()
rotated	image	source.rotate(10)
rotated_transparent	image(mode=RGBA)	source.convert('RGBA').rotate(10)
detections	json	detr_for_object_detection(source, threshold=0.8, model_id='facebook/detr-resnet-50')
image_with_bb	image	draw_bounding_boxes(source, detections.boxes, detections.label_text, fill=True)

And as always, when we add new data to the table, its computed columns are updated automatically. Let's try this on a few more images.

more_images = ['000000000108.jpg', '000000000885.jpg']
t.insert({'source': f'{url_prefix}/{image}'} for image in more_images)

Computing cells: 100%|██████████████████████████████████████████| 10/10 [00:01<00:00,  6.69 cells/s]
Inserting rows into `image_ops`: 2 rows [00:00, 1247.56 rows/s]
Computing cells: 100%|██████████████████████████████████████████| 10/10 [00:01<00:00,  5.55 cells/s]
Inserted 2 rows with 0 errors.
UpdateStatus(num_rows=2, num_computed_values=10, num_excs=0, updated_cols=[], cols_with_excs=[])

t.select(t.source, t.image_with_bb, t.detections.label_text, t.metadata).tail(2)

It bears repeating that Pixeltable is persistent! Anything you put into a table, including computed columns, will be saved in persistent storage. This includes inference outputs such as t.detections, as well as generated images such as t.image_with_bb. (Later we'll see how to tune this behavior in cases where it might be undesirable to store everything, but the default behavior is that computed column output is always persisted.)

Expressions

Let's have a closer look at that call to draw_bounding_boxes() in the last example.

draw_bounding_boxes(t.source, t.detections.boxes, t.detections.label_text, fill=True)

There are a couple of things going on. draw_bounding_boxes() is, of course, a UDF, and its first argument is a column reference of the sort we've used many times now: t.source, the source image. The other two arguments are more than simple column references, though: they're compound expressions that include the column reference t.detections along with a suffix (.boxes or .label_text) that tells Pixeltable to look inside the dictionary stored in t.detections.

These are all examples of Pixeltable expressions. In fact, we've seen other types of Pixeltable expressions as well, without explicitly calling them out:

Calls to a UDF are expressions, such as t.source.rotate(10), or the draw_bounding_boxes() example above;
Arithmetic operations are expressions, such as year-over-year calculation in our first example: 100 * pop_t.yoy_change / pop_t.pop_2022.