Computed Columns

Kaggle Colab Download Notebook

Pixeltable Fundamentals

Section 2: Computed Columns

Welcome to Section 2 of the Pixeltable Fundamentals tutorial, Computed Columns.

In the previous section, Tables and Data Operations, we learned how to create tables, populate them with data, and query and manipulate their contents. In this section, we'll introduce one of Pixeltable's most essential and powerful concepts: computed columns. We'll learn how to:

  • Add computed columns to a table
  • Use computed columns for complex operations such as image processing and model inference

If you are running this tutorial in Colab:
In order to make the tutorial run a bit snappier, let's switch to a GPU-equipped instance for this Colab session. To do that, click on the Runtime -> Change runtime type menu item at the top, then select the GPU radio button and click on Save.

Next, let's ensure the Pixeltable library is installed in your environment, along with the Huggingface transformers library, which we'll need for this tutorial section.

%pip install -qU pixeltable torch transformers

Computed Columns

Let's start with a simple example that illustrates the basic concepts behind computed columns. We'll use a table of world population data for our example. Remember that you can import datasets into a Pixeltable table by providing a URL or file path to pxt.io.import_csv().

import pixeltable as pxt

pxt.drop_dir('fundamentals', force=True)
pxt.create_dir('fundamentals')
pop_t = pxt.io.import_csv(
    'fundamentals.population',
    'https://github.com/pixeltable/pixeltable/raw/main/docs/source/data/world-population-data.csv'
)
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `fundamentals`.
Created table `population`.
Inserting rows into `population`: 234 rows [00:00, 18059.27 rows/s]
Inserted 234 rows with 0 errors.

Also recall that pop_t.head() returns the first few rows of a table, and typing the table name pop_t by itself gives the schema.

pop_t.head(5)
cca3 country continent pop_2023 pop_2022 pop_2000 area__km__
IND India Asia 1428627663 1417173173 1059633675 3287590.
CHN China Asia 1425671352 1425887337 1264099069 9706961.
USA United States North America 339996563 338289857 282398554 9372610.
IDN Indonesia Asia 277534122 275501339 214072421 1904569.
PAK Pakistan Asia 240485658 235824862 154369924 881912.
pop_t
Column Name Type Computed With
cca3 string
country string
continent string
pop_2023 int
pop_2022 int
pop_2000 int
area__km__ float

Now let's suppose we want to add a new column for the year-over-year population change from 2022 to 2023. In the previous tutorial section, Tables and Data Operations, we saw how one might select() such a quantity into a Pixeltable DataFrame, giving it the name yoy_change (year-over-year change):

pop_t.select(pop_t.country, yoy_change=(pop_t.pop_2023 - pop_t.pop_2022)).head(5)
country yoy_change
India 11454490
China -215985
United States 1706706
Indonesia 2032783
Pakistan 4660796

A computed column is a way of turning such a selection into a new, permanent column of the table. Here's how it works:

pop_t.add_column(yoy_change=(pop_t.pop_2023 - pop_t.pop_2022))
Added 234 column values with 0 errors.
UpdateStatus(num_rows=234, num_computed_values=234, num_excs=0, updated_cols=[], cols_with_excs=[])

As soon as the column is added, Pixeltable will (by default) automatically compute its value for all rows in the table, storing the results in the new column. If we now inspect the schema of pop_t, we see the new column and its definition.

pop_t
Column Name Type Computed With
cca3 string
country string
continent string
pop_2023 int
pop_2022 int
pop_2000 int
area__km__ float
yoy_change int pop_2023 - pop_2022

The new column can be queried in the usual manner.

pop_t.select(pop_t.country, pop_t.yoy_change).head(5)
country yoy_change
India 11454490
China -215985
United States 1706706
Indonesia 2032783
Pakistan 4660796

The output is identical to the previous example, but now we're retrieving the computed output from the database, instead of computing it on-the-fly.

Computed columns can be "chained" with other computed columns. Here's an example that expresses population change as a percentage:

pop_t.add_column(yoy_percent_change=(100 * pop_t.yoy_change / pop_t.pop_2022))
Added 234 column values with 0 errors.
UpdateStatus(num_rows=234, num_computed_values=234, num_excs=0, updated_cols=[], cols_with_excs=[])
pop_t
Column Name Type Computed With
cca3 string
country string
continent string
pop_2023 int
pop_2022 int
pop_2000 int
area__km__ float
yoy_change int pop_2023 - pop_2022
yoy_percent_change float (100 * yoy_change) / pop_2022
pop_t.select(pop_t.country, pop_t.yoy_change, pop_t.yoy_percent_change).head(5)
country yoy_change yoy_percent_change
India 11454490 0.808
China -215985 -0.015
United States 1706706 0.505
Indonesia 2032783 0.738
Pakistan 4660796 1.976

Although computed columns appear superficially similar to DataFrames, there is a key difference. Because computed columns are a permanent part of the table, they will be automatically updated any time new data is added to the table. These updates will propagate through any other computed columns that are "downstream" of the new data, ensuring that the state of the entire data is kept up-to-date.

πŸ“˜

In traditional data workflows, it is commonplace to recompute entire pipelines when the input dataset is changed or enlarged. In Pixeltable, by contrast, all updates are applied incrementally. When new data appear in a table or existing data are altered, Pixeltable will recompute only those rows that are dependent on the changed data.

Let's see how this works in practice. For purposes of illustration, we'll add an entry for California to the table, as if it were a country.

pop_t.insert(
    country='California',
    pop_2023=39110000,
    pop_2022=39030000,
)
Computing cells:   0%|                                                    | 0/5 [00:00<?, ? cells/s]
Inserting rows into `population`: 1 rows [00:00, 253.11 rows/s]
Computing cells: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5/5 [00:00<00:00, 648.51 cells/s]
Inserted 1 row with 0 errors.
UpdateStatus(num_rows=1, num_computed_values=5, num_excs=0, updated_cols=[], cols_with_excs=[])

Observe that the computed columns yoy_growth and yoy_percent_growth have been automatically updated in response to the new data.

pop_t.tail(5)
cca3 country continent pop_2023 pop_2022 pop_2000 area__km__ yoy_change yoy_percent_change
FLK Falkland Islands South America 3791 3780 3080.0 12173. 11 0.291
NIU Niue Oceania 1935 1934 2074.0 261. 1 0.052
TKL Tokelau Oceania 1893 1871 1666.0 12. 22 1.176
VAT Vatican City Europe 518 510 651.0 0.44 8 1.569
None California None 39110000 39030000 NaN NaN 80000 0.205

πŸ“˜

Remember that all tables in Pixeltable are persistent. This includes computed columns: when you create a computed column, its definition is stored in the database. You can think of computed columns as setting up a persistent compute workflow: if you close your notebook or restart your Python instance, computed columns (along with the relationships between them, and any data contained in them) will be preserved.

A More Complex Example: Image Processing

In the Tables and Data Operations tutorial, we saw how media data such as images can be inserted into Pixeltable tables, alongside more traditional structured data. Let's explore another example that uses computed columns for image processing operations.

In this example, we'll create the table directly by providing a schema, rather than importing it from a CSV like before.

t = pxt.create_table('fundamentals.image_ops', {'source': pxt.ImageType()})
Created table `image_ops`.
url_prefix = 'https://github.com/pixeltable/pixeltable/raw/main/docs/source/data/images'
images = ['000000000139.jpg', '000000000632.jpg', '000000000872.jpg']
t.insert({'source': f'{url_prefix}/{image}'} for image in images)
Inserting rows into `image_ops`: 3 rows [00:00, 974.44 rows/s]
Inserted 3 rows with 0 errors.
UpdateStatus(num_rows=3, num_computed_values=0, num_excs=0, updated_cols=[], cols_with_excs=[])
t.collect()

What are some things we might want to do with these images? A fairly basic one is to extract metadata. Pixeltable provides the built-in UDF get_metadata(), which returns a dictionary with various metadata about the image. Let's go ahead and make this a computed column.

"UDF" is standard terminology in databases, meaning "User-Defined Function". Technically speaking, the get_metadata() function isn't user-defined, it's built in to the Pixeltable library. But we'll consistently refer to Pixeltable functions as "UDFs" in order to clearly distinguish them from ordinary Python functions. Later in this tutorial, we'll see how to turn (almost) any Python function into a Pixeltable UDF.
t.add_column(metadata=t.source.get_metadata())
t.collect()
Computing cells: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 116.21 cells/s]
Added 3 column values with 0 errors.

Image operations, of course, can also return new images. In the next example, we use a different notation for adding a new computed column, which is sometimes more convenient. Its meaning is exactly identical to add_column().

t['rotated'] = t.source.rotate(10)
Computing cells: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 11.94 cells/s]
Added 3 column values with 0 errors.
t.collect()

Or, perhaps we want to rotate our images and fill them in with a transparent background rather than black. We can do this by chaining image operations, adding a transparency layer before doing the rotation.

t['rotated_transparent'] = t.source.convert('RGBA').rotate(10)
t.collect()
Computing cells: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00,  8.45 cells/s]
Added 3 column values with 0 errors.

ℹ️

In addition to get_metadata(), convert(), and rotate(), Pixeltable has a sizable library of other common image operations that can be used as UDFs in computed columns. For the most part, the image UDFs are analogs of the operations provided by the Pillow library (in fact, Pixeltable is just using Pillow under the covers). You can read more about the provided image (and other) UDFs in the Pixeltable API Documentation.

Let's have a look at our table schema.

t
Column Name Type Computed With
source image
metadata json source.get_metadata()
rotated image source.rotate(10)
rotated_transparent image(mode=RGBA) source.convert('RGBA').rotate(10)

Image Detection

In addition to simple operations like rotate() and convert(), the Pixeltable API includes UDFs for various off-the-shelf image models. Let's look at one example: object detection using the ResNet-50 model. Model inference is a UDF too, and it can be inserted into a computed column like any other.

This one may take a little more time to compute, since it involves first downloading the ResNet-50 model (if it isn't already cached), then running inference on the images in our table.

from pixeltable.functions.huggingface import detr_for_object_detection

t['detections'] = detr_for_object_detection(t.source, model_id='facebook/detr-resnet-50', threshold=0.8)
Computing cells: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:01<00:00,  2.41 cells/s]
Added 3 column values with 0 errors.
t.select(t.source, t.detections).collect()

It's great that the DETR model gave us so much information about the images, but it's not exactly in human-readable form. Those are JSON structures that encode bounding boxes, confidence scores, and categories for each detected object. Let's do something more useful with them: we'll use Pixeltable's draw_bounding_boxes() API to superimpose bounding boxes on the images, using different colors to distinguish different object categories.

from pixeltable.functions.vision import draw_bounding_boxes

t['image_with_bb'] = draw_bounding_boxes(t.source, t.detections.boxes, t.detections.label_text, fill=True)
t.select(t.source, t.image_with_bb).collect()
Computing cells: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00,  9.46 cells/s]
Added 3 column values with 0 errors.

It can be a little hard to see what's going on, so let's zoom in on just one image. If you select a single image in a notebook, Pixeltable will enlarge its display:

t.select(t.image_with_bb).head(1)

Let's check in on our schema. We now have five computed columns, all derived from the single source column.

t
Column Name Type Computed With
source image
metadata json source.get_metadata()
rotated image source.rotate(10)
rotated_transparent image(mode=RGBA) source.convert('RGBA').rotate(10)
detections json detr_for_object_detection(source, threshold=0.8, model_id='facebook/detr-resnet-50')
image_with_bb image draw_bounding_boxes(source, detections.boxes, detections.label_text, fill=True)

And as always, when we add new data to the table, its computed columns are updated automatically. Let's try this on a few more images.

more_images = ['000000000108.jpg', '000000000885.jpg']
t.insert({'source': f'{url_prefix}/{image}'} for image in more_images)
Computing cells: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10/10 [00:01<00:00,  6.69 cells/s]
Inserting rows into `image_ops`: 2 rows [00:00, 1247.56 rows/s]
Computing cells: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10/10 [00:01<00:00,  5.55 cells/s]
Inserted 2 rows with 0 errors.
UpdateStatus(num_rows=2, num_computed_values=10, num_excs=0, updated_cols=[], cols_with_excs=[])
t.select(t.source, t.image_with_bb, t.detections.label_text, t.metadata).tail(2)

It bears repeating that Pixeltable is persistent! Anything you put into a table, including computed columns, will be saved in persistent storage. This includes inference outputs such as t.detections, as well as generated images such as t.image_with_bb. (Later we'll see how to tune this behavior in cases where it might be undesirable to store everything, but the default behavior is that computed column output is always persisted.)

Expressions

Let's have a closer look at that call to draw_bounding_boxes() in the last example.

draw_bounding_boxes(t.source, t.detections.boxes, t.detections.label_text, fill=True)

There are a couple of things going on. draw_bounding_boxes() is, of course, a UDF, and its first argument is a column reference of the sort we've used many times now: t.source, the source image. The other two arguments are more than simple column references, though: they're compound expressions that include the column reference t.detections along with a suffix (.boxes or .label_text) that tells Pixeltable to look inside the dictionary stored in t.detections.

These are all examples of Pixeltable expressions. In fact, we've seen other types of Pixeltable expressions as well, without explicitly calling them out:

  • Calls to a UDF are expressions, such as t.source.rotate(10), or the draw_bounding_boxes() example above;
  • Arithmetic operations are expressions, such as year-over-year calculation in our first example: 100 * pop_t.yoy_change / pop_t.pop_2022.