Computed Columns
Pixeltable Fundamentals
Section 2: Computed Columns
Welcome to Section 2 of the Pixeltable Fundamentals tutorial, Computed Columns.
In the previous section, Tables and Data Operations, we learned how to create tables, populate them with data, and query and manipulate their contents. In this section, we'll introduce one of Pixeltable's most essential and powerful concepts: computed columns. We'll learn how to:
- Add computed columns to a table
- Use computed columns for complex operations such as image processing and model inference
If you are running this tutorial in Colab:
In order to make the tutorial run a bit snappier, let's switch to a GPU-equipped instance for this Colab session. To do that, click on the Runtime -> Change runtime type
menu item at the top, then select the GPU
radio button and click on Save
.
Next, let's ensure the Pixeltable library is installed in your environment, along with the Huggingface transformers
library, which we'll need for this tutorial section.
%pip install -qU pixeltable torch transformers
Computed Columns
Let's start with a simple example that illustrates the basic concepts behind computed columns. We'll use a table of world population data for our example. Remember that you can import datasets into a Pixeltable table by providing a URL or file path to pxt.io.import_csv()
.
import pixeltable as pxt
pxt.drop_dir('fundamentals', force=True)
pxt.create_dir('fundamentals')
pop_t = pxt.io.import_csv(
'fundamentals.population',
'https://github.com/pixeltable/pixeltable/raw/main/docs/source/data/world-population-data.csv'
)
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `fundamentals`.
Created table `population`.
Inserting rows into `population`: 234 rows [00:00, 18059.27 rows/s]
Inserted 234 rows with 0 errors.
Also recall that pop_t.head()
returns the first few rows of a table, and typing the table name pop_t
by itself gives the schema.
pop_t.head(5)
cca3 | country | continent | pop_2023 | pop_2022 | pop_2000 | area__km__ |
---|---|---|---|---|---|---|
IND | India | Asia | 1428627663 | 1417173173 | 1059633675 | 3287590. |
CHN | China | Asia | 1425671352 | 1425887337 | 1264099069 | 9706961. |
USA | United States | North America | 339996563 | 338289857 | 282398554 | 9372610. |
IDN | Indonesia | Asia | 277534122 | 275501339 | 214072421 | 1904569. |
PAK | Pakistan | Asia | 240485658 | 235824862 | 154369924 | 881912. |
pop_t
Column Name | Type | Computed With |
---|---|---|
cca3 | string | |
country | string | |
continent | string | |
pop_2023 | int | |
pop_2022 | int | |
pop_2000 | int | |
area__km__ | float |
Now let's suppose we want to add a new column for the year-over-year population change from 2022 to 2023. In the previous tutorial section, Tables and Data Operations, we saw how one might select()
such a quantity into a Pixeltable DataFrame
, giving it the name yoy_change
(year-over-year change):
pop_t.select(pop_t.country, yoy_change=(pop_t.pop_2023 - pop_t.pop_2022)).head(5)
country | yoy_change |
---|---|
India | 11454490 |
China | -215985 |
United States | 1706706 |
Indonesia | 2032783 |
Pakistan | 4660796 |
A computed column is a way of turning such a selection into a new, permanent column of the table. Here's how it works:
pop_t.add_column(yoy_change=(pop_t.pop_2023 - pop_t.pop_2022))
Added 234 column values with 0 errors.
UpdateStatus(num_rows=234, num_computed_values=234, num_excs=0, updated_cols=[], cols_with_excs=[])
As soon as the column is added, Pixeltable will (by default) automatically compute its value for all rows in the table, storing the results in the new column. If we now inspect the schema of pop_t
, we see the new column and its definition.
pop_t
Column Name | Type | Computed With |
---|---|---|
cca3 | string | |
country | string | |
continent | string | |
pop_2023 | int | |
pop_2022 | int | |
pop_2000 | int | |
area__km__ | float | |
yoy_change | int | pop_2023 - pop_2022 |
The new column can be queried in the usual manner.
pop_t.select(pop_t.country, pop_t.yoy_change).head(5)
country | yoy_change |
---|---|
India | 11454490 |
China | -215985 |
United States | 1706706 |
Indonesia | 2032783 |
Pakistan | 4660796 |
The output is identical to the previous example, but now we're retrieving the computed output from the database, instead of computing it on-the-fly.
Computed columns can be "chained" with other computed columns. Here's an example that expresses population change as a percentage:
pop_t.add_column(yoy_percent_change=(100 * pop_t.yoy_change / pop_t.pop_2022))
Added 234 column values with 0 errors.
UpdateStatus(num_rows=234, num_computed_values=234, num_excs=0, updated_cols=[], cols_with_excs=[])
pop_t
Column Name | Type | Computed With |
---|---|---|
cca3 | string | |
country | string | |
continent | string | |
pop_2023 | int | |
pop_2022 | int | |
pop_2000 | int | |
area__km__ | float | |
yoy_change | int | pop_2023 - pop_2022 |
yoy_percent_change | float | (100 * yoy_change) / pop_2022 |
pop_t.select(pop_t.country, pop_t.yoy_change, pop_t.yoy_percent_change).head(5)
country | yoy_change | yoy_percent_change |
---|---|---|
India | 11454490 | 0.808 |
China | -215985 | -0.015 |
United States | 1706706 | 0.505 |
Indonesia | 2032783 | 0.738 |
Pakistan | 4660796 | 1.976 |
Although computed columns appear superficially similar to DataFrames, there is a key difference. Because computed columns are a permanent part of the table, they will be automatically updated any time new data is added to the table. These updates will propagate through any other computed columns that are "downstream" of the new data, ensuring that the state of the entire data is kept up-to-date.
In traditional data workflows, it is commonplace to recompute entire pipelines when the input dataset is changed or enlarged. In Pixeltable, by contrast, all updates are applied incrementally. When new data appear in a table or existing data are altered, Pixeltable will recompute only those rows that are dependent on the changed data.
Let's see how this works in practice. For purposes of illustration, we'll add an entry for California to the table, as if it were a country.
pop_t.insert(
country='California',
pop_2023=39110000,
pop_2022=39030000,
)
Computing cells: 0%| | 0/5 [00:00<?, ? cells/s]
Inserting rows into `population`: 1 rows [00:00, 253.11 rows/s]
Computing cells: 100%|βββββββββββββββββββββββββββββββββββββββββββ| 5/5 [00:00<00:00, 648.51 cells/s]
Inserted 1 row with 0 errors.
UpdateStatus(num_rows=1, num_computed_values=5, num_excs=0, updated_cols=[], cols_with_excs=[])
Observe that the computed columns yoy_growth
and yoy_percent_growth
have been automatically updated in response to the new data.
pop_t.tail(5)
cca3 | country | continent | pop_2023 | pop_2022 | pop_2000 | area__km__ | yoy_change | yoy_percent_change |
---|---|---|---|---|---|---|---|---|
FLK | Falkland Islands | South America | 3791 | 3780 | 3080.0 | 12173. | 11 | 0.291 |
NIU | Niue | Oceania | 1935 | 1934 | 2074.0 | 261. | 1 | 0.052 |
TKL | Tokelau | Oceania | 1893 | 1871 | 1666.0 | 12. | 22 | 1.176 |
VAT | Vatican City | Europe | 518 | 510 | 651.0 | 0.44 | 8 | 1.569 |
None | California | None | 39110000 | 39030000 | NaN | NaN | 80000 | 0.205 |
Remember that all tables in Pixeltable are persistent. This includes computed columns: when you create a computed column, its definition is stored in the database. You can think of computed columns as setting up a persistent compute workflow: if you close your notebook or restart your Python instance, computed columns (along with the relationships between them, and any data contained in them) will be preserved.
A More Complex Example: Image Processing
In the Tables and Data Operations tutorial, we saw how media data such as images can be inserted into Pixeltable tables, alongside more traditional structured data. Let's explore another example that uses computed columns for image processing operations.
In this example, we'll create the table directly by providing a schema, rather than importing it from a CSV like before.
t = pxt.create_table('fundamentals.image_ops', {'source': pxt.ImageType()})
Created table `image_ops`.
url_prefix = 'https://github.com/pixeltable/pixeltable/raw/main/docs/source/data/images'
images = ['000000000139.jpg', '000000000632.jpg', '000000000872.jpg']
t.insert({'source': f'{url_prefix}/{image}'} for image in images)
Inserting rows into `image_ops`: 3 rows [00:00, 974.44 rows/s]
Inserted 3 rows with 0 errors.
UpdateStatus(num_rows=3, num_computed_values=0, num_excs=0, updated_cols=[], cols_with_excs=[])
t.collect()
What are some things we might want to do with these images? A fairly basic one is to extract metadata. Pixeltable provides the built-in UDF get_metadata()
, which returns a dictionary with various metadata about the image. Let's go ahead and make this a computed column.
get_metadata()
function isn't user-defined, it's built in to the Pixeltable library. But we'll consistently refer to Pixeltable functions as "UDFs" in order to clearly distinguish them from ordinary Python functions. Later in this tutorial, we'll see how to turn (almost) any Python function into a Pixeltable UDF.
t.add_column(metadata=t.source.get_metadata())
t.collect()
Computing cells: 100%|βββββββββββββββββββββββββββββββββββββββββββ| 3/3 [00:00<00:00, 116.21 cells/s]
Added 3 column values with 0 errors.
Image operations, of course, can also return new images. In the next example, we use a different notation for adding a new computed column, which is sometimes more convenient. Its meaning is exactly identical to add_column()
.
t['rotated'] = t.source.rotate(10)
Computing cells: 100%|ββββββββββββββββββββββββββββββββββββββββββββ| 3/3 [00:00<00:00, 11.94 cells/s]
Added 3 column values with 0 errors.
t.collect()
Or, perhaps we want to rotate our images and fill them in with a transparent background rather than black. We can do this by chaining image operations, adding a transparency layer before doing the rotation.
t['rotated_transparent'] = t.source.convert('RGBA').rotate(10)
t.collect()
Computing cells: 100%|ββββββββββββββββββββββββββββββββββββββββββββ| 3/3 [00:00<00:00, 8.45 cells/s]
Added 3 column values with 0 errors.
In addition to
get_metadata()
,convert()
, androtate()
, Pixeltable has a sizable library of other common image operations that can be used as UDFs in computed columns. For the most part, the image UDFs are analogs of the operations provided by the Pillow library (in fact, Pixeltable is just using Pillow under the covers). You can read more about the provided image (and other) UDFs in the Pixeltable API Documentation.
Let's have a look at our table schema.
t
Column Name | Type | Computed With |
---|---|---|
source | image | |
metadata | json | source.get_metadata() |
rotated | image | source.rotate(10) |
rotated_transparent | image(mode=RGBA) | source.convert('RGBA').rotate(10) |
Image Detection
In addition to simple operations like rotate()
and convert()
, the Pixeltable API includes UDFs for various off-the-shelf image models. Let's look at one example: object detection using the ResNet-50 model. Model inference is a UDF too, and it can be inserted into a computed column like any other.
This one may take a little more time to compute, since it involves first downloading the ResNet-50 model (if it isn't already cached), then running inference on the images in our table.
from pixeltable.functions.huggingface import detr_for_object_detection
t['detections'] = detr_for_object_detection(t.source, model_id='facebook/detr-resnet-50', threshold=0.8)
Computing cells: 100%|ββββββββββββββββββββββββββββββββββββββββββββ| 3/3 [00:01<00:00, 2.41 cells/s]
Added 3 column values with 0 errors.
t.select(t.source, t.detections).collect()
It's great that the DETR model gave us so much information about the images, but it's not exactly in human-readable form. Those are JSON structures that encode bounding boxes, confidence scores, and categories for each detected object. Let's do something more useful with them: we'll use Pixeltable's draw_bounding_boxes()
API to superimpose bounding boxes on the images, using different colors to distinguish different object categories.
from pixeltable.functions.vision import draw_bounding_boxes
t['image_with_bb'] = draw_bounding_boxes(t.source, t.detections.boxes, t.detections.label_text, fill=True)
t.select(t.source, t.image_with_bb).collect()
Computing cells: 100%|ββββββββββββββββββββββββββββββββββββββββββββ| 3/3 [00:00<00:00, 9.46 cells/s]
Added 3 column values with 0 errors.
It can be a little hard to see what's going on, so let's zoom in on just one image. If you select a single image in a notebook, Pixeltable will enlarge its display:
t.select(t.image_with_bb).head(1)
Let's check in on our schema. We now have five computed columns, all derived from the single source column.
t
Column Name | Type | Computed With |
---|---|---|
source | image | |
metadata | json | source.get_metadata() |
rotated | image | source.rotate(10) |
rotated_transparent | image(mode=RGBA) | source.convert('RGBA').rotate(10) |
detections | json | detr_for_object_detection(source, threshold=0.8, model_id='facebook/detr-resnet-50') |
image_with_bb | image | draw_bounding_boxes(source, detections.boxes, detections.label_text, fill=True) |
And as always, when we add new data to the table, its computed columns are updated automatically. Let's try this on a few more images.
more_images = ['000000000108.jpg', '000000000885.jpg']
t.insert({'source': f'{url_prefix}/{image}'} for image in more_images)
Computing cells: 100%|ββββββββββββββββββββββββββββββββββββββββββ| 10/10 [00:01<00:00, 6.69 cells/s]
Inserting rows into `image_ops`: 2 rows [00:00, 1247.56 rows/s]
Computing cells: 100%|ββββββββββββββββββββββββββββββββββββββββββ| 10/10 [00:01<00:00, 5.55 cells/s]
Inserted 2 rows with 0 errors.
UpdateStatus(num_rows=2, num_computed_values=10, num_excs=0, updated_cols=[], cols_with_excs=[])
t.select(t.source, t.image_with_bb, t.detections.label_text, t.metadata).tail(2)
It bears repeating that Pixeltable is persistent! Anything you put into a table, including computed columns, will be saved in persistent storage. This includes inference outputs such as t.detections
, as well as generated images such as t.image_with_bb
. (Later we'll see how to tune this behavior in cases where it might be undesirable to store everything, but the default behavior is that computed column output is always persisted.)
Expressions
Let's have a closer look at that call to draw_bounding_boxes()
in the last example.
draw_bounding_boxes(t.source, t.detections.boxes, t.detections.label_text, fill=True)
There are a couple of things going on. draw_bounding_boxes()
is, of course, a UDF, and its first argument is a column reference of the sort we've used many times now: t.source
, the source image. The other two arguments are more than simple column references, though: they're compound expressions that include the column reference t.detections
along with a suffix (.boxes
or .label_text
) that tells Pixeltable to look inside the dictionary stored in t.detections
.
These are all examples of Pixeltable expressions. In fact, we've seen other types of Pixeltable expressions as well, without explicitly calling them out:
- Calls to a UDF are expressions, such as
t.source.rotate(10)
, or thedraw_bounding_boxes()
example above; - Arithmetic operations are expressions, such as year-over-year calculation in our first example:
100 * pop_t.yoy_change / pop_t.pop_2022
.
Updated 5 days ago