Queries and Expressions

This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.

Section 3: Queries and Expressions

Welcome to Section 3 of the Pixeltable Fundamentals tutorial, Queries and Expressions. In the previous section of this tutorial, Computed Columns, we saw how to issue queries over Pixeltable tables, such as:

pop_t.select(yoy_change=(pop_t.pop_2023 - pop_t.pop_2022)).collect()

We also saw how to define computed columns that become part of the table and are updated automatically when new rows are inserted:

pop_t.add_column(yoy_change=(pop_t.pop_2023 - pop_t.pop_2022))

Both these examples use the Pixeltable expression pop_t.pop_2023 - pop_t.pop_2022. We’ve seen a number of other expressions as well, such as the chain of image operations

t.source.convert('RGBA').rotate(10)

and the model invocation

detr_for_object_detection(
    t.source,
    model_id='facebook/detr-resnet-50',
    threshold=0.8
)

Expressions are the basic building blocks of Pixeltable workloads. An expression can be included in a select() statement, which will cause it to be evaluated dynamically, or in an add_column() statement, which will add it to the table schema as a computed column. In this section, we’ll dive deeper into the different kinds of Pixeltable expressions and their uses. We’ll:

Understand the relationship between Pixeltable expressions and query execution
Survey the different types of expressions
Learn more about the Pixeltable type system

To get started, let’s import the necessary libraries for this tutorial and set up a demo directory.

%pip install -qU pixeltable datasets torch transformers

import pixeltable as pxt

pxt.drop_dir('demo', force=True)
pxt.create_dir('demo')

Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory ‘demo’.

In this tutorial we’re going to work with a subset of the MNIST dataset, a classic reference database of hand-drawn digits. A copy of the MNIST dataset is hosted on the Hugging Face datasets repository, so we can use create_table() with the source parameter to load it into a Pixeltable table.

import datasets

# Download the first 50 images of the MNIST dataset
ds = datasets.load_dataset('ylecun/mnist', split='train[:50]')

# Import them into a Pixeltable table
t = pxt.create_table('demo.mnist', source=ds)

Created table ‘mnist’.
Inserting rows into `mnist`: 50 rows [00:00, 7516.67 rows/s]
Inserted 50 rows with 0 errors.

t.head(5)

Column References

The most basic type of expression is a column reference: that’s what you get when you type, say, t.image. An expression such as t.image by itself is just a Python object; it doesn’t contain any actual data, and no data will be loaded until you use the expression in a select() query or add_column() statement. Here’s what we get if we type t.image by itself:

t.image

This is true of all Pixeltable expressions: we can freely create them and manipulate them in various ways, but no actual data will be loaded until we use them in a query.

JSON Collections (Dicts and Lists)

Data is commonly presented in JSON format: for example, API responses and model output often take the shape of JSON dictionaries or lists of dictionaries. Pixeltable has native support for JSON accessors. To demonstrate this, let’s add a computed column that runs an image classification model against the images in our dataset.

from pixeltable.functions.huggingface import vit_for_image_classification

t.add_computed_column(classification=vit_for_image_classification(
    t.image,
    model_id='farleyknight-org-username/vit-base-mnist'
))

Added 50 column values with 0 errors.
50 rows updated, 50 values computed.

t.select(t.image, t.classification).head(3)

We see that the output is returned as a dict containing three lists: the five most likely labels (classes) for the image, the corresponding text labels (in this case, just the string form of the class number), and the scores (confidences) of each prediction. The Pixeltable type of the classification column is pxt.Json:

Pixeltable provides a range of operators on Json-typed output that behave just as you’d expect. To look up a key in a dictionary, use the syntax t.classification['labels']:

t.select(t.classification['labels']).head(3)

You can also use a convenient “attribute” syntax for dictionary lookups. This follows the standard JSONPath expression syntax.

t.select(t.classification.labels).head(3)

The “attribute” syntax isn’t fully general (it won’t work for dictionary keys that are not valid Python identifiers), but it’s handy when it works. t.classification.labels is another Pixeltable expression; you can think of it as saying, “do the 'labels' lookup from every dictionary in the column t.classification, and return the result as a new column.” As before, the expression by itself contains no data; it’s the query that does the actual work of retrieving data. Here’s what we see if we just give the expression by itself, without a query:

t.classification.labels

classification.labels

Similarly, one can pull out a specific item in a list (for this model, we’re probably mostly interested in the first item anyway):

t.select(t.classification.labels[0]).head(3)

Or slice a list in the usual manner:

t.select(t.classification.labels[:2]).head(3)

Pixeltable is resilient against out-of-bounds indices or dictionary keys. If an index or key doesn’t exist for a particular row, you’ll get a None output for that row.

t.select(t.classification.not_a_key).head(3)

As always, any expression can be used to create a computed column.

# Use label_text to be consistent with t.label, which was given
# to us as a string

t.add_computed_column(pred_label=t.classification.label_text[0])
t

Added 50 column values with 0 errors.

Finally, just as it’s possible to extract items from lists and dictionaries using Pixeltable expressions, you can also construct new lists and dictionaries: just package them up in the usual way.

custom_dict = {
    # Keys must be strings; values can be any expressions
    'ground_truth': t.label,
    'prediction': t.pred_label,
    'is_correct': t.label == t.pred_label,
    # You can also use constants as values
    'engine': 'pixeltable',
}

t.select(t.image, custom_dict).head(5)

UDF Calls

UDF calls are another common type of expression. We’ve seen them throughout the tutorial; in fact, we used one a moment ago when we added a model invocation to our workload:

vit_for_image_classification(
    t.image,
    model_id='farleyknight-org-username/vit-base-mnist'
)

This calls the vit_for_image_classification UDF in the pxt.functions.huggingface module. Note that vit_for_image_classification is a Pixeltable UDF, not an ordinary Python function. (UDFs were first discussed in the Tables and Data Operations section of this tutorial.) You can think of a Pixeltable UDF as a function that operates on columns of data, iteratively applying an underlying operation to each row in the column (or columns). In this case, vit_for_image_classification operates on t.image, running the model against every image in the column. Notice that in addition to the column t.image, this call to vit_for_image_classification also takes a constant argument specifying the model_id. Any UDF call argument may be a constant, and the constant value simply means “use this value for every row being evaluated”. You can always compose Pixeltable expressions to form more complicated ones; here’s an example that runs the model against a 90-degree rotation of every image in the sample and extracts the label. Not surprisingly, the model doesn’t perform as well on the rotated images.

rot_model_result = vit_for_image_classification(
    t.image.rotate(90),
    model_id='farleyknight-org-username/vit-base-mnist'
)

t.select(t.image, rot_label=rot_model_result.labels[0]).head(5)   

Note that we employed a useful trick here: we assigned an expression to the variable rot_model_result for later reuse. Every Pixeltable expression is a Python object, so you can freely assign them to variables, reuse them, compose them, and so on. Remember that nothing actually happens until the expression is used in a query - so in this example, setting the variable rot_model_result doesn’t itself result in any data being retrieved; that only happens later, when we actually use it in the select() query.

There are a large number of built-in UDFs that ship with Pixeltable; you can always refer back to the SDK Documentation for details.

Method Calls

Many built-in UDFs allow a convenient alternate syntax. The following two expressions are exactly equivalent:

a = t.image.rotate(90)
b = pxt.functions.image.rotate(t.image, 90)

a and b can always be used interchangeably in queries, with identical results. Just like in standard Python classes, whenever Pixeltable sees the method call t.image.rotate(90), it interprets it as a function call pxt.functions.image.rotate(self, 90), with (in this case) self equal to t.image. Any method call can also be written as a function call, but (just like in standard Python) not every function call can be written as a method call. For example, the following won’t work:

t.image.vit_for_image_classification(
    model_id='farleyknight-org-username/vit-base-mnist'
)

That’s because vit_for_image_classification is part of the pxt.functions.huggingface module, not the core module pxt.functions.image. Most Pixeltable types have a corresponding core module of UDFs that can be used as method calls (pxt.functions.image for Image; pxt.functions.string for String; and so on), described fully in the SDK Documentation.

Arithmetic and Boolean Operations

Expressions can also be combined using standard arithmetic and boolean operators. As with everything else, arithmetic and boolean expressions are operations on columns that (when used in a query) are applied to every row.

t.select(
    t.image,
    t.label,
    t.label == '4',
    t.label < '5',
).head(5)

When you use a where clause in a query, you’re giving it a Pixeltable expression, too (a boolean-valued one).

t.where(t.label == '4').select(t.image).show()

The following example shows how boolean expressions can be assigned to variables and used to form more complex expressions.

# Reuse `rot_model_result` from above, extracting
# the dominant label as a new expression

rot_label = rot_model_result.label_text[0]

# Select all the rows where the ground truth label is '5',
# and the "rotated" version of the model got it wrong
# (by returning something other than a '5')

t.where((t.label == '5') & (rot_label != '5')).select(
    t.image, t.label, rot_label=rot_label
).show()

Notice that to form a logical “and”, we wrote

(t.label == '5') & (rot_label != '5')

using the operator & rather than and. Likewise, to form a logical “or”, we’d use | rather than or:

(t.label == '5') | (rot_label != '5')

For logical negation:

~(t.label == '5')

This follows the convention used by other popular data-manipulation frameworks such as Pandas, and it’s necessary because the Python language does not allow the meanings of and, or, and not to be customized. There is one more instance of this to be aware of: to check whether an expression is None, it’s necessary to write (say)

t.label == None

rather than t.label is None, for the same reason.

Arrays

In addition to lists and dicts, Pixeltable also has built-in support for numerical arrays. A typical place where arrays show up is as the output of an embedding.

from pixeltable.functions.huggingface import clip

# Add a computed column that computes a CLIP embedding for each image
t.add_computed_column(clip=clip(t.image, model_id='openai/clip-vit-base-patch32'))
t.select(t.image, t.clip).head(5)

Added 50 column values with 0 errors.

The underlying Python type of pxt.Array is an ordinary NumPy array (np.ndarray), so that an array-typed column is a column of NumPy arrays (in this example, representing the embedding output of each image in the table). As with lists, arrays can be sliced in all the usual ways.

t.select(t.clip[0], t.clip[5:10], t.clip[-3:]).head(5)

Ad hoc UDFs with `apply`

We’ve now seen the most commonly encountered Pixeltable expression types. There are a few other less commonly encountered expressions that are occasionally useful. You can use apply to map any Python function onto a column of data. You can think of apply as a quick way of constructing an “on-the-fly” UDF for one-off use.

import numpy as np

t.select(t.clip.apply(np.ndarray.dumps, col_type=pxt.String)).head(2)

Note, however, that if the function you’re applying doesn’t have type hints (as in the example here), you’ll need to specify the output column type explicitly.

Type Conversion with `astype`

Sometimes it’s useful to transform an expression of one type into a different type. For example, you can use astype to turn an expression of type pxt.Json into one of type pxt.String. This assumes that the value being converted is actually a string; otherwise, you’ll get an exception. Here’s an example:

# Select the text in position 0 of `t.classification.label_text`; since
# `t.classification.label_text` has type `pxt.Json`, so does
# `t.classification.label_text[0]`

t.classification.label_text[0].col_type

Optional[Json]

# Select the text in position 0 of `t.classification.label_text`, this time
# cast as a `pxt.String`

t.classification.label_text[0].astype(pxt.String).col_type

Optional[String]

Column Properties

Some ColumnRef expressions have additional useful properties. A media column (image, video, audio, or document) has the following two properties:

localpath: the media location on the local filesystem
fileurl: the original URL where the media resides (could be the same as localpath)

t.select(t.image, t.image.localpath).head(5)

Any computed column will have two additional properties, errortype and errormsg. These properties will usually be None. However, if the computed column was created with on_error='ignore' and an exception was encountered during column execution, then the properties will contain additional information about the exception. To demonstrate this feature, we’re going to deliberately trigger an exception in a computed column. The images in our example table are black and white, meaning they have only one color channel. If we try to extract a channel other than channel number 0, we’ll get an exception. Ordinarily when we call add_computed_column, the exception is raised and the add_computed_column operation is aborted.

t.add_computed_column(channel=t.image.getchannel(1))

Error: Error while evaluating computed column ‘channel’:
band index out of range
[0;31m---------------------------------------------------------------------------[0m
[0;31mValueError[0m                                Traceback (most recent call last)
File [0;32m~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/exec/expr_eval/evaluators.py:225[0m, in [0;36mFnCallEvaluator.eval[0;34m(self, call_args_batch)[0m
[1;32m    224[0m [38;5;28;01mtry[39;00m:
[0;32m—> 225[0m     item[38;5;241m.[39mrow[[38;5;28mself[39m[38;5;241m.[39mfn_call[38;5;241m.[39mslot_idx] [38;5;241m=[39m [38;5;28;43mself[39;49m[38;5;241;43m.[39;49m[43mscalar_py_fn[49m[43m([49m[38;5;241;43m[39;49m[43mitem[49m[38;5;241;43m.[39;49m[43margs[49m[43m,[49m[43m [49m[38;5;241;43m[39;49m[38;5;241;43m*[39;49m[43mitem[49m[38;5;241;43m.[39;49m[43mkwargs[49m[43m)[49m
[1;32m    226[0m [38;5;28;01mexcept[39;00m [38;5;167;01mException[39;00m [38;5;28;01mas[39;00m exc:File [0;32m/opt/miniconda3/envs/pxt/lib/python3.10/site-packages/PIL/Image.py:2682[0m, in [0;36mImage.getchannel[0;34m(self, channel)[0m
[1;32m   2680[0m         [38;5;28;01mraise[39;00m [38;5;167;01mValueError[39;00m(msg) [38;5;28;01mfrom[39;00m [38;5;21;01me[39;00m
[0;32m-> 2682[0m [38;5;28;01mreturn[39;00m [38;5;28mself[39m[38;5;241m.[39m_new([38;5;28;43mself[39;49m[38;5;241;43m.[39;49m[43mim[49m[38;5;241;43m.[39;49m[43mgetband[49m[43m([49m[43mchannel[49m[43m)[49m)[0;31mValueError[0m: band index out of rangeThe above exception was the direct cause of the following exception:[0;31mError[0m                                     Traceback (most recent call last)
Cell [0;32mIn[27], line 1[0m
[0;32m----> 1[0m [43mt[49m[38;5;241;43m.[39;49m[43madd_computed_column[49m[43m([49m[43mchannel[49m[38;5;241;43m=[39;49m[43mt[49m[38;5;241;43m.[39;49m[43mimage[49m[38;5;241;43m.[39;49m[43mgetchannel[49m[43m([49m[38;5;241;43m1[39;49m[43m)[49m[43m)[49mFile [0;32m~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/catalog/table.py:697[0m, in [0;36mTable.add_computed_column[0;34m(self, stored, destination, print_stats, on_error, if_exists, **kwargs)[0m
[1;32m    695[0m [38;5;28mself[39m[38;5;241m.[39m_verify_column(new_col)
[1;32m    696[0m [38;5;28;01massert[39;00m [38;5;28mself[39m[38;5;241m.[39m_tbl_version [38;5;129;01mis[39;00m [38;5;129;01mnot[39;00m [38;5;28;01mNone[39;00m
[0;32m—> 697[0m result [38;5;241m+[39m[38;5;241m=[39m [38;5;28;43mself[39;49m[38;5;241;43m.[39;49m[43m_tbl_version[49m[38;5;241;43m.[39;49m[43mget[49m[43m([49m[43m)[49m[38;5;241;43m.[39;49m[43madd_columns[49m[43m([49m[43m[[49m[43mnew_col[49m[43m][49m[43m,[49m[43m [49m[43mprint_stats[49m[38;5;241;43m=[39;49m[43mprint_stats[49m[43m,[49m[43m [49m[43mon_error[49m[38;5;241;43m=[39;49m[43mon_error[49m[43m)[49m
[1;32m    698[0m FileCache[38;5;241m.[39mget()[38;5;241m.[39memit_eviction_warnings()
[1;32m    699[0m [38;5;28;01mreturn[39;00m resultFile [0;32m~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/catalog/table_version.py:666[0m, in [0;36mTableVersion.add_columns[0;34m(self, cols, print_stats, on_error)[0m
[1;32m    664[0m         all_cols[38;5;241m.[39mappend(undo_col)
[1;32m    665[0m [38;5;66;03m# Add all columns[39;00m
[0;32m—> 666[0m status [38;5;241m=[39m [38;5;28;43mself[39;49m[38;5;241;43m.[39;49m[43m_add_columns[49m[43m([49m[43mall_cols[49m[43m,[49m[43m [49m[43mprint_stats[49m[38;5;241;43m=[39;49m[43mprint_stats[49m[43m,[49m[43m [49m[43mon_error[49m[38;5;241;43m=[39;49m[43mon_error[49m[43m)[49m
[1;32m    667[0m [38;5;66;03m# Create indices and their md records[39;00m
[1;32m    668[0m [38;5;28;01mfor[39;00m col, (idx, val_col, undo_col) [38;5;129;01min[39;00m index_cols[38;5;241m.[39mitems():File [0;32m~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/catalog/table_version.py:732[0m, in [0;36mTableVersion._add_columns[0;34m(self, cols, print_stats, on_error)[0m
[1;32m    730[0m plan[38;5;241m.[39mopen()
[1;32m    731[0m [38;5;28;01mtry[39;00m:
[0;32m—> 732[0m     excs_per_col [38;5;241m=[39m [38;5;28;43mself[39;49m[38;5;241;43m.[39;49m[43mstore_tbl[49m[38;5;241;43m.[39;49m[43mload_column[49m[43m([49m[43mcol[49m[43m,[49m[43m [49m[43mplan[49m[43m,[49m[43m [49m[43mon_error[49m[43m [49m[38;5;241;43m==[39;49m[43m [49m[38;5;124;43m’[39;49m[38;5;124;43mabort[39;49m[38;5;124;43m’[39;49m[43m)[49m
[1;32m    733[0m [38;5;28;01mexcept[39;00m sql_exc[38;5;241m.[39mDBAPIError [38;5;28;01mas[39;00m exc:
[1;32m    734[0m     Catalog[38;5;241m.[39mget()[38;5;241m.[39mconvert_sql_exc(exc, [38;5;28mself[39m[38;5;241m.[39mid, [38;5;28mself[39m[38;5;241m.[39mhandle, convert_db_excs[38;5;241m=[39m[38;5;28;01mTrue[39;00m)File [0;32m~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/store.py:247[0m, in [0;36mStoreBase.load_column[0;34m(self, col, exec_plan, abort_on_exc)[0m
[1;32m    245[0m [38;5;28;01mif[39;00m abort_on_exc [38;5;129;01mand[39;00m row[38;5;241m.[39mhas_exc():
[1;32m    246[0m     exc [38;5;241m=[39m row[38;5;241m.[39mget_first_exc()
[0;32m—> 247[0m     [38;5;28;01mraise[39;00m excs[38;5;241m.[39mError([38;5;124mf[39m[38;5;124m’[39m[38;5;124mError while evaluating computed column [39m[38;5;132;01m{[39;00mcol[38;5;241m.[39mname[38;5;132;01m!r}[39;00m[38;5;124m:[39m[38;5;130;01m\n[39;00m[38;5;132;01m{[39;00mexc[38;5;132;01m}[39;00m[38;5;124m’[39m) [38;5;28;01mfrom[39;00m [38;5;21;01mexc[39;00m
[1;32m    248[0m table_row, num_row_exc [38;5;241m=[39m row_builder[38;5;241m.[39mcreate_store_table_row(row, [38;5;28;01mNone[39;00m, row[38;5;241m.[39mpk)
[1;32m    249[0m num_excs [38;5;241m+[39m[38;5;241m=[39m num_row_exc[0;31mError[0m: Error while evaluating computed column ‘channel’:
band index out of range

But if we use on_error='ignore', the exception will be logged in the column properties instead.

t.add_computed_column(channel=t.image.getchannel(1), on_error='ignore')

Added 50 column values with 50 errors.
50 rows updated, 50 values computed, 50 exceptions.

Notice that the update status informs us that there were 50 errors. If we query the table, we see that the column contains only None values, but the errortype and errormsg fields contain details of the error.

t.select(t.image, t.channel, t.channel.errortype, t.channel.errormsg).head(5)

More details on Pixeltable’s error handling can be found in the Working with External Files how-to guide.

The Pixeltable Type System

We’ve seen that every column and every expression in Pixeltable has an associated Pixeltable type. In this section, we’ll briefly survey the various Pixeltable types and their uses. Here are all the supported types and their corresponding Python types:

The Python type is what you’ll get back if you query an expression of the given Pixeltable type. For pxt.Json, it can be any of str, int, float, bool, list, or dict.

pxt.Audio, pxt.Video, and pxt.Document all correspond to the Python type str. This is because those types are represented by file paths that reference the media in question. When you query for, say, t.select(t.video_col), you’re guaranteed to get a file path on the local filesystem (Pixeltable will download and cache a local copy of the video if necessary to ensure this). If you want the original URL, use t.video_col.fileurl instead.

Several types can be specialized to constrain the allowable data in a column.

pxt.Image can be specialized with a resolution and/or an image mode:
- pxt.Image[(300,200)] - images with width 300 and height 200
- pxt.Image['RGB'] - images with mode 'RGB'; see the PIL Documentation for the full list
- pxt.Image[(300,200), 'RGB'] - combines the above constraints
pxt.Array can be specialized with a shape and/or a dtype:
- pxt.Array[pxt.Float] - arrays with dtype pxt.Float
- pxt.Array[(64,64,3), pxt.Float] - 3-dimensional arrays with dtype pxt.Float and 64x64x3 shape

If we look at the structure of our table now, we see examples of specialized image and array types.

t.clip has type pxt.Array[(512,), pxt.Float], since the output of the embedding is always a 1x512 array. t.channel has type Image['L'], since it’s always an 'L' mode (1-channel) image.

You can freely use pxt.Image by itself to mean “any image, without constraints”, but numerical arrays must always specify a shape and a dtype; pxt.Array by itself will raise an error.

Array shapes follow standard numpy conventions: a shape is a tuple of integers, such as (512,) or (64,64,3). A None may be used in place of an integer to indicate an unconstrained size for that dimension, as in (None,None,3) (3-dimensional array with two unconstrained dimensions), or simply (None,) (unconstrained 1-dimensional array).

Welcome to Pixeltable

Concepts

Notebooks

Section 3: Queries and Expressions

Column References

JSON Collections (Dicts and Lists)

UDF Calls

Method Calls

Arithmetic and Boolean Operations

Arrays

Ad hoc UDFs with `apply`

Type Conversion with `astype`

Column Properties

The Pixeltable Type System

Welcome to Pixeltable

Concepts

Notebooks

​Section 3: Queries and Expressions

​Column References

​JSON Collections (Dicts and Lists)

​UDF Calls

​Method Calls

​Arithmetic and Boolean Operations

​Arrays

​Ad hoc UDFs with apply

​Type Conversion with astype

​Column Properties

​The Pixeltable Type System

Section 3: Queries and Expressions

Column References

JSON Collections (Dicts and Lists)

UDF Calls

Method Calls

Arithmetic and Boolean Operations

Arrays

Ad hoc UDFs with `apply`

Type Conversion with `astype`

Column Properties

The Pixeltable Type System