Pixeltable Fundamentals
Section 3: Queries and Expressions
Welcome to Section 3 of the Pixeltable Fundamentals tutorial, Queries and Expressions.
In the previous section of this tutorial, Computed Columns, we saw how to issue queries over Pixeltable tables, such as:
pop_t.select(yoy_change=(pop_t.pop_2023 - pop_t.pop_2022)).collect()
We also saw how to define computed columns that become part of the table and are updated automatically when new rows are inserted:
pop_t.add_column(yoy_change=(pop_t.pop_2023 - pop_t.pop_2022))
Both these examples use the Pixeltable expression `pop_t.pop_2023 - pop_t.pop_2022`. We've seen a number of other expressions as well, such as the chain of image operations
t.source.convert('RGBA').rotate(10)
and the model invocation
detr_for_object_detection(
t.source,
model_id='facebook/detr-resnet-50',
threshold=0.8
)
Expressions are the basic building blocks of Pixeltable workloads. An expression can be included in a `select()` statement, which will cause it to be evaluated dynamically, or in an `add_column()` statement, which will add it to the table schema as a computed column. In this section, we'll dive deeper into the different kinds of Pixeltable expressions and their uses. We'll:
- Understand the relationship between Pixeltable expressions and query execution
- Survey the different types of expressions
- Learn more about the Pixeltable type system
To get started, let's import the necessary libraries for this tutorial and set up a demo directory.
%pip install -qU pixeltable datasets
import pixeltable as pxt
pxt.drop_dir('demo', force=True)
pxt.create_dir('demo')
Connected to Pixeltable database at:
postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `demo`.
<pixeltable.catalog.dir.Dir at 0x3274086a0>
In this tutorial we're going to work with a subset of the MNIST dataset, a classic reference database of handwritten digits. A copy of the MNIST dataset is hosted on the Hugging Face datasets repository, so we can use Pixeltable's built-in Hugging Face data importer to load it into a Pixeltable table.
import datasets
# Download the first 50 images of the MNIST dataset
ds = datasets.load_dataset('ylecun/mnist', split='train[:50]')
# Import them into a Pixeltable table
t = pxt.io.import_huggingface_dataset('demo.mnist', ds)
Created table `mnist_tmp_69154325`.
Inserting rows into `mnist_tmp_69154325`: 50 rows [00:00, 15723.14 rows/s]
Inserted 50 rows with 0 errors.
t.head(5)
Column References
The most basic type of expression is a column reference: that's what you get when you type, say, `t.image`. An expression such as `t.image` by itself is just a Python object; it doesn't contain any actual data, and no data will be loaded until you use the expression in a `select()` query or `add_column()` statement. Here's what we get if we type `t.image` by itself:
t.image
Column Name | Type | Computed With |
---|---|---|
image | Image |
This is true of all Pixeltable expressions: we can freely create them and manipulate them in various ways, but no actual data will be loaded until we use them in a query.
JSON Collections (Dicts and Lists)
Data is commonly presented in JSON format: for example, API responses and model output often take the shape of JSON dictionaries or lists of dictionaries. Pixeltable has native support for JSON accessors. To demonstrate this, let's add a computed column that runs an image classification model against the images in our dataset.
from pixeltable.functions.huggingface import vit_for_image_classification
t.add_column(classification=vit_for_image_classification(
t.image,
model_id='farleyknight-org-username/vit-base-mnist'
))
Computing cells: 100%|██████████████████████████████████████████| 50/50 [00:01<00:00, 41.26 cells/s]
Added 50 column values with 0 errors.
UpdateStatus(num_rows=50, num_computed_values=50, num_excs=0, updated_cols=[], cols_with_excs=[])
t.select(t.image, t.classification).head(3)
We see that the output is returned as a dict containing three lists: the five most likely labels (classes) for the image, the corresponding text labels (in this case, just the string form of the class number), and the scores (confidences) of each prediction. The Pixeltable type of the `classification` column is `pxt.Json`:
t
Column Name | Type | Computed With |
---|---|---|
image | Image | |
label | String | |
classification | Json | vit_for_image_classification(image, model_id='farleyknight-org-username/vit-base-mnist') |
Pixeltable provides a range of operators on `Json`-typed output that behave just as you'd expect. To look up a key in a dictionary, use the syntax `t.classification['labels']`:
t.select(t.classification['labels']).head(3)
classification_labels |
---|
[5, 3, 2, 8, 7] |
[0, 6, 9, 8, 1] |
[4, 1, 9, 7, 0] |
You can also use a convenient "attribute" syntax for dictionary lookups. This follows the standard JSONPath expression syntax.
t.select(t.classification.labels).head(3)
classification_labels |
---|
[5, 3, 2, 8, 7] |
[0, 6, 9, 8, 1] |
[4, 1, 9, 7, 0] |
The "attribute" syntax isn't fully general (it won't work for dictionary keys that are not valid Python identifiers), but it's handy when it works.
`t.classification.labels` is another Pixeltable expression; you can think of it as saying, "do the `'labels'` lookup from every dictionary in the column `t.classification`, and return the result as a new column." As before, the expression by itself contains no data; it's the query that does the actual work of retrieving data. Here's what we see if we just give the expression by itself, without a query:
t.classification.labels
<pixeltable.exprs.json_path.JsonPath at 0x3c0730220>
Similarly, one can pull out a specific item in a list (for this model, we're probably mostly interested in the first item anyway):
t.select(t.classification.labels[0]).head(3)
classification_labels0 |
---|
5 |
0 |
4 |
Or slice a list in the usual manner:
t.select(t.classification.labels[:2]).head(3)
classification_labels2 |
---|
[5, 3] |
[0, 6] |
[4, 1] |
Pixeltable is resilient against out-of-bounds indices or dictionary keys. If an index or key doesn't exist for a particular row, you'll get a `None` output for that row.
t.select(t.classification.not_a_key).head(3)
classification_notakey |
---|
None |
None |
None |
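The row-wise semantics of these JSON accessors can be sketched in plain Python. This is an illustration of the behavior only (not Pixeltable's implementation), using a toy two-row `classification` column:

```python
# Illustration only: a plain-Python sketch of the row-wise semantics of
# Pixeltable's JSON accessors (not Pixeltable's actual implementation).

def json_lookup(rows, step):
    """Apply one accessor step (a dict key, list index, or slice) to every
    row, yielding None where the key or index doesn't exist."""
    out = []
    for value in rows:
        try:
            out.append(value[step])
        except (KeyError, IndexError, TypeError):
            out.append(None)  # out-of-bounds index or missing key -> None
    return out

# A toy 'classification' column: one dict per row
classification = [
    {'labels': [5, 3, 2, 8, 7]},
    {'labels': [0, 6, 9, 8, 1]},
]

labels = json_lookup(classification, 'labels')      # like t.classification.labels
first = json_lookup(labels, 0)                      # like t.classification.labels[0]
sliced = json_lookup(labels, slice(0, 2))           # like t.classification.labels[:2]
missing = json_lookup(classification, 'not_a_key')  # like t.classification.not_a_key
```

The `try/except` is what gives the `None`-instead-of-error behavior shown above.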
As always, any expression can be used to create a computed column.
# Use label_text to be consistent with t.label, which was given
# to us as a string
t.add_column(pred_label=t.classification.label_text[0])
t
Computing cells: 100%|████████████████████████████████████████| 50/50 [00:00<00:00, 6547.05 cells/s]
Added 50 column values with 0 errors.
Column Name | Type | Computed With |
---|---|---|
image | Image | |
label | String | |
classification | Json | vit_for_image_classification(image, model_id='farleyknight-org-username/vit-base-mnist') |
pred_label | Json | classification.label_text[0] |
Finally, just as it's possible to extract items from lists and dictionaries using Pixeltable expressions, you can also construct new lists and dictionaries: just package them up in the usual way.
custom_dict = {
# Keys must be strings; values can be any expressions
'ground_truth': t.label,
'prediction': t.pred_label,
'is_correct': t.label == t.pred_label,
# You can also use constants as values
'engine': 'pixeltable',
}
t.select(t.image, custom_dict).head(5)
UDF Calls
UDF calls are another common type of expression. We've seen them throughout the tutorial; in fact, we used one a moment ago when we added a model invocation to our workload:
vit_for_image_classification(
t.image,
model_id='farleyknight-org-username/vit-base-mnist'
)
This calls the `vit_for_image_classification` UDF in the `pxt.functions.huggingface` module. Note that `vit_for_image_classification` is a Pixeltable UDF, not an ordinary Python function. (UDFs were first discussed in the Tables and Data Operations section of this tutorial.) You can think of a Pixeltable UDF as a function that operates on columns of data, iteratively applying an underlying operation to each row in the column (or columns). In this case, `vit_for_image_classification` operates on `t.image`, running the model against every image in the column.
Notice that in addition to the column `t.image`, this call to `vit_for_image_classification` also takes a constant argument specifying the `model_id`. Any UDF call argument may be a constant; a constant value simply means "use this value for every row being evaluated".
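Conceptually, a UDF call maps its underlying function over rows, drawing column arguments from each row and reusing constant arguments unchanged. Here's a minimal plain-Python sketch of that idea (the `toy_classifier` and its `threshold` are hypothetical stand-ins, not part of Pixeltable):

```python
# Illustration only: how a Pixeltable UDF call conceptually maps over rows.
# Column arguments supply a per-row value; constant arguments are reused
# for every row. A sketch, not Pixeltable's execution engine.

def eval_udf_call(fn, column, **constants):
    """Apply `fn` to each value in `column`, passing `constants` unchanged."""
    return [fn(value, **constants) for value in column]

# A toy "model": classifies a number as small or large. The threshold plays
# the role of a constant argument like model_id.
def toy_classifier(x, threshold):
    return 'large' if x >= threshold else 'small'

result = eval_udf_call(toy_classifier, [1, 7, 4], threshold=5)
```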
You can always compose Pixeltable expressions to form more complicated ones; here's an example that runs the model against a 90-degree rotation of every image in the sample and extracts the label. Not surprisingly, the model doesn't perform as well on the rotated images.
rot_model_result = vit_for_image_classification(
t.image.rotate(90),
model_id='farleyknight-org-username/vit-base-mnist'
)
t.select(t.image, rot_label=rot_model_result.labels[0]).head(5)
Notice that we assigned the model invocation to the variable `rot_model_result` for later reuse. Every Pixeltable expression is a Python object, so you can freely assign expressions to variables, reuse them, compose them, and so on. Remember that nothing actually happens until the expression is used in a query; in this example, setting the variable `rot_model_result` doesn't itself result in any data being retrieved. That only happens later, when we actually use it in the `select()` query.
There are a large number of built-in UDFs that ship with Pixeltable; you can always refer back to the API Documentation for details.
Method Calls
Many built-in UDFs allow a convenient alternate syntax. The following two expressions are exactly equivalent:
a = t.image.rotate(90)
b = pxt.functions.image.rotate(t.image, 90)
`a` and `b` can always be used interchangeably in queries, with identical results. Just as in standard Python classes, whenever Pixeltable sees the method call `t.image.rotate(90)`, it interprets it as a function call `pxt.functions.image.rotate(self, 90)`, with (in this case) `self` equal to `t.image`.
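This is the same method/function duality that Python itself provides, which we can see with ordinary Python objects (the `Image` class below is a hypothetical toy, not PIL or Pixeltable):

```python
# Illustration only: Python treats obj.method(args) as
# type(obj).method(obj, args); Pixeltable's correspondence between
# t.image.rotate(90) and pxt.functions.image.rotate(t.image, 90) mirrors it.

a = 'hello'.upper()        # method-call syntax
b = str.upper('hello')     # equivalent function-call syntax

# The same duality with a user-defined toy class:
class Image:
    def __init__(self, angle=0):
        self.angle = angle
    def rotate(self, degrees):
        return Image(self.angle + degrees)

img = Image()
c = img.rotate(90).angle          # method call
d = Image.rotate(img, 90).angle   # equivalent function call on the class
```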
Any method call can also be written as a function call, but (just like in standard Python) not every function call can be written as a method call. For example, the following won't work:
t.image.vit_for_image_classification(
model_id='farleyknight-org-username/vit-base-mnist'
)
That's because `vit_for_image_classification` is part of the `pxt.functions.huggingface` module, not the core module `pxt.functions.image`. Most Pixeltable types have a corresponding core module of UDFs that can be used as method calls (`pxt.functions.image` for `Image`; `pxt.functions.string` for `String`; and so on), described fully in the API Documentation.
Arithmetic and Boolean Operations
Expressions can also be combined using standard arithmetic and boolean operators. As with everything else, arithmetic and boolean expressions are operations on columns that (when used in a query) are applied to every row.
t.select(
t.image,
t.label,
t.label == '4',
t.label < '5',
).head(5)
When you use a `where` clause in a query, you're giving it a Pixeltable expression, too (a boolean-valued one).
t.where(t.label == '4').select(t.image).show()
The following example shows how boolean expressions can be assigned to variables and used to form more complex expressions.
# Reuse `rot_model_result` from above, extracting
# the dominant label as a new expression
rot_label = rot_model_result.label_text[0]
# Select all the rows where the ground truth label is '5',
# and the "rotated" version of the model got it wrong
# (by returning something other than a '5')
t.where((t.label == '5') & (rot_label != '5')).select(
t.image, t.label, rot_label=rot_label
).show()
Notice that to form a logical "and", we wrote
(t.label == '5') & (rot_label != '5')
using the operator `&` rather than `and`. Likewise, to form a logical "or", we'd use `|` rather than `or`:
(t.label == '5') | (rot_label != '5')
For logical negation:
~(t.label == '5')
This follows the convention used by other popular data-manipulation frameworks such as Pandas, and it's necessary because the Python language does not allow the meanings of `and`, `or`, and `not` to be customized. There is one more instance of this to be aware of: to check whether an expression is `None`, it's necessary to write (say) `t.label == None` rather than `t.label is None`, for the same reason.
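The reason `&`, `|`, and `~` work is that Python lets a class overload those operators (along with `==` and `!=`), while `and`, `or`, `not`, and `is` are fixed language constructs. A minimal sketch of an expression class in this style (illustration only, not Pixeltable's `Expr` machinery):

```python
# Illustration only: a toy expression class showing why frameworks like
# Pixeltable and Pandas use &, |, ~ and ==. These operators can be
# overloaded; `and`, `or`, `not`, and `is` cannot.

class Expr:
    def __init__(self, fn):
        self.fn = fn  # row -> value

    def __eq__(self, other):   # t.label == '5'
        return Expr(lambda row: self.fn(row) == other)

    def __and__(self, other):  # expr1 & expr2
        return Expr(lambda row: self.fn(row) and other.fn(row))

    def __or__(self, other):   # expr1 | expr2
        return Expr(lambda row: self.fn(row) or other.fn(row))

    def __invert__(self):      # ~expr
        return Expr(lambda row: not self.fn(row))

label = Expr(lambda row: row['label'])  # stands in for t.label

# Building the predicate evaluates nothing; it just composes closures.
pred = (label == '5') & (~(label == '3'))

rows = [{'label': '5'}, {'label': '3'}, {'label': '7'}]
matches = [pred.fn(row) for row in rows]
```

Note that building `pred` does no work; evaluation only happens when the predicate is applied to rows, mirroring how Pixeltable expressions stay inert until used in a query.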
Arrays
In addition to lists and dicts, Pixeltable also has built-in support for numerical arrays. A typical place where arrays show up is as the output of an embedding.
from pixeltable.functions.huggingface import clip_image
t.add_column(clip=clip_image(t.image, model_id='openai/clip-vit-base-patch32'))
t.select(t.image, t.clip).head(5)
Computing cells: 100%|██████████████████████████████████████████| 50/50 [00:01<00:00, 41.41 cells/s]
Added 50 column values with 0 errors.
The underlying Python type of `pxt.Array` is an ordinary NumPy array (`np.ndarray`), so an array-typed column is a column of NumPy arrays (in this example, representing the embedding output of each image in the table). As with lists, arrays can be sliced in all the usual ways.
t.select(t.clip[0], t.clip[5:10], t.clip[-3:]).head(5)
col_0 | col_1 | col_2 |
---|---|---|
-0.052 | [-0.032 0.07 1.116 -0.037 -0.158] | [ 0.904 0.037 -0.161] |
0.063 | [-0.332 -0.201 1.023 -0.003 -0.094] | [ 0.823 -0.149 -0.163] |
-0.059 | [-0.233 0.238 0.864 -0.135 -0.017] | [ 0.765 -0.363 0.206] |
0.098 | [-0.107 0.038 1.081 -0.111 0.172] | [ 1.023 -0.136 0.092] |
-0.046 | [-0.115 0.066 1.089 -0.017 0.105] | [0.786 0.098 0.171] |
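Row by row, this is just ordinary NumPy indexing applied to each array in the column. A small sketch with random vectors standing in for the real `(512,)` CLIP embeddings (illustration only):

```python
# Illustration only: array slicing in a query applies standard NumPy
# indexing to the array in every row. We mimic the column with a list of
# small random vectors (shape (8,) instead of the real (512,)).
import numpy as np

rng = np.random.default_rng(0)
clip_column = [rng.standard_normal(8).astype(np.float32) for _ in range(3)]

first = [arr[0] for arr in clip_column]     # like t.clip[0]
middle = [arr[2:5] for arr in clip_column]  # like t.clip[5:10]
tail = [arr[-3:] for arr in clip_column]    # like t.clip[-3:]
```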
Ad hoc UDFs with apply
We've now seen the most commonly encountered Pixeltable expression types. A few others are occasionally useful. You can use `apply` to map any Python function onto a column of data; think of `apply` as a quick way of constructing an "on-the-fly" UDF for one-off use.
import numpy as np
t.select(t.clip.apply(np.ndarray.dumps, col_type=pxt.String)).head(2)
dumps |
---|
b'\x80\x02cnumpy.core.multiarray\n_reconstruct\nq\x00cnumpy\nndarray\nq\x01K\x00\x85q\x02c_codecs\nencode\nq\x03X\x01\x00\x00\x00bq\x04X\x06\x00\x00\x00latin1q\x05\x86q\x06Rq\x07\x87q\x08Rq\t(K\x01M\x00\x02\x85q\ncnumpy\ndtype\nq\x0bX\x02\x00\x00\x00f4q\x0c\x89\x88\x87q\rRq\x0e(K\x03X\x01\x00\x00\x00<q\x0fNNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tq\x10b\x89h\x03X\xf4\x0b\x00\x00\xc3\xbf\xc3\xacV\xc2\xbdl\xc2\xbe\xc2\x84\xc2\xbe@w\xc3\xb6\xc2\xbe(\xc2\xb0 \xc2\xbf*{Y=\xc3\x96s\x02\xc2\xbd\x11\xc2\x9e\xc2\x8e=v\xc3\xa9\xc2\x8e?"(\x18\xc2\xbdr\x05"\xc2\xbe\xc3\xbf1\x0f?6,\xc3\x84>\xc2\x8a^\xc3\xb3>\xc2\xa6lD\xc2\xbef5 \xc2\xbe\xc3\xb8:\xc2\xa4=3\'\xc2\xb8>\xc3\xaf7w\xc2\xbb\xc3\x99=#\xc2\xbec\xc2\xb3@=+|\x19\xc2\xbe\xc3\xa8\xc3\xaf\xc2\x82>\xc3\x9c\xc3\xbeF\xc2\xbe{\x10\xc2\x80>\xc2\x81\xc3\xb2\xc2\x8f<\xc2\xb4\x18\xc3\x87=\xc2\xa9\xc3\x83\x14\xc2\xbe\xc3\xabo2=\x03\'\xc3\x97<\xc3\x91@\x16\xc2\xbe\xc2\xb9\xc2\x96U>\xc2\xab\x15\xc2\xac\xc2\xbe\xc3\x90\xc2\xb3\xc2\xab=\x00\xc2\xb10\xc2\xbd\xc3\xb8\xc2\x82P\xc2\xbea\xc2\x83[\xc2\xbeYf\xc2\x9b=]\xc2\x8c\xc3\xb4\xc2\xbd\xc3\x95\xc2\x87\xc2\x9b\xc2\xbe7S\xc2\x81\xc2\xbf\xc3\xa3\xc2\x8fO=\xc3\xa8!,\xc2\xbeP\xc3\x98\xc2\x98\xc2\xbe\xc2\xb1\x06\x04\xc2\xbf2\xc3\xbb\xc3\x9a\xc2\xbe\xc3\xae|@?p\x16]>\xc3\x8c}\x00>\xc2\x98\xc2\xb0\xc2\xa4=^\xc3\x94\xc3\x9e\xc2\xbe]\'L=\x0b\xc3' ...... 
b'\$\x01\x11>EFw>\xc2\xbc\xc2\x86G\xc2\xbe`\xc3\xac;\xc2\xbd\xc3\x86(%>\xc2\x84s\x0c>\xc3\x80\tx<\xc2\xb6\xc2\x8f\xc3\xbf=e\xc2\x9b\xc3\x87\xc2\xbd\xc2\xa4M6?\xc2\xb8\xc2\x8d\xc2\xb1>-iP=OJ1\xc2\xbcb\xc3\xa0\xc2\x9c>\xc3\x92a\xc3\x97\xc2\xbd\xc3\xbc\xc2\x88\xc2\xb9>\xc3\x9d\x01\x1d=<\xc3\x95:\xc2\xbe.\xc3\xb5*\xc2\xbf\x162z\xc2\xbe\xc2\x82.\xc2\xb3\xc2\xbe\xc2\xa5\xc2\xa9\x03>\xc2\x89[\xc2\xa0=\x1a[\xc3\xbc\xc2\xbd\xc3\xaf\xc2\x90\xc2\x89\xc2\xbe\x1a\xc2\x87%\xc2\xbeZ\xc2\xae\xc2\x9c\xc2\xbe\xc2\xa0\xc3\xbd\n\xc2\xbf2\xc2\xaa\xc2\x90\xc2\xbd\xc2\x8c\xc2\x86\xc3\xbd\xc2\xbd\xc2\x95\xc2\xaeE\xc2\xbe?\xc2\x80\xc3\x83\xc2\xbe\xc2\xb64\x04>\xc3\xa2\xc2\xa6\xc2\xb6\xc2\xbdK\x0f3?}}L>\xc3\x97D\xc2\x85\xc2\xbe\x10\x1d\xc2\x9e\xc2\xbe\x1b\xc3\xaa\xc3\xa8=\xc3\xba3\xc2\x8c>\xc2\xb6\xc3\xb6 >\xc3\x8f\xc2\xac@\xc2\xbd\n\xc3\x9ba\xc2\xbe\xc2\xb0x\xc3\x87\xc2\xbe\xc2\x95\x08\x1c\xc2\xbe\xc2\xa3\xc3\xb6\xc2\xbd\xc2\xbe\xc3\x9d\$\xc2\xb5<\xc3\x9e\xc2\x8f%\xc2\xbe\xc2\xa5O(>ZS\r=h\x02\xc2\x88\xc2\xbe\xc3\x99C\xc2\xb5>\xc3\x85q\x1a\xc2\xbc\xc3\x89!M>\xc2\xbfk\xc2\x8c>>\\\x02\xc2\xbe>\xc2\x89\xc3\x84=|\xc2\xa2?\xc2\xbf\xc3\xb5\xc2\xacS=\x17\xc2\x89s\xc2\xbe[\xc2\x83f\xc2\xbe\xc3\xb8\xc3\xa4\xc2\x84>\xc2\x92^\xc2\xae\xc2\xbd~p}<\xc3\x8a\xc3\xa3\xc2\xb9=\x13\$\xc3\xa9\xc2\xbd~\xc3\xab\xc3\x96\xc2\xbc\xc3\xa5]G\xc2\xbe\xc3\xb5\x1aV=\xc3\x97\xc2\x83X\xc2\xbe\xc2\x89:\x0b9lG\x15=~\xc3\xb5\xc3\xaf>e\xc2\xb1B\xc2\xbem\xc2\x9b\xc2\xa1=z\xc3\x8c\xc2\xb1=\xc3\x96\n\xc2\xb1\xc2\xbe\xc3\x97Lg?\xc3\xa9\xc2\xa5\x16=\xc2\xb6\xc3\xaa\$\xc2\xbeq\x11h\x05\x86q\x12Rq\x13tq\x14b.' |
b'\x80\x02cnumpy.core.multiarray\n_reconstruct\nq\x00cnumpy\nndarray\nq\x01K\x00\x85q\x02c_codecs\nencode\nq\x03X\x01\x00\x00\x00bq\x04X\x06\x00\x00\x00latin1q\x05\x86q\x06Rq\x07\x87q\x08Rq\t(K\x01M\x00\x02\x85q\ncnumpy\ndtype\nq\x0bX\x02\x00\x00\x00f4q\x0c\x89\x88\x87q\rRq\x0e(K\x03X\x01\x00\x00\x00<q\x0fNNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tq\x10b\x89h\x03X\x1b\x0c\x00\x00\xc3\xaa!\xc2\x80=\xc3\x86v2\xc2\xbe}\xc2\x99\xc2\x86\xc2\xbe0\xc2\x9a\x18\xc2\xbf\x13O_>p,\xc2\xaa\xc2\xbeq\xc3\x87M\xc2\xbe\xc2\xa9\xc3\xaf\xc2\x82?^&+\xc2\xbb\xc2\xa1\x1f\xc3\x81\xc2\xbd\xc3\x9aA\xc3\x89>\xc2\xae\xc2\xa0\xc3\xa9=\r\xc3\x9c\xc3\x9f>\xc3\xa0\xc2\xa4\xc2\xb1\xc2\xbe\xc2\xba\xc3\xa7\xc3\xae\xc2\xbcT\xc2\xad\$=\xc3\x8426>\xc3\xac~\xc2\x9a\xc2\xbc\xc3\x86\xc3\x93\xc3\x8b<\xc3\xa8\r\xc3\x88=\xc3\xb30\xc2\x8f\xc2\xbd"\xc2\x8b+\xc2\xbdO\xc2\x86\xc2\xbc<\x17|`>\x07\xc2\x8a\xc2\x99\xc2\xbc\xc2\xbd\xc3\x86\xc3\x99=\ng\xc2\xa6\xc2\xbe1\x03\xc3\x9d\xc2\xbe\xc3\xae:\xc2\xa6=\xc3\x8aHp\xc2\xbev\xc2\x9c\xc3\x85<"\xc3\xb6\\\xc2\xbe&\xc3\x94\xc2\x8c=/\xc3\xa9\x11=\x00\xc2\xaa\x05\xc2\xbeyg\xc3\x9d\xc2\xbep\xc3\x9e\x12>\xc2\x9c\xc2\x8c\x00\xc2\xbe)\xc3\x9b\xc2\xaa\xc2\xbdT\xc2\xb1\xc2\x92\xc2\xbf\xc2\xa6\x06\xc3\x9a\xc2\xbd\xc3\x9b\xc2\x81\xc2\x94\xc2\xbe\xc2\x98oz\xc2\xbe\xc2\xa8\x01\xc2\x91\xc2\xbd\xc3\x8d\xc2\x84n\xc2\xbe!,\xc2\x8a?\x12\x18J\xc2\xbd\xc2\x9f,<>3V\xc3\x83=\xc3\xa7' ...... 
b"\xb8<;\xc2\xa1\x15\xc2\xbev=\xc2\x89=\xc3\xa4[G>/Qj>\xc3\x95\xc3\xaee>\xc3\xbc5\xc3\x9c\xc2\xbd.\xc2\x8b\xc2\x9a\xc2\xbe\xc2\xa8\xc2\x89\x11?\x1a]\xc2\xbe>\x06\xc2\x84\xc2\x83>\xc2\xacr\xc2\x84\xc2\xbe\xc2\x8c\xc3\xa4\xc2\x87>eQ\x14\xc2\xbeA\xc2\xa0.>\xc3\xb0X\xc2\xbf\xc2\xbdl\xc3\x8d\xc3\x93\xc2\xbew\xc3\xb2\xc3\xa2\xc2\xbe\t\xc2\x86\xc3\xbc\xc2\xbd\xc2\xba\xc3\xb0\t\xc2\xbf\xc3\x8a\xc3\xaa1\xc2\xbe\xc3\x93\xc3\x8b\xc2\x92\xc2\xbd\x06_'\xc2\xbe\xc2\x8c\xc2\xa8\xc2\xa3\xc2\xbeU\xc3\xab\xc3\x99\xc2\xbd\xc3\x8e\xc2\x84i\xc2\xbee\xc2\x8bF\xc2\xbf\xc2\xbe\xc3\x8dv>N\xc2\xb3\xc3\xb5<\xc2\x99a\xc2\x8e;\xc3\xb2H\xc2\x95\xc2\xbe1\xc2\x81J>\x1e@\xc2\xa5\xc2\xbd\xc3\x83,\xc2\x8b?\xc2\x92\x18\xc3\xbb<\x0c_[\xc2\xbeq\x1e\xc3\xa2\xc2\xbdU\xc3\x9a\x1d\xc2\xbc#\xc2\x9e\xc2\xbf>e2\xc2\xbf>\x1e 0\xc2\xbe\xc2\x88@\x0b\xc2\xbe\xc2\xb7-A\xc2\xbe\xc2\xa5\x03\xc2\x99\xc2\xbdr\xc3\x91\xc3\x85\xc2\xbe\xc3\x8a\xc3\x9ey=\xc3\x91\x02\xc2\x9f\xc2\xbd\xc3\xbe\xc3\xa52\xc2\xbe|-\xc2\x8e=\x7f\xc2\x87\xc2\xa7\xc2\xbd\xc2\xa2\n\xc2\x9d>\xc3\x9e\n\xc3\xa8={\xc3\x87S\xc2\xbd\xc3\xaa\xc3\x94H>~\x11F\xc2\xbe?D\x08>\xc2\xb2\x13\x0e\xc2\xbf\xc2\xaaa*>\xc3\x9a\xc2\x9c\xc2\xb5\xc2\xbe\xc3\x81\xc3\xa2H>l\xc3\xb7i>\xc3\xbd\xc2\xa2\xc3\xb6=&p\xc2\x86=\xc2\xb6\xc3\xbdt<\xc2\xb6R\xc3\x80\xc2\xbe'\xc2\xab\x00\xc2\xbe\xc2\x87;V\xc2\xbd\xc3\x86\xc2\x88\xc2\xa9=\x1f\xc2\xbf\xc3\x8a\xc2\xbe\xc3\x9e\xc3\x8c\xc3\xa8<\xc3\x91\xc3\xad\xc3\x88=\xc2\xa1\x07\xc2\xbd>\xc2\x85&N\xc2\xbe\xc3\x9f\xc3\x8d\xc2\x96>Ae\xc3\xa3<\xc2\x9a\xc3\xb8\xc2\x9e:&\xc2\x99R?\xc3\xa82\x18\xc2\xbe\xc2\xb1\x0e'\xc2\xbeq\x11h\x05\x86q\x12Rq\x13tq\x14b." |
Note, however, that if the function you're `apply`ing doesn't have type hints (as in the example here), you'll need to specify the output column type explicitly.
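The reason an explicit type is sometimes needed can be sketched with Python's own annotation machinery: a return type can be read from a function's type hints when they exist, and must be supplied otherwise. This is an illustration of the idea only; `infer_col_type` is a hypothetical helper, not Pixeltable's actual inference logic:

```python
# Illustration only: why `apply` needs an explicit col_type when the mapped
# function lacks type hints. A sketch of return-type inference from
# annotations; the real Pixeltable machinery is more involved.
import typing

def infer_col_type(fn, explicit=None):
    """Use the function's return annotation if present; otherwise require
    an explicitly supplied column type."""
    hints = typing.get_type_hints(fn)
    if 'return' in hints:
        return hints['return']
    if explicit is None:
        raise ValueError('no return annotation: pass col_type explicitly')
    return explicit

def hinted(x: int) -> str:  # return type is recoverable from the hint
    return str(x)

def unhinted(x):            # no annotations at all (like np.ndarray.dumps)
    return str(x)

t1 = infer_col_type(hinted)                   # inferred from the annotation
t2 = infer_col_type(unhinted, explicit=str)   # must be given explicitly
```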
Column Properties
Some `ColumnRef` expressions have additional useful properties. A media column (image, video, audio, or document) has the following two properties:
- `localpath`: the media location on the local filesystem
- `fileurl`: the original URL where the media resides (could be the same as `localpath`)
t.select(t.image, t.image.localpath).head(5)
Any computed column will have two additional properties, `errortype` and `errormsg`. These properties will usually be `None`. However, if the computed column was created with `on_error='ignore'` and an exception was encountered during column execution, then the properties will contain additional information about the exception.
To demonstrate this feature, we're going to deliberately trigger an exception in a computed column. The images in our example table are black and white, meaning they have only one color channel. If we try to extract a channel other than channel number `0`, we'll get an exception. Ordinarily, when we call `add_column`, the exception is raised and the `add_column` operation is aborted.
t.add_column(channel=t.image.getchannel(1))
Computing cells:  32%|█████████████                            | 16/50 [00:00<00:00, 905.19 cells/s]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/exprs/row_builder.py:314, in RowBuilder.eval(self, data_row, ctx, profile, ignore_errors)
313 start_time = time.perf_counter()
--> 314 expr.eval(data_row, self)
315 if profile is not None:
File ~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/exprs/function_call.py:429, in FunctionCall.eval(self, data_row, row_builder)
427 if isinstance(self.fn, func.CallableFunction) and not self.fn.is_batched:
428 # optimization: avoid additional level of indirection we'd get from calling Function.exec()
--> 429 data_row[self.slot_idx] = self.fn.py_fn(*args, **kwargs)
430 elif self.is_window_fn_call:
File /opt/miniconda3/envs/pxt/lib/python3.9/site-packages/PIL/Image.py:2555, in Image.getchannel(self, channel)
2553 raise ValueError(msg) from e
-> 2555 return self._new(self.im.getband(channel))
ValueError: band index out of range
The above exception was the direct cause of the following exception:
Error Traceback (most recent call last)
Cell In[25], line 1
----> 1 t.add_column(channel=t.image.getchannel(1))
File ~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/catalog/table.py:414, in Table.add_column(self, type, stored, print_stats, on_error, **kwargs)
412 new_col = self._create_columns({col_name: col_schema})[0]
413 self._verify_column(new_col, set(self._schema.keys()), set(self._query_names))
--> 414 status = self._tbl_version.add_column(new_col, print_stats=print_stats, on_error=on_error)
415 FileCache.get().emit_eviction_warnings()
416 return status
File ~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/catalog/table_version.py:503, in TableVersion.add_column(self, col, print_stats, on_error)
501 self.schema_version = self.version
502 with Env.get().engine.begin() as conn:
--> 503 status = self._add_columns([col], conn, print_stats=print_stats, on_error=on_error)
504 _ = self._add_default_index(col, conn)
505 self._update_md(time.time(), conn, preceding_schema_version=preceding_schema_version)
File ~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/catalog/table_version.py:578, in TableVersion._add_columns(self, cols, conn, print_stats, on_error)
576 # we need to re-initialize the sqlalchemy schema
577 self.store_tbl.create_sa_tbl()
--> 578 raise exc
579 finally:
580 plan.close()
File ~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/catalog/table_version.py:561, in TableVersion._add_columns(self, cols, conn, print_stats, on_error)
559 plan.open()
560 try:
--> 561 num_excs = self.store_tbl.load_column(col, plan, value_expr_slot_idx, conn, on_error)
562 except sql.exc.DBAPIError as exc:
563 # Wrap the DBAPIError in an excs.Error to unify processing in the subsequent except block
564 raise excs.Error(f'SQL error during execution of computed column `{col.name}`:\n{exc}') from exc
File ~/Dropbox/workspace/pixeltable/pixeltable/pixeltable/store.py:265, in StoreBase.load_column(self, col, exec_plan, value_expr_slot_idx, conn, on_error)
263 value_exc = result_row.get_exc(value_expr_slot_idx)
264 if on_error == 'abort':
--> 265 raise excs.Error(
266 f'Error while evaluating computed column `{col.name}`:\n{value_exc}'
267 ) from value_exc
268 # we store a NULL value and record the exception/exc type
269 error_type = type(value_exc).__name__
Error: Error while evaluating computed column `channel`:
band index out of range
But if we use `on_error='ignore'`, the exception will be logged in the column properties instead.
t.add_column(channel=t.image.getchannel(1), on_error='ignore')
Computing cells: 100%|████████████████████████████████████████| 50/50 [00:00<00:00, 1718.16 cells/s]
Added 50 column values with 50 errors.
UpdateStatus(num_rows=50, num_computed_values=50, num_excs=50, updated_cols=[], cols_with_excs=['mnist_tmp_69154325.channel'])
Notice that the update status informs us that there were 50 errors. If we query the table, we see that the column contains only `None` values, but the `errortype` and `errormsg` fields contain details of the error.
t.select(t.image, t.channel, t.channel.errortype, t.channel.errormsg).head(5)
More details on Pixeltable's error handling can be found in the Working with External Files how-to guide.
The Pixeltable Type System
We've seen that every column and every expression in Pixeltable has an associated Pixeltable type. In this section, we'll briefly survey the various Pixeltable types and their uses.
Here are all the supported types and their corresponding Python types:
Pixeltable type | Python type |
---|---|
pxt.String | str |
pxt.Int | int |
pxt.Float | float |
pxt.Bool | bool |
pxt.Timestamp | datetime.datetime |
pxt.Json | ** |
pxt.Array | np.ndarray |
pxt.Image | PIL.Image.Image |
pxt.Video | str |
pxt.Audio | str |
pxt.Document | str |
The Python type is what you'll get back if you query an expression of the given Pixeltable type. For `pxt.Json`, it can be any of `str`, `int`, `float`, `bool`, `list`, or `dict`.
`pxt.Audio`, `pxt.Video`, and `pxt.Document` all correspond to the Python type `str`. This is because those types are represented by file paths that reference the media in question. When you query for, say, `t.select(t.video_col)`, you're guaranteed to get a file path on the local filesystem (Pixeltable will download and cache a local copy of the video if necessary to ensure this). If you want the original URL, use `t.video_col.fileurl` instead.
Several types can be specialized to constrain the allowable data in a column.
- `pxt.Image` can be specialized with a resolution and/or an image mode:
  - `pxt.Image[(300,200)]`: images with width 300 and height 200
  - `pxt.Image['RGB']`: images with mode `'RGB'`; see the PIL Documentation for the full list
  - `pxt.Image[(300,200), 'RGB']`: combines the above constraints
- `pxt.Array` must be specialized with a shape and a dtype:
  - `pxt.Array[(64,64,3), pxt.Float]`: 3-dimensional arrays with dtype `pxt.Float` and 64x64x3 shape
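The idea behind a shape specialization can be sketched in a few lines of plain Python, where `None` in the spec acts as a wildcard for that dimension (`conforms` is a hypothetical helper for illustration, not Pixeltable's validation code):

```python
# Illustration only: how a shape spec like (None, None, 3) constrains
# arrays, with None meaning "any size in this dimension". A sketch of the
# idea behind pxt.Array[shape, dtype], not Pixeltable's validator.

def conforms(shape, spec):
    """True if `shape` matches `spec`; None in `spec` is a wildcard."""
    if len(shape) != len(spec):
        return False  # rank (number of dimensions) must match exactly
    return all(s is None or s == d for d, s in zip(shape, spec))

ok = conforms((64, 64, 3), (None, None, 3))      # wildcard height/width
bad_rank = conforms((512,), (None, None, 3))     # wrong number of dims
bad_dim = conforms((64, 64, 4), (None, None, 3)) # channel count mismatch
```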
If we look at the structure of our table now, we see examples of specialized image and array types.
t
Column Name | Type | Computed With |
---|---|---|
image | Image | |
label | String | |
classification | Json | vit_for_image_classification(image, model_id='farleyknight-org-username/vit-base-mnist') |
pred_label | Json | classification.label_text[0] |
clip | Array[(512,), Float] | clip_image(image, model_id='openai/clip-vit-base-patch32') |
channel | Image['L'] | image.getchannel(1) |
`t.clip` has type `pxt.Array[(512,), pxt.Float]`, since the output of the embedding is always a 512-element vector. `t.channel` has type `Image['L']`, since it's always an `'L'`-mode (1-channel) image.
You can use `pxt.Image` by itself to mean "any image, without constraints", but numerical arrays must always specify a shape and a dtype; `pxt.Array` by itself will raise an error. Array shapes are tuples of integers, such as `(512,)` or `(64,64,3)`. A `None` may be used in place of an integer to indicate an unconstrained size for that dimension, as in `(None,None,3)` (3-dimensional array with two unconstrained dimensions), or simply `(None,)` (unconstrained 1-dimensional array).