| input |
|---|
| Hello, world! |
| You can do a lot with Pixeltable UDFs. |
What is a UDF?
A Pixeltable UDF is just a Python function that is marked with the @pxt.udf decorator.
For example, without the decorator, a function named add_one would be an
ordinary Python function that operates on integers. Adding @pxt.udf
converts it into a Pixeltable function that operates on columns of
integers. The decorated function can then be used directly to define
computed columns; Pixeltable will orchestrate its execution across all
the input data.
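For concreteness, add_one might look like the minimal sketch below. The function body is an assumption based on its name, and the decorator is shown only as a comment so that the snippet runs without Pixeltable installed.

```python
# In a Pixeltable session the function would carry the decorator:
#   import pixeltable as pxt
#   @pxt.udf
def add_one(n: int) -> int:
    # Assumed body: return the integer argument plus one
    return n + 1

# As a plain Python function it operates on a single integer
print(add_one(41))  # 42
```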
For our first working example, let’s do something slightly more
interesting: write a function to extract the longest word from a
sentence. (If there are ties for the longest word, we choose the first
word among those ties.) In Python, that might look something like this:
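The sketch below is one possible plain-Python version. The parameter names sentence and strip_punctuation come from the text that follows; the decision to drop every non-alphanumeric character when strip_punctuation is True is an assumption chosen to match the outputs shown in the tables.

```python
def longest_word(sentence: str, strip_punctuation: bool = False) -> str:
    words = sentence.split()
    if strip_punctuation:
        # Assumption: remove every non-alphanumeric character,
        # including apostrophes inside a word
        words = [''.join(ch for ch in word if ch.isalnum()) for word in words]
    # max() returns the first maximal element, so ties go to the earliest word
    return max(words, key=len)

print(longest_word('Hello, world!'))                          # Hello,
print(longest_word('Hello, world!', strip_punctuation=True))  # Hello
```

Note that this sketch raises a ValueError on an empty sentence, since max() has nothing to compare.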
The longest_word Python function isn’t a Pixeltable UDF (yet); it
operates on individual strings, not columns of strings. Adding the
decorator turns it into a UDF:
| input | longest_word |
|---|---|
| Hello, world! | Hello, |
| You can do a lot with Pixeltable UDFs. | Pixeltable |
| input | longest_word |
|---|---|
| Hello, world! | Hello, |
| You can do a lot with Pixeltable UDFs. | Pixeltable |
| Pixeltable updates tables incrementally. | incrementally. |
Let’s add another column, this time using the strip_punctuation parameter
from our UDF. (We could alternatively drop the first column before
adding the new one, but for purposes of this tutorial it’s convenient to
see how Pixeltable executes both variants side-by-side.) Note how
columns such as t.input and constants such as True can be freely
intermixed as arguments to the UDF.
| input | longest_word | longest_word_2 |
|---|---|---|
| Hello, world! | Hello, | Hello |
| You can do a lot with Pixeltable UDFs. | Pixeltable | Pixeltable |
| Pixeltable updates tables incrementally. | incrementally. | incrementally |
Types in UDFs
You might have noticed that the longest_word UDF has type hints in
its signature. The sentence parameter, strip_punctuation parameter, and
return value all have explicit types (str, bool, and str respectively).
In general Python code, type hints are usually optional. But Pixeltable
is a database system: everything in Pixeltable must have a type. And
since Pixeltable is also an orchestrator - meaning it sets up workflows
and computed columns before executing them - these types need to be
known in advance. That’s the reasoning behind a fundamental principle of
Pixeltable UDFs:

- Type hints are required.
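
To illustrate why type hints make "known in advance" possible, the sketch below reads a function's types with the standard library's typing.get_type_hints, without ever calling the function. This is only an illustration of the general idea; it is not a claim about how Pixeltable inspects UDFs internally.

```python
from typing import get_type_hints

def longest_word(sentence: str, strip_punctuation: bool = False) -> str:
    # Body is irrelevant here; only the signature is inspected
    return max(sentence.split(), key=len)

# The types are available before any data is processed
hints = get_type_hints(longest_word)
print(hints['sentence'], hints['strip_punctuation'], hints['return'])
```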
You can turn almost any Python function into a Pixeltable UDF, provided
that it has type hints, and provided that Pixeltable supports the types
that it uses. The most familiar types that you’ll use in UDFs are:

- int
- float
- str
- list (can optionally be parameterized, e.g., list[str])
- dict (can optionally be parameterized, e.g., dict[str, int])
- PIL.Image.Image

In addition to these standard Python types, Pixeltable also recognizes
various kinds of arrays, audio and video media, and documents.
Local and Module UDFs
The longest_word UDF that we defined above is a local UDF: it was
defined directly in our notebook, rather than in a module that we
imported. Many other UDFs, including all of Pixeltable’s built-in
functions, are defined in modules. We encountered a few of these in the
Pixeltable Basics tutorial: the huggingface.detr_for_object_detection
and openai.vision functions. (Although these are built-in functions,
they behave the same way as UDFs, and in fact they’re defined the same
way under the covers.)
There is an important difference between the two. When you add a module
UDF such as openai.vision to a table, Pixeltable stores a reference
to the corresponding Python function in the module. If you later restart
your Python runtime and reload Pixeltable, then Pixeltable will
re-import the module UDF when it loads the computed column. This means
that any code changes made to the UDF will be picked up at that time,
and the new version of the UDF will be used in any future execution.
Conversely, when you add a local UDF to a table, the entire code for
the UDF is serialized and stored in the table. This ensures that if you
restart your notebook kernel (say), or even delete the notebook
entirely, the UDF will continue to function. However, it also means that
if you modify the UDF code, the updated logic will not be reflected in
any existing Pixeltable columns.
To see how this works in practice, let’s modify our longest_word UDF
so that if strip_punctuation is True, then we remove only a single
punctuation mark from the end of each word.
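
A plain-Python sketch of the modified logic might look as follows. Using string.punctuation to decide what counts as a punctuation mark is an assumption; in a Pixeltable session the function would additionally carry the @pxt.udf decorator.

```python
import string

def longest_word(sentence: str, strip_punctuation: bool = False) -> str:
    words = sentence.split()
    if strip_punctuation:
        # Remove at most one trailing punctuation mark from each word;
        # punctuation inside a word (such as an apostrophe) is kept
        words = [w[:-1] if w[-1] in string.punctuation else w for w in words]
    # Ties go to the earliest word, because max() returns the first maximum
    return max(words, key=len)

print(longest_word("Let's check that it still works.", strip_punctuation=True))  # Let's
```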
| input | longest_word | longest_word_2 |
|---|---|---|
| Hello, world! | Hello, | Hello |
| You can do a lot with Pixeltable UDFs. | Pixeltable | Pixeltable |
| Pixeltable updates tables incrementally. | incrementally. | incrementally |
| Let's check that it still works. | works. | check |
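As the table shows, the existing longest_word_2 column still uses the
original logic. However, if we add a new column that references the
longest_word UDF,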
Pixeltable will use the updated version.
| input | longest_word | longest_word_2 | longest_word_3 |
|---|---|---|---|
| Hello, world! | Hello, | Hello | Hello |
| You can do a lot with Pixeltable UDFs. | Pixeltable | Pixeltable | Pixeltable |
| Pixeltable updates tables incrementally. | incrementally. | incrementally | incrementally |
| Let's check that it still works. | works. | check | Let's |
Batching
Pixeltable provides several ways to optimize UDFs for better
performance. One of the most common is batching, which is particularly
important for UDFs that involve GPU operations. Ordinary UDFs process
one row at a time, meaning the UDF will be invoked exactly once per row
processed. Conversely, a batched UDF processes several rows at a time;
the specific number is user-configurable. As an example, let’s modify
our longest_word UDF to take a batched
parameter. Here’s what it looks like:
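
The sketch below mimics the batched shape in plain Python so that it runs without Pixeltable. In the real UDF the decorator would be @pxt.udf(batch_size=16) and the list[str] annotations would be Batch[str]; the run_in_batches helper is hypothetical and only stands in for the chunking that Pixeltable performs.

```python
import string

def longest_word(sentences: list[str], strip_punctuation: bool = False) -> list[str]:
    # One call handles a whole batch and returns one result per sentence
    results = []
    for sentence in sentences:
        words = sentence.split()
        if strip_punctuation:
            # Remove at most one trailing punctuation mark from each word
            words = [w[:-1] if w[-1] in string.punctuation else w for w in words]
        results.append(max(words, key=len))
    return results

def run_in_batches(rows: list[str], batch_size: int = 16) -> list[str]:
    # Hypothetical driver: split the rows into chunks and call the
    # batched function once per chunk
    out: list[str] = []
    for start in range(0, len(rows), batch_size):
        out.extend(longest_word(rows[start:start + batch_size], strip_punctuation=True))
    return out

rows = ['Hello, world!', "Let's check that it still works."]
print(run_in_batches(rows))  # ['Hello', "Let's"]
```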
There are several changes:

- The parameter batch_size=16 has been added to the @pxt.udf decorator, specifying the batch size;
- The sentences parameter has changed from str to Batch[str];
- The return type has also changed from str to Batch[str]; and
- Instead of processing a single sentence, the UDF is processing a Batch of sentences and returning the result Batch.

What exactly is a Batch[str]? Functionally, it’s simply a list[str],
and you can use it exactly like a list[str] in any Python code. The
only difference is in the type hint; a type hint of Batch[str] tells
Pixeltable, “My data consists of individual strings that I want you to
process in batches”. Conversely, a type hint of list[str] would mean,
“My data consists of lists of strings that I want you to process one
at a time”.
Notice that the strip_punctuation parameter is not wrapped in a
Batch type. This is because strip_punctuation controls the behavior of
the UDF, rather than being part of the input data. When we use the
batched longest_word UDF, the strip_punctuation parameter will
always be a constant, not a column.
Let’s put the new, batched UDF to work.
| input | longest_word | longest_word_2 | longest_word_3 | longest_word_3_batched |
|---|---|---|---|---|
| Hello, world! | Hello, | Hello | Hello | Hello |
| You can do a lot with Pixeltable UDFs. | Pixeltable | Pixeltable | Pixeltable | Pixeltable |
| Pixeltable updates tables incrementally. | incrementally. | incrementally | incrementally | incrementally |
| Let's check that it still works. | works. | check | Let's | Let's |
The longest_word_3_batched column is
identical to the longest_word_3 column. Under the covers, though,
Pixeltable is orchestrating execution in batches of 16. That probably
won’t have much performance impact on our toy example, but for GPU-bound
computations such as text or image embeddings, it can make a substantial
difference.
UDAs (Aggregate UDFs)
Ordinary UDFs are always one-to-one on rows: each row of input generates
one UDF output value. Functions that aggregate data, conversely, are
many-to-one, and in Pixeltable they are represented by a related
abstraction, the UDA (User-Defined Aggregate). Pixeltable has a number
of built-in UDAs; if you’ve worked through the Fundamentals tutorial,
you’ll have already encountered a few of them, such as sum and count.
In this section, we’ll show how to define
your own custom UDAs. For demonstration purposes, let’s start by
creating a table containing all the integers from 0 to 49.
To compute their sum using the built-in sum aggregate,
we’d do it like this:
| sum |
|---|
| 1225 |
Or we can group the integers by n // 10 (corresponding to the tens
digit of each integer) and sum each group:
| col_0 | sum |
|---|---|
| 0 | 45 |
| 1 | 145 |
| 2 | 245 |
| 3 | 345 |
| 4 | 445 |
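
As a sanity check on the arithmetic, the same total and grouped sums can be reproduced in plain Python. This sketch does not use Pixeltable at all; it only mirrors the grouping key n // 10 described above.

```python
from collections import defaultdict

# Sum the integers 0..49, keyed by their tens digit
group_sums: dict[int, int] = defaultdict(int)
for n in range(50):
    group_sums[n // 10] += n

print(sum(range(50)))    # 1225
print(dict(group_sums))  # {0: 45, 1: 145, 2: 245, 3: 345, 4: 445}
```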
To define a custom aggregate, we implement a subclass of the
pxt.Aggregator Python class and decorate it with the @pxt.uda
decorator, similar to what we did for UDFs. The subclass must implement
three methods:

- __init__() - initializes the aggregator; can be used to parameterize aggregator behavior
- update() - updates the internal state of the aggregator with a new value
- value() - retrieves the current value held by the aggregator

For a sum-of-squares aggregate, the class needs a single member, cur_sum, which
holds a running total of the squares of all the values we’ve seen.
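
The three-method shape can be sketched in plain Python as follows. In Pixeltable the class would subclass pxt.Aggregator and carry the @pxt.uda decorator; both are left out here so that the snippet runs on its own. The class name sum_of_squares is an assumption taken from the column header in the output, and the loop stands in for Pixeltable feeding rows to the aggregator.

```python
class sum_of_squares:
    def __init__(self) -> None:
        # Running total of the squares seen so far
        self.cur_sum = 0

    def update(self, val: int) -> None:
        # Fold one new value into the internal state
        self.cur_sum += val * val

    def value(self) -> int:
        # Report the current aggregate
        return self.cur_sum

agg = sum_of_squares()
for n in range(50):
    agg.update(n)
print(agg.value())  # 40425
```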
| sum_of_squares |
|---|
| 40425 |
| col_0 | sum_of_squares |
|---|---|
| 0 | 285 |
| 1 | 2185 |
| 2 | 6085 |
| 3 | 11985 |
| 4 | 19885 |