User-Defined Functions (UDFs)

UDFs in Pixeltable

Pixeltable comes with a library of built-in functions and integrations, but sooner or later, you'll want to introduce some customized logic into your workflow. This is where Pixeltable's rich UDF (User-Defined Function) capability comes in. Pixeltable UDFs let you write code in Python, then directly insert your custom logic into Pixeltable expressions and computed columns. In this how-to guide, we'll show how to define UDFs, extend their capabilities, and use them in computed columns.

To start, we'll install the necessary dependencies, create a Pixeltable directory and table to experiment with, and add some sample data.

%pip install -qU pixeltable
Note: you may need to restart the kernel to use updated packages.
import pixeltable as pxt

# Create the directory and table
pxt.drop_dir('udf_demo', force=True)  # Ensure a clean slate for the demo
pxt.create_dir('udf_demo')
t = pxt.create_table('udf_demo.strings', {'input': pxt.String})

# Add some sample data
t.insert([{'input': 'Hello, world!'}, {'input': 'You can do a lot with Pixeltable UDFs.'}])
t.show()
Connected to Pixeltable database at: postgresql://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `udf_demo`.
Created table `strings`.
Inserting rows into `strings`: 2 rows [00:00, 1338.54 rows/s]
Inserted 2 rows with 0 errors.
input
Hello, world!
You can do a lot with Pixeltable UDFs.

What is a UDF?

A Pixeltable UDF is just a Python function that is marked with the @pxt.udf decorator.

@pxt.udf
def add_one(n: int) -> int:
    return n + 1

It's as simple as that! Without the decorator, add_one would be an ordinary Python function that operates on integers. Adding @pxt.udf converts it into a Pixeltable function that operates on columns of integers. The decorated function can then be used directly to define computed columns; Pixeltable will orchestrate its execution across all the input data.

For our first working example, let's do something slightly more interesting: write a function to extract the longest word from a sentence. (If there are ties for the longest word, we choose the first word among those ties.) In Python, that might look something like this:

import numpy as np

def longest_word(sentence: str, strip_punctuation: bool = False) -> str:
    words = sentence.split()
    if strip_punctuation:  # Remove non-alphanumeric characters from each word
        words = [''.join(filter(str.isalnum, word)) for word in words]
    i = np.argmax([len(word) for word in words])
    return words[i]
longest_word("Let's check that it works.", strip_punctuation=True)
'check'
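
As an aside, the tie-breaking rule described above does not depend on NumPy. Python's built-in max returns the first maximal element when several elements share the largest key, so a plain-Python variant behaves the same way. The sketch below uses a name of our own choosing (longest_word_plain is not part of the tutorial) and, like the original, assumes the sentence contains at least one word.

```python
def longest_word_plain(sentence: str, strip_punctuation: bool = False) -> str:
    words = sentence.split()
    if strip_punctuation:  # Remove non-alphanumeric characters from each word
        words = [''.join(filter(str.isalnum, word)) for word in words]
    # max() scans left to right and keeps the first element with the largest key,
    # which matches the "first word among ties" rule
    return max(words, key=len)


print(longest_word_plain("Let's check that it works.", strip_punctuation=True))
```

Here 'check' and 'works' both have five characters once punctuation is stripped, and the earlier one is returned.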

The longest_word Python function isn't a Pixeltable UDF (yet); it operates on individual strings, not columns of strings. Adding the decorator turns it into a UDF:

@pxt.udf
def longest_word(sentence: str, strip_punctuation: bool = False) -> str:
    words = sentence.split()
    if strip_punctuation:  # Remove non-alphanumeric characters from each word
        words = [''.join(filter(str.isalnum, word)) for word in words]
    i = np.argmax([len(word) for word in words])
    return words[i]

Now we can use it to create a computed column. Pixeltable orchestrates the computation as it does for any other function, applying the UDF in turn to each existing row of the table, then updating incrementally each time a new row is added.

t.add_computed_column(longest_word=longest_word(t.input))
t.show()
Added 2 column values with 0 errors.
input                                     longest_word
Hello, world!                             Hello,
You can do a lot with Pixeltable UDFs.    Pixeltable
t.insert(input='Pixeltable updates tables incrementally.')
t.show()
Inserting rows into `strings`: 1 rows [00:00, 339.89 rows/s]
Inserted 1 row with 0 errors.
input                                     longest_word
Hello, world!                             Hello,
You can do a lot with Pixeltable UDFs.    Pixeltable
Pixeltable updates tables incrementally.  incrementally.

Oops, those trailing punctuation marks are kind of annoying. Let's add another column, this time using the handy strip_punctuation parameter from our UDF. (We could alternatively drop the first column before adding the new one, but for purposes of this tutorial it's convenient to see how Pixeltable executes both variants side-by-side.) Note how columns such as t.input and constants such as True can be freely intermixed as arguments to the UDF.

t.add_computed_column(
    longest_word_2=longest_word(t.input, strip_punctuation=True)
)
t.show()
Added 3 column values with 0 errors.
input                                     longest_word    longest_word_2
Hello, world!                             Hello,          Hello
You can do a lot with Pixeltable UDFs.    Pixeltable      Pixeltable
Pixeltable updates tables incrementally.  incrementally.  incrementally

Types in UDFs

You might have noticed that the longest_word UDF has type hints in its signature.

def longest_word(sentence: str, strip_punctuation: bool = False) -> str: ...

The sentence parameter, strip_punctuation parameter, and return value all have explicit types (str, bool, and str respectively). In ordinary Python code, type hints are optional. But Pixeltable is a database system: everything in Pixeltable must have a type. And since Pixeltable is also an orchestrator (it sets up workflows and computed columns before executing them), these types need to be known in advance. That's the reasoning behind a fundamental principle of Pixeltable UDFs:

  • Type hints are required.

You can turn almost any Python function into a Pixeltable UDF, provided that it has type hints, and provided that Pixeltable supports the types that it uses. The most familiar types that you'll use in UDFs are:

  • int
  • float
  • bool
  • str
  • list (can optionally be parameterized, e.g., list[str])
  • dict (can optionally be parameterized, e.g., dict[str, int])
  • PIL.Image.Image

In addition to these standard Python types, Pixeltable also recognizes various kinds of arrays, audio and video media, and documents.
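
To make the "known in advance" point concrete, here is a small sketch of how a function's types can be read from its signature without ever calling it. This is not Pixeltable's implementation: describe_signature is a hypothetical helper built on the standard typing and inspect modules, and it simply rejects any function whose signature is not fully annotated.

```python
import inspect
import typing


def describe_signature(fn):
    """Return (parameter types, return type) without calling fn."""
    hints = typing.get_type_hints(fn)
    params = inspect.signature(fn).parameters
    missing = [name for name in params if name not in hints]
    if missing or 'return' not in hints:
        raise TypeError(f'{fn.__name__}: missing type hints for {missing or "return value"}')
    return {name: hints[name] for name in params}, hints['return']


def longest_word(sentence: str, strip_punctuation: bool = False) -> str:
    ...  # body omitted; only the signature matters here


param_types, return_type = describe_signature(longest_word)
print(param_types)
print(return_type)
```

Because everything is derived from the signature alone, a check like this can run at the moment a column is defined, before any row has been processed.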

Local and Module UDFs

The longest_word UDF that we defined above is a local UDF: it was defined directly in our notebook, rather than in a module that we imported. Many other UDFs, including all of Pixeltable's built-in functions, are defined in modules. We encountered a few of these in the Pixeltable Basics tutorial: the huggingface.detr_for_object_detection and openai.vision functions. (Although these are built-in functions, they behave the same way as UDFs, and in fact they're defined the same way under the covers.)

There is an important difference between the two. When you add a module UDF such as openai.vision to a table, Pixeltable stores a reference to the corresponding Python function in the module. If you later restart your Python runtime and reload Pixeltable, then Pixeltable will re-import the module UDF when it loads the computed column. This means that any code changes made to the UDF will be picked up at that time, and the new version of the UDF will be used in any future execution.

Conversely, when you add a local UDF to a table, the entire code for the UDF is serialized and stored in the table. This ensures that if you restart your notebook kernel (say), or even delete the notebook entirely, the UDF will continue to function. However, it also means that if you modify the UDF code, the updated logic will not be reflected in any existing Pixeltable columns.

To see how this works in practice, let's modify our longest_word UDF so that if strip_punctuation is True, then we remove only a single punctuation mark from the end of each word.

@pxt.udf
def longest_word(sentence: str, strip_punctuation: bool = False) -> str:
    words = sentence.split()
    if strip_punctuation:
        words = [
            word if word[-1].isalnum() else word[:-1]
            for word in words
        ]
    i = np.argmax([len(word) for word in words])
    return words[i]

Now we see that Pixeltable continues to use the old definition, even as new rows are added to the table.

t.insert(input="Let's check that it still works.")
t.show()
Inserting rows into `strings`: 1 rows [00:00, 301.10 rows/s]
Inserted 1 row with 0 errors.
input                                     longest_word    longest_word_2
Hello, world!                             Hello,          Hello
You can do a lot with Pixeltable UDFs.    Pixeltable      Pixeltable
Pixeltable updates tables incrementally.  incrementally.  incrementally
Let's check that it still works.          works.          check

But if we add a new column that references the longest_word UDF, Pixeltable will use the updated version.

t.add_computed_column(
    longest_word_3=longest_word(t.input, strip_punctuation=True)
)
t.show()
Added 4 column values with 0 errors.
input                                     longest_word    longest_word_2  longest_word_3
Hello, world!                             Hello,          Hello           Hello
You can do a lot with Pixeltable UDFs.    Pixeltable      Pixeltable      Pixeltable
Pixeltable updates tables incrementally.  incrementally.  incrementally   incrementally
Let's check that it still works.          works.          check           Let's
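
The last row is where the two versions disagree (check versus Let's), and the reason is the apostrophe. The sketch below isolates the two stripping strategies as standalone helpers; the names strip_all and strip_trailing are ours, introduced only for illustration. Both assume a non-empty word, which str.split() guarantees.

```python
def strip_all(word: str) -> str:
    # First version of the UDF: drop every non-alphanumeric character, wherever it appears
    return ''.join(filter(str.isalnum, word))


def strip_trailing(word: str) -> str:
    # Modified version: drop at most one non-alphanumeric character, and only at the end
    return word if word[-1].isalnum() else word[:-1]


for word in ["Let's", 'works.']:
    print(f'{word!r}: strip_all={strip_all(word)!r}, strip_trailing={strip_trailing(word)!r}')
```

Under the first strategy "Let's" shrinks to four characters, so the five-letter "check" wins. Under the second it keeps all five characters and, being the first of several five-character words, it is the one returned.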

The general rule is: changes to module UDFs will affect any future execution; changes to local UDFs will only affect new columns that are defined using the new version of the UDF.
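
If the distinction feels abstract, a plain-Python analogy may help. The sketch below does not use Pixeltable and is not how Pixeltable stores UDFs (local UDFs are serialized into the table, as described above). It only contrasts "keep a snapshot of the function" with "keep a reference and resolve it when needed". The names shout, stored_local, stored_ref, and resolve are all hypothetical.

```python
import importlib


# "Local" style: hold on to the function object that existed at definition time
def shout(s: str) -> str:
    return s.upper()


stored_local = shout  # snapshot of the current definition


def shout(s: str) -> str:  # redefining the name later does not touch the snapshot
    return s.upper() + '!'


print(stored_local('hi'))  # old behavior: HI
print(shout('hi'))         # new behavior: HI!


# "Module" style: hold only a (module, attribute) reference and look it up on use
stored_ref = ('math', 'sqrt')


def resolve(ref):
    module_name, attr = ref
    return getattr(importlib.import_module(module_name), attr)


print(resolve(stored_ref)(9.0))  # whatever math.sqrt is at lookup time: 3.0
```

The snapshot keeps working no matter what happens to the original name, while the reference always picks up whatever the module currently provides.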

Batching

Pixeltable provides several ways to optimize UDFs for better performance. One of the most common is batching, which is particularly important for UDFs that involve GPU operations.

Ordinary UDFs process one row at a time, meaning the UDF will be invoked exactly once per row processed. Conversely, a batched UDF processes several rows at a time; the specific number is user-configurable. As an example, let's modify our longest_word UDF to take a batched parameter. Here's what it looks like:

from pixeltable.func import Batch

@pxt.udf(batch_size=16)
def longest_word(sentences: Batch[str], strip_punctuation: bool = False) -> Batch[str]:
    results = []
    for sentence in sentences:
        words = sentence.split()
        if strip_punctuation:
            words = [
                word if word[-1].isalnum() else word[:-1]
                for word in words
            ]
        i = np.argmax([len(word) for word in words])
        results.append(words[i])
    return results

There are several changes:

  • The parameter batch_size=16 has been added to the @pxt.udf decorator, specifying the batch size;
  • The sentences parameter has changed from str to Batch[str];
  • The return type has also changed from str to Batch[str]; and
  • Instead of processing a single sentence, the UDF is processing a Batch of sentences and returning the result Batch.

What exactly is a Batch[str]? Functionally, it's simply a list[str], and you can use it exactly like a list[str] in any Python code. The only difference is in the type hint; a type hint of Batch[str] tells Pixeltable, "My data consists of individual strings that I want you to process in batches". Conversely, a type hint of list[str] would mean, "My data consists of lists of strings that I want you to process one at a time".
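
To picture what "process in batches" means, here is a minimal plain-Python sketch of a driver that feeds rows to a batched function in chunks. It is an illustration only, not Pixeltable's scheduler: iter_batches, apply_batched, and upper_batch are hypothetical names, and the smaller final batch is a property of this sketch.

```python
from typing import Callable, Iterator


def iter_batches(rows: list[str], batch_size: int) -> Iterator[list[str]]:
    # Yield consecutive slices of at most batch_size rows
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def apply_batched(
    fn: Callable[[list[str]], list[str]], rows: list[str], batch_size: int
) -> tuple[list[str], int]:
    results: list[str] = []
    calls = 0
    for batch in iter_batches(rows, batch_size):
        out = fn(batch)
        if len(out) != len(batch):
            raise ValueError('a batched function must return one result per input row')
        results.extend(out)
        calls += 1
    return results, calls


def upper_batch(sentences: list[str]) -> list[str]:
    return [s.upper() for s in sentences]


rows = [f'row {i}' for i in range(40)]
results, calls = apply_batched(upper_batch, rows, batch_size=16)
print(calls)  # 3 calls for 40 rows: batches of 16, 16, and 8
```

The function is invoked three times instead of forty, which is where the savings come from when each call carries a fixed overhead.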

Notice that the strip_punctuation parameter is not wrapped in a Batch type. This is because strip_punctuation controls the behavior of the UDF, rather than being part of the input data. When we use the batched longest_word UDF, the strip_punctuation parameter will always be a constant, not a column.

Let's put the new, batched UDF to work.

t.add_computed_column(
    longest_word_3_batched=longest_word(t.input, strip_punctuation=True)
)
t.show()
Added 4 column values with 0 errors.
input                                     longest_word    longest_word_2  longest_word_3  longest_word_3_batched
Hello, world!                             Hello,          Hello           Hello           Hello
You can do a lot with Pixeltable UDFs.    Pixeltable      Pixeltable      Pixeltable      Pixeltable
Pixeltable updates tables incrementally.  incrementally.  incrementally   incrementally   incrementally
Let's check that it still works.          works.          check           Let's           Let's

As expected, the output of the longest_word_3_batched column is identical to the longest_word_3 column. Under the covers, though, Pixeltable is orchestrating execution in batches of 16. That probably won't have much performance impact on our toy example, but for GPU-bound computations such as text or image embeddings, it can make a substantial difference.