Runtime -> Change runtime type menu item at the top, then select the
GPU radio button and click on Save.
Main takeaways:
* Indexing in Pixeltable is declarative
  - you create an index on a column and supply the embedding functions you want to use (for inserting data into the index as well as lookups)
  - Pixeltable maintains the index in response to any kind of update of the indexed table (i.e., insert()/update()/delete())
* Perform index lookups with the similarity() pseudo-function, in combination with the order_by() and limit() clauses
To make this concrete, let’s create a table of images with the
create_table() function. We’re also going to add some columns to
demonstrate combining similarity search with other predicates.
Creating an index
To create and populate an index, we call Table.add_embedding_index()
and tell it which UDF or UDFs to use to create embeddings. That
definition is persisted as part of the table’s metadata, which allows
Pixeltable to maintain the index in response to updates to the table.
Any embedding UDF can be used for the index. For this example, we’re
going to use a CLIP model, which has built-in support in Pixeltable
under the pixeltable.functions.huggingface package. As an alternative,
you could use an online service such as OpenAI (see
pixeltable.functions.openai), or create your own embedding UDF with
custom code (we’ll see how to do this below).
Because we’re adding an index to an image column, the UDF we specify
must be able to handle images. In fact, CLIP models are multimodal:
they can handle both text and images, which is useful for doing lookups
against the index.
The first parameter of add_embedding_index() is the name of the column
being indexed; the embed parameter specifies the relevant embedding.
Notice the notation we used, clip.using(model_id='openai/clip-vit-base-patch32'):
clip is a general-purpose UDF that can accept any CLIP model available
in the Hugging Face model repository. To define an embedding, however,
we need to provide a specific embedding function to
add_embedding_index(): a function that is not parameterized on
model_id. The .using(model_id=...) syntax tells Pixeltable to
specialize the clip UDF by fixing the model_id parameter to the
specific value 'openai/clip-vit-base-patch32'.
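As a rough sketch of the call described above (the original code cell may well differ): the table path 'indices_demo.img_tbl' and the wrapper function add_clip_index are hypothetical, the table is assumed to already exist with an image column named img, and the keyword is written as `embedding` on the assumption of a recent Pixeltable release, so check the parameter name in the version you have installed.

```python
def add_clip_index(table_path: str = 'indices_demo.img_tbl') -> None:
    """Attach a CLIP embedding index to the 'img' column of an existing table."""
    # Imports are deferred so that defining this function does not require
    # Pixeltable or the Hugging Face dependencies to be installed.
    import pixeltable as pxt
    from pixeltable.functions.huggingface import clip

    imgs = pxt.get_table(table_path)  # hypothetical table path

    # .using() fixes model_id, leaving a UDF that only takes the data to embed.
    clip_embed = clip.using(model_id='openai/clip-vit-base-patch32')

    # Assumption: the embedding keyword is named `embedding` in your version.
    imgs.add_embedding_index('img', embedding=clip_embed)
```

Nothing runs until add_clip_index() is called, at which point the index is created and populated for the rows already in the table.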
If you’re familiar with functional programming concepts, you might
recognize .using() as a partial function operator. It’s a general
operator that can be applied to any UDF (not just embedding functions),
transforming a UDF with n parameters into one with k parameters by
fixing the values of n-k of its arguments. Python has something similar
in the functools package: the functools.partial() operator.
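To illustrate the analogy in plain Python, here is a minimal sketch using functools.partial(). The embed function is a hypothetical stand-in with n = 2 parameters, not a Pixeltable UDF; fixing model_id leaves a function with k = 1 parameter.

```python
from functools import partial


def embed(text: str, model_id: str) -> str:
    """Hypothetical stand-in for a two-parameter embedding function."""
    return f'{model_id}:{text}'


# Fix model_id, leaving a one-parameter function of the text alone.
embed_with_clip = partial(embed, model_id='openai/clip-vit-base-patch32')

result = embed_with_clip('a photo of a cat')
print(result)
```

The specialized function can then be passed anywhere a one-argument callable is expected, which is the role the specialized UDF plays for the index.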
add_embedding_index() provides a few other optional parameters:
- idx_name: optional name for the index, which needs to be unique for the table; a default name is created if this isn’t provided explicitly
- metric: the metric to use to compute the similarity of two embedding vectors; one of:
  - 'cosine': cosine distance (default)
  - 'ip': inner product
  - 'l2': L2 distance
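For reference, the three quantities behind these metric names can be sketched in plain Python. These are the textbook definitions, not Pixeltable's internal implementation, and the helper names are made up for illustration. Cosine distance is conventionally 1 minus the cosine similarity computed here.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products ('ip')."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Inner product of the two vectors after normalizing each to unit length."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean distance between the two vectors ('l2')."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


u = [1.0, 0.0]
v = [1.0, 1.0]
print(inner_product(u, v), cosine_similarity(u, v), l2_distance(u, v))
```

Note the difference in direction: larger inner products and cosine similarities mean "more alike", whereas a smaller L2 distance means "more alike".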
Using the index in queries
To take advantage of an embedding index when querying a table, we use
the similarity() pseudo-function, which is invoked as a method on the
indexed column, in combination with the order_by() and limit() clauses.
First, we’ll get a sample image from the table:

We then call the similarity() pseudo-function as a method on the
indexed column and apply order_by() and limit(). We used the default
cosine distance when we created the index, so we’re going to order by
descending similarity (order_by(..., asc=False)):
| id | img | similarity |
|---|---|---|
| 6 | (image) | 1. |
| 3 | (image) | 0.607 |
We can also combine the similarity search with a regular predicate, for example to exclude sample_img (which we already know has perfect similarity with itself):
| id | img | similarity |
|---|---|---|
| 3 | (image) | 0.607 |
| 7 | (image) | 0.551 |
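The two queries above could look roughly like the following sketch. The function nearest_images and the table path are hypothetical, the table is assumed to have id and img columns with an embedding index on img, and the exact signature of similarity() may differ between Pixeltable versions.

```python
def nearest_images(sample_img, k: int = 2, exclude_id=None,
                   table_path: str = 'indices_demo.img_tbl'):
    """Return the k rows whose 'img' is most similar to sample_img."""
    import pixeltable as pxt  # deferred so the definition has no dependencies

    imgs = pxt.get_table(table_path)  # hypothetical table path

    # similarity() is invoked as a method on the indexed column.
    sim = imgs.img.similarity(sample_img)

    # Optionally combine the lookup with an ordinary predicate.
    query = imgs
    if exclude_id is not None:
        query = imgs.where(imgs.id != exclude_id)

    # Highest similarity first, keeping only the top k rows.
    return (
        query.order_by(sim, asc=False)
        .limit(k)
        .select(imgs.id, imgs.img, similarity=sim)
        .collect()
    )
```

Calling nearest_images(sample_img) would correspond to the first query, and nearest_images(sample_img, exclude_id=6) to the second, which filters out the sample row by its id.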
Index updates
In Pixeltable, each index is kept up-to-date automatically in response to changes to the indexed table. To illustrate this, let’s insert a few more rows and then re-run the initial similarity query, which now returns a different result:
| id | img | similarity |
|---|---|---|
| 6 | (image) | 1. |
| 19 | (image) | 0.617 |
Similarity search on different types
Because CLIP models are multimodal, we can also do lookups by text.
| id | img | similarity |
|---|---|---|
| 13 | (image) | 0.274 |
| 9 | (image) | 0.24 |
Creating multiple indexes on a single column
We can create multiple embedding indexes on the same column, utilizing different embedding models. In order to use a specific index in a query, we need to assign it a name and then use that name in the query. To illustrate this, let’s create a table with text (taken from the Wikipedia article on Pablo Picasso):
When calling add_embedding_index(), we now specify the index name
(idx_name) directly. If it is not specified, Pixeltable will assign a
name (such as idx0).
To run a similarity query against a specific index, we call
similarity() with the idx parameter:
| text | similarity |
|---|---|
| One of the most influential artists of the 20th century, he is known for co-founding the Cubist movement, the invention of constructed sculpture,[8][9] the co-invention of collage, and for the wide variety of styles that he helped develop and explore. | 0.443 |
| While the names of many of his later periods are debated, the most commonly accepted periods in his work are the Blue Period (1901–1904), the Rose Period (1904–1906), the African-influenced Period (1907–1909), Analytic Cubism (1909–1912), and Synthetic Cubism (1912–1919), also referred to as the Crystal period. | 0.426 |
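A sketch of how two named indexes on the same text column might be created and queried is shown below. Everything specific here is an assumption for illustration: the table path, the index names minilm_idx and e5_idx, the two Hugging Face model ids, the `embedding` keyword, and the helper function names. Only idx_name and the idx parameter of similarity() come from the text above.

```python
def add_named_text_indexes(table_path: str = 'indices_demo.text_tbl') -> None:
    """Create two differently named embedding indexes on the 'text' column."""
    import pixeltable as pxt  # deferred imports
    from pixeltable.functions.huggingface import sentence_transformer

    txts = pxt.get_table(table_path)  # hypothetical table path

    # Illustrative model ids; any supported sentence-transformer model would do.
    txts.add_embedding_index(
        'text',
        idx_name='minilm_idx',
        embedding=sentence_transformer.using(
            model_id='sentence-transformers/all-MiniLM-L6-v2'),
    )
    txts.add_embedding_index(
        'text',
        idx_name='e5_idx',
        embedding=sentence_transformer.using(model_id='intfloat/e5-large-v2'),
    )


def search_text(query: str, idx_name: str, k: int = 2,
                table_path: str = 'indices_demo.text_tbl'):
    """Return the k rows most similar to query, using the named index."""
    import pixeltable as pxt

    txts = pxt.get_table(table_path)
    # idx selects which of the column's indexes the lookup should use.
    sim = txts.text.similarity(query, idx=idx_name)
    return (
        txts.order_by(sim, asc=False)
        .limit(k)
        .select(txts.text, similarity=sim)
        .collect()
    )
```

Because each index uses a different model, the same query string can rank rows differently depending on which idx is passed.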
Using a UDF for a custom embedding
The above examples show how to use any model in the Hugging Face CLIP
or sentence_transformer model families, and essentially the same
pattern can be used for any other embedding with built-in Pixeltable
support, such as OpenAI embeddings. But what if you want to adapt a new
model family that doesn’t have built-in support in Pixeltable? This can
be done by writing a custom Pixeltable UDF.
In the following example, we’ll write a simple UDF to use the BERT
model built on TensorFlow. First we install the necessary dependencies.
Our UDF takes a str as input and returns an embedding vector whose
dimension is fixed by the model (small_bert, the variant we’ll be
using). If we were writing an image embedding UDF, the input would have
type PIL.Image.Image rather than str. The UDF is straightforward,
loading the model and evaluating it against the input, with a minor
data conversion on either side of the model invocation.
Here is the result of a sample query run against bert_idx:
| text | similarity |
|---|---|
| Picasso's output, especially in his early career, is often periodized. | 0.699 |
| During the first decade of the 20th century, his style changed as he experimented with different theories, techniques, and ideas. | 0.697 |
Our example UDF is simple, but two changes would make it considerably more efficient:
- Cache the model: the current version calls hub.load() on every UDF invocation. In a real application, we’d want to instantiate the model just once, then reuse it on subsequent UDF calls.
- Batch our inputs: we’d use Pixeltable’s batching capability to ensure we’re making efficient use of the model. Batched UDFs are described in depth in the User-Defined Functions how-to guide.
If operations involving bert_idx seem sluggish, that’s why!
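One common way to get the "load once, reuse afterwards" behavior in plain Python is functools.lru_cache. The sketch below uses a toy loader in place of hub.load(); the function names, the example URL, and the toy "model" are all hypothetical and only illustrate the caching pattern.

```python
from functools import lru_cache

load_count = 0  # counts how many times the expensive load actually runs


@lru_cache(maxsize=None)
def get_model(model_url: str):
    """Stand-in for an expensive loader such as hub.load() (hypothetical)."""
    global load_count
    load_count += 1
    # A real loader would return the model; this toy "model" maps text to
    # a tiny fixed-size vector (length and number of spaces).
    return lambda text: [float(len(text)), float(text.count(' '))]


def embed(text: str, model_url: str = 'example://small-bert') -> list:
    model = get_model(model_url)  # loaded on the first call, cached afterwards
    return model(text)


vectors = [embed(s) for s in ('Pablo Picasso', 'Cubism', 'Blue Period')]
print(load_count, vectors)
```

Three calls to embed() trigger a single load, because later calls with the same model_url hit the cache.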
Deleting an index
To delete an index, call Table.drop_embedding_index():
- specify the idx_name parameter if you have multiple indices
- otherwise the column_name parameter is sufficient
Given that we have several embedding indices, we’ll specify which index
to drop:
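A minimal sketch of that call follows. The table path and the wrapper function are hypothetical, and bert_idx is used as the default only because it is one of the index names mentioned above; the index actually dropped in the original example may be a different one.

```python
def drop_text_index(idx_name: str = 'bert_idx',
                    table_path: str = 'indices_demo.text_tbl') -> None:
    """Drop one named embedding index from a table that has several."""
    import pixeltable as pxt  # deferred so the definition has no dependencies

    txts = pxt.get_table(table_path)  # hypothetical table path

    # With multiple indexes on the table, the index is identified by name.
    txts.drop_embedding_index(idx_name=idx_name)
```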