Embedding and Vector Indexes

A hands-on guide to creating, managing, and utilizing vector/embedding indexes in Pixeltable.

Kaggle Colab

Working with Embedding/Vector Indexes

If you are running this tutorial in Colab:
In order to make the tutorial run a bit snappier, let's switch to a GPU-equipped instance for this Colab session. To do that, click on the Runtime -> Change runtime type menu item at the top, then select the GPU radio button and click on Save.

Main takeaways:

  • Indexing in Pixeltable is declarative
    • you create an index on a column and supply the embedding functions you want to use (for inserting data into the index as well as lookups)
    • Pixeltable maintains the index in response to any kind of update of the indexed table (i.e., insert()/update()/delete())
  • Perform index lookups with the similarity() pseudo-function, in combination with the order_by() and limit() clauses

To make this concrete, let's create a table of images with the create_table() function.
We're also going to add some additional columns, to demonstrate combining similarity search with other predicates.

%pip install -qU pixeltable transformers sentence_transformers
import pixeltable as pxt

# Delete the `indices_demo` directory and its contents, if it exists
pxt.drop_dir('indices_demo', force=True)

# Create the directory and table to use for the demo
pxt.create_dir('indices_demo')
schema = {
    'id': pxt.Int,
    'img': pxt.Image,
}
imgs = pxt.create_table('indices_demo.img_tbl', schema)
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `indices_demo`.
Created table `img_tbl`.

We start out by inserting 10 rows:

img_urls = [
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000030.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000034.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000042.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000049.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000057.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000061.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000063.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000064.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000069.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000071.jpg',
]
imgs.insert({'id': i, 'img': url} for i, url in enumerate(img_urls))
Computing cells:   0%|                                                   | 0/20 [00:00<?, ? cells/s]
Inserting rows into `img_tbl`: 10 rows [00:00, 4426.71 rows/s]
Computing cells: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 20/20 [00:00<00:00, 647.69 cells/s]
Inserted 10 rows with 0 errors.

UpdateStatus(num_rows=10, num_computed_values=20, num_excs=0, updated_cols=[], cols_with_excs=[])

For the sake of convenience, we're storing the images as external URLs, which are cached transparently by Pixeltable. For details on working with external media files, see Working with External Files.

Creating an index

To create and populate an index, we call Table.add_embedding_index() and tell it which functions to use to create embeddings. That definition is persisted as part of the table's metadata, which allows Pixeltable to maintain the index in response to updates to the table.

We're adding an index to an image column, so we must specify an image embedding in the call to add_embedding_index(). The image embedding will be used to compute the index. We also have the option to specify a string embedding; this can be useful if we want to later do similarity lookups against text input (we'll see how this works shortly).

You can use any embedding UDF for the index. For this example, we're going to use a CLIP model, which has built-in support in Pixeltable under the pixeltable.functions.huggingface package. As an alternative, you could use an online service such as OpenAI (see pixeltable.functions.openai), or create your own embedding UDF with custom code (we'll see how to do this below).

from pixeltable.functions.huggingface import clip_image, clip_text
import PIL.Image

# create embedding index on the 'img' column
imgs.add_embedding_index(
    'img',
    string_embed=clip_text.using(model_id='openai/clip-vit-base-patch32'),
    image_embed=clip_image.using(model_id='openai/clip-vit-base-patch32')
)
Computing cells: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10/10 [00:03<00:00,  2.78 cells/s]

The first parameter of add_embedding_index() is the name of the column being indexed; the string_embed and image_embed parameters specify the relevant embeddings. Notice the notation we used for them:

clip_text.using(model_id='openai/clip-vit-base-patch32')

clip_text is a general-purpose UDF that can accept any CLIP model available in the Hugging Face model repository. To define an embedding, however, we need to provide a specific embedding function to add_embedding_index(): a function that is not parameterized on model_id. The .using(model_id=...) syntax tells Pixeltable to specialize the clip_text UDF by fixing the model_id parameter to the specific value 'openai/clip-vit-base-patch32'.

If you're familiar with functional programming concepts, you might recognize .using() as a partial function operator. It's a general operator that can be applied to any UDF (not just embedding functions), transforming a UDF with n parameters into one with k parameters by fixing the values of n-k arguments. Python has something similar in the functools package: the functools.partial() operator.

add_embedding_index() provides a few other optional parameters:

  • idx_name: optional name for the index, which needs to be unique for the table; a default name is created if this isn't provided explicitly
  • metric: the metric to use to compute the similarity of two embedding vectors; one of:
    • 'cosine': cosine distance (default)
    • 'ip': inner product
    • 'l2': L2 distance

If desired, you can create multiple indexes on the same column, using different embedding functions. This can be useful to evaluate the effectiveness of different embedding functions side-by-side, or to use embedding functions tailored to specific use cases. In that case, you can provide explicit names for those indexes and then reference them during queries. We'll illustrate that later with an example.

Using the index in queries

To take advantage of an embedding index when querying a table, we use the similarity() pseudo-function, which is invoked as a method on the indexed column, in combination with the order_by() and limit() clauses. First, we'll get a sample image from the table:

# retrieve the 'img' column of some row as a PIL.Image.Image
sample_img = imgs.select(imgs.img).collect()[6]['img']
sample_img

We then call the similarity() pseudo-function as a method on the indexed column and apply order_by() and limit(). We used the default cosine distance when we created the index, so we're going to order by descending similarity (order_by(..., asc=False)):

sim = imgs.img.similarity(sample_img)
res = (
    imgs.order_by(sim, asc=False)  # Order by descending similarity
    .limit(2)  # Limit number of results to 2
    .select(imgs.id, imgs.img, sim)
    .collect()  # Retrieve results now
)
res


We can combine nearest-neighbor/similarity search with standard predicates. Here's the same query, but filtering out the selected sample_img (which we already know has perfect similarity with itself):

res = (
    imgs.order_by(sim, asc=False)
    .where(imgs.id != 6)  # Additional clause
    .limit(2)
    .select(imgs.id, imgs.img, sim)
    .collect()
)
res


Index updates

In Pixeltable, each index is kept up-to-date automatically in response to changes to the indexed table.

To illustrate this, let's insert a few more rows:

more_img_urls = [
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000080.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000090.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000106.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000108.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000139.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000285.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000632.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000724.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000776.jpg',
    'https://raw.github.com/pixeltable/pixeltable/release/docs/resources/images/000000000785.jpg',
]
imgs.insert({'id': 10 + i, 'img': url} for i, url in enumerate(more_img_urls))
Computing cells: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 30/30 [00:00<00:00, 127.26 cells/s]
Inserting rows into `img_tbl`: 10 rows [00:00, 1501.29 rows/s]
Computing cells: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 30/30 [00:00<00:00, 110.92 cells/s]
Inserted 10 rows with 0 errors.

UpdateStatus(num_rows=10, num_computed_values=30, num_excs=0, updated_cols=[], cols_with_excs=[])

When we now re-run the initial similarity query, we get a different result:

sim = imgs.img.similarity(sample_img)
res = (
    imgs.order_by(sim, asc=False)
    .limit(2)
    .select(imgs.id, imgs.img, sim)
    .collect()
)
res


Similarity search on different types

Since we specified a string_embed function when we created the index, we can also do lookups by text, taking full advantage of multimodal embeddings.

sim = imgs.img.similarity('train')  # String lookup
res = (
    imgs.order_by(sim, asc=False)
    .limit(2)
    .select(imgs.id, imgs.img, sim)
    .collect()
)
res


Creating multiple indexes on a single column

We can create multiple embedding indexes on the same column, utilizing different embedding models. In order to use a specific index in a query, we need to assign it a name and then use that name in the query.

To illustrate this, let's create a table with text (taken from the Wikipedia article on Pablo Picasso):

txts = pxt.create_table('indices_demo.text_tbl', {'text': pxt.String})
sentences = [
    "Pablo Ruiz Picasso (25 October 1881 – 8 April 1973) was a Spanish painter, sculptor, printmaker, ceramicist, and theatre designer who spent most of his adult life in France.",
    "One of the most influential artists of the 20th century, he is known for co-founding the Cubist movement, the invention of constructed sculpture,[8][9] the co-invention of collage, and for the wide variety of styles that he helped develop and explore.",
    "Among his most famous works are the proto-Cubist Les Demoiselles d'Avignon (1907) and the anti-war painting Guernica (1937), a dramatic portrayal of the bombing of Guernica by German and Italian air forces during the Spanish Civil War.",
    "Picasso demonstrated extraordinary artistic talent in his early years, painting in a naturalistic manner through his childhood and adolescence.",
    "During the first decade of the 20th century, his style changed as he experimented with different theories, techniques, and ideas.",
    "After 1906, the Fauvist work of the older artist Henri Matisse motivated Picasso to explore more radical styles, beginning a fruitful rivalry between the two artists, who subsequently were often paired by critics as the leaders of modern art.",
    "Picasso's output, especially in his early career, is often periodized.",
    "While the names of many of his later periods are debated, the most commonly accepted periods in his work are the Blue Period (1901–1904), the Rose Period (1904–1906), the African-influenced Period (1907–1909), Analytic Cubism (1909–1912), and Synthetic Cubism (1912–1919), also referred to as the Crystal period.",
    "Much of Picasso's work of the late 1910s and early 1920s is in a neoclassical style, and his work in the mid-1920s often has characteristics of Surrealism.",
    "His later work often combines elements of his earlier styles.",
]
txts.insert({'text': s} for s in sentences)
Created table `text_tbl`.
Inserting rows into `text_tbl`: 10 rows [00:00, 2815.35 rows/s]
Inserted 10 rows with 0 errors.

UpdateStatus(num_rows=10, num_computed_values=10, num_excs=0, updated_cols=[], cols_with_excs=[])

When calling add_embedding_index(), we now specify the index name (idx_name) directly. If it is not specified, Pixeltable will assign a name (such as idx0).

from pixeltable.functions.huggingface import sentence_transformer

txts.add_embedding_index(
    'text',
    idx_name='minilm_idx',
    string_embed=sentence_transformer.using(model_id='sentence-transformers/all-MiniLM-L12-v2')
)
txts.add_embedding_index(
    'text',
    idx_name='e5_idx',
    string_embed=sentence_transformer.using(model_id='intfloat/e5-large-v2')
)
Computing cells: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10/10 [00:01<00:00,  8.35 cells/s]
Computing cells: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10/10 [00:01<00:00,  6.72 cells/s]

To do a similarity query, we now call similarity() with the idx parameter:

sim = txts.text.similarity('cubism', idx='minilm_idx')
res = txts.order_by(sim, asc=False).limit(2).select(txts.text, sim).collect()
res
text similarity
One of the most influential artists of the 20th century, he is known for co-founding the Cubist movement, the invention of constructed sculpture,[8][9] the co-invention of collage, and for the wide variety of styles that he helped develop and explore. 0.443
While the names of many of his later periods are debated, the most commonly accepted periods in his work are the Blue Period (1901–1904), the Rose Period (1904–1906), the African-influenced Period (1907–1909), Analytic Cubism (1909–1912), and Synthetic Cubism (1912–1919), also referred to as the Crystal period. 0.426

Using a UDF for a custom embedding

The above examples show how to use any model in the Hugging Face CLIP or sentence_transformer model families, and essentially the same pattern can be used for any other embedding with built-in Pixeltable support, such as OpenAI embeddings. But what if you want to adapt a new model family that doesn't have built-in support in Pixeltable? This can be done by writing a custom Pixeltable UDF.

In the following example, we'll write a simple UDF to use the BERT model built on TensorFlow. First we install the necessary dependencies.

%pip install -qU tensorflow tensorflow-hub tensorflow-text

Text embedding UDFs must always take a string as input, and return a 1-dimensional numpy array of fixed dimension (512 in the case of small_bert, the variant we'll be using). If we were writing an image embedding UDF, the input would have type PIL.Image.Image rather than str. The UDF is straightforward, loading the model and evaluating it against the input, with a minor data conversion on either side of the model invocation.

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # Necessary to ensure BERT dependencies are loaded
import pixeltable as pxt

@pxt.udf
def bert(input: str) -> pxt.Array[(512,), pxt.Float]:
    """Computes text embeddings using the small_bert model."""
    preprocessor = hub.load('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3')
    bert_model = hub.load('https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2')
    tensor = tf.constant([input])  # Convert the string to a tensor
    result = bert_model(preprocessor(tensor))['pooled_output']
    return result.numpy()[0, :]
txts.add_embedding_index(
    'text',
    idx_name='bert_idx',
    string_embed=bert
)
Computing cells: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10/10 [00:17<00:00,  1.72s/ cells]

Here's the output of our sample query run against bert_idx.

sim = txts.text.similarity('cubism', idx='bert_idx')
res = txts.order_by(sim, asc=False).limit(2).select(txts.text, sim).collect()
res
text similarity
Picasso's output, especially in his early career, is often periodized. 0.699
During the first decade of the 20th century, his style changed as he experimented with different theories, techniques, and ideas. 0.697

Our example UDF is very simple, but it would perform poorly in a production setting. To make our UDF production-ready, we'd want to do two things:

  • Cache the model: the current version calls hub.load() on every UDF invocation. In a real application, we'd want to instantiate the model just once, then reuse it on subsequent UDF calls.
  • Batch our inputs: we'd use Pixeltable's batching capability to ensure we're making efficient use of the model. Batched UDFs are described in depth in the User-Defined Functions how-to guide.

You might have noticed that the updates to bert_idx seem sluggish; that's why!

Deleting an index

To delete an index, call Table.drop_embedding_index():

  • specify the idx_name parameter if you have multiple indices
  • otherwise the column_name parameter is sufficient

Given that we have several embedding indices, we'll specify which index to drop:

txts.drop_embedding_index(idx_name='e5_idx')