Embedding and Vector Indexes
A hands-on guide to creating, managing, and utilizing vector/embedding indexes in Pixeltable to unlock the power of search for your ML tasks.
Working with Embedding/Vector Indexes
If you are running this tutorial in Colab:
In order to make the tutorial run a bit snappier, let's switch to a GPU-equipped instance for this Colab session. To do that, click on the Runtime -> Change runtime type menu item at the top, then select the GPU radio button and click on Save.
Main takeaways:
- Indexing in Pixeltable is declarative
  - you create an index on a column and supply the embedding functions you want to use (for inserting data into the index as well as lookups)
  - Pixeltable maintains the index in response to any kind of update of the indexed table (i.e., insert()/update()/delete())
- Perform index lookups with the similarity() pseudo-function, in combination with the order_by() and limit() clauses
To make this concrete, let's create a table of images with the create_table() function.
We're also going to add some additional columns, to demonstrate combining similarity search with other predicates.
%pip install -qU pixeltable transformers sentence_transformers
import pixeltable as pxt
pxt.drop_dir('indices_demo', force=True) # Ensure a clean slate for the tutorial
pxt.create_dir('indices_demo')
schema = {
    'img': pxt.ImageType(),
    'bucket': pxt.IntType(),
    'color': pxt.StringType(),
}
imgs = pxt.create_table('indices_demo.img_tbl', schema)
Connected to Pixeltable database at: postgresql://postgres:@/pixeltable?host=/home/marcel/.pixeltable/pgdata
Created directory `indices_demo`.
Created table `img_tbl`.
We start out by inserting 10 rows:
img_urls = [
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000030.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000034.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000042.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000049.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000057.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000061.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000063.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000064.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000069.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000071.jpg',
]
buckets = [0, 1]
colors = ['red', 'green', 'blue']
imgs.insert(
    {'img': url, 'bucket': buckets[i % len(buckets)], 'color': colors[i % len(colors)]}
    for i, url in enumerate(img_urls)
)
Computing cells:  80%|████████  | 8/10 [00:00<00:00, 12.25 cells/s]
Inserting rows into `img_tbl`: 10 rows [00:00, 1035.17 rows/s]
Computing cells: 100%|██████████| 10/10 [00:00<00:00, 15.06 cells/s]
Inserted 10 rows with 0 errors.
UpdateStatus(num_rows=10, num_computed_values=10, num_excs=0, updated_cols=[], cols_with_excs=[])
For the sake of convenience, we're storing the images as external URLs, which are cached transparently by Pixeltable. For details on working with external media files, see Working with External Files.
Creating an index
To create and populate an index, we call Table.add_embedding_index() and tell it which functions to use to create string and image embeddings. That definition is persisted as part of the table's metadata, which allows Pixeltable to maintain the index in response to updates to the table.
We're going to use CLIP, for which the Hugging Face version is already available in Pixeltable as a function under pixeltable.functions.huggingface. However, you can use any embedding function you like, via Pixeltable's UDF mechanism (which is described in detail in our guide to user-defined functions).
from pixeltable.functions.huggingface import clip_image, clip_text
import PIL.Image
# create a udf that takes a single string, to use as an embedding function
@pxt.expr_udf
def str_embed(s: str):
    return clip_text(s, model_id='openai/clip-vit-base-patch32')

# create a udf that takes a single image, to use as an embedding function
@pxt.expr_udf
def img_embed(img: PIL.Image.Image):
    return clip_image(img, model_id='openai/clip-vit-base-patch32')

# create embedding index on the 'img' column
imgs.add_embedding_index('img', string_embed=str_embed, image_embed=img_embed)
Computing cells: 100%|██████████| 10/10 [00:00<00:00, 11.44 cells/s]
The first parameter of add_embedding_index() is the name of the column being indexed. Keyword parameters are:
- string_embed: Pixeltable Function to compute embeddings for StringType data
- image_embed: Pixeltable Function to compute embeddings for ImageType data
- idx_name: optional name for the index, which needs to be unique for the table; a default name is created if this isn't provided explicitly
- metric: which metric to use to compute the similarity of two embedding vectors; one of
  - cosine: cosine distance (default)
  - ip: inner product
  - l2: L2 distance
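To make the three metric names concrete, here is a minimal pure-Python sketch of the underlying vector math. The helper names (inner_product, cosine_similarity, l2_distance) are illustrative and not part of Pixeltable's API, and the sketch says nothing about how Pixeltable maps each metric to the value returned by similarity().

```python
import math

def inner_product(a, b):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Inner product divided by the product of the two vector lengths."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

def l2_distance(a, b):
    """Euclidean (straight-line) distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

u = [1.0, 0.0]
v = [1.0, 1.0]
print(inner_product(u, v))                 # 1.0
print(round(cosine_similarity(u, v), 3))   # 0.707
print(l2_distance(u, v))                   # 1.0
```

Note that cosine and inner product grow as vectors become more alike, while L2 distance shrinks, which affects the sort direction you would want in a query.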
If desired, you can create multiple indexes on the same column, using different embedding functions. This can be useful to evaluate the effectiveness of different embedding functions side-by-side, or to use embedding functions tailored to specific use cases. In that case, you can provide explicit names for those indexes and then reference those during queries. We'll illustrate that later with an example.
Using the index in queries
To take advantage of an embedding index when querying a table, we use the similarity() pseudo-function, which is invoked as a method on the indexed column, in combination with the order_by() and limit() clauses. First, we'll get a sample image from the table:
# retrieve the 'img' column of some row as a PIL.Image.Image
sample_img = imgs.select(imgs.img).collect()[6]['img']
sample_img
We then call the similarity() pseudo-function as a method on the indexed column and apply order_by() and limit(). We used the default cosine distance when we created the index, so we're going to order by descending similarity (order_by(..., asc=False)):
sim = imgs.img.similarity(sample_img)
# use 'similarity()' in the order_by() clause and apply a limit in order to utilize the index
res = imgs.order_by(sim, asc=False).limit(2).select(imgs.img, imgs.bucket, imgs.color, sim=sim).collect()
res
img | bucket | color | sim |
---|---|---|---|
[image] | 0 | red | 1. |
[image] | 1 | red | 0.607 |
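Conceptually, ordering by descending similarity and applying a limit is a top-k nearest-neighbor lookup. The following brute-force sketch shows that idea in plain Python; it is not how Pixeltable executes the query (the index exists precisely to avoid scanning every row), and the names top_k and rows as well as the two-dimensional embeddings are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(rows, query_vec, k):
    """Return the k rows most similar to query_vec, best first, as (similarity, row) pairs."""
    scored = [(cosine_similarity(row['embedding'], query_vec), row) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # descending similarity
    return scored[:k]                                    # the 'limit' step

# hypothetical rows with toy embeddings
rows = [
    {'id': 'a', 'embedding': [1.0, 0.0]},
    {'id': 'b', 'embedding': [0.8, 0.6]},
    {'id': 'c', 'embedding': [0.0, 1.0]},
]
results = top_k(rows, query_vec=[1.0, 0.0], k=2)
for score, row in results:
    print(row['id'], round(score, 3))
```

As in the query result above, a row whose embedding equals the query vector comes back first with a similarity of 1.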
We can combine nearest-neighbor/similarity search with standard predicates, like so:
res = imgs.order_by(sim, asc=False).limit(2).where(imgs.bucket != 0).collect()
res
img | bucket | color |
---|---|---|
[image] | 1 | red |
[image] | 1 | green |
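Both rows in this result satisfy the predicate, which matches the usual query semantics: the filter narrows the candidate set, and the limit applies to what remains after ranking. A small sketch of that order of operations, using hypothetical rows with precomputed similarity scores (filtered_top_k is an illustrative helper, not a Pixeltable function):

```python
def filtered_top_k(rows, predicate, score, k):
    """Keep rows that satisfy predicate, then return the k highest-scoring ones."""
    candidates = [row for row in rows if predicate(row)]
    return sorted(candidates, key=score, reverse=True)[:k]

# hypothetical rows; 'sim' stands in for the similarity to some query
rows = [
    {'id': 'a', 'bucket': 0, 'sim': 1.00},
    {'id': 'b', 'bucket': 1, 'sim': 0.61},
    {'id': 'c', 'bucket': 0, 'sim': 0.58},
    {'id': 'd', 'bucket': 1, 'sim': 0.55},
]
result = filtered_top_k(rows, lambda r: r['bucket'] != 0, lambda r: r['sim'], k=2)
print([r['id'] for r in result])  # ['b', 'd']
```

Ranking first and filtering afterwards would have returned only row 'b' here, since the best match overall is excluded by the predicate.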
Index updates
In Pixeltable, each index is kept up-to-date automatically in response to changes to the indexed table.
To illustrate this, let's insert a few more rows:
more_img_urls = [
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000080.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000090.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000106.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000108.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000139.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000285.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000632.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000724.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000776.jpg',
'https://raw.github.com/pixeltable/pixeltable/release/docs/source/data/images/000000000785.jpg',
]
imgs.insert(
    {'img': url, 'bucket': buckets[i % len(buckets)], 'color': colors[i % len(colors)]}
    for i, url in enumerate(more_img_urls)
)
Computing cells:   0%|          | 0/20 [00:00<?, ? cells/s]
Inserting rows into `img_tbl`: 10 rows [00:00, 888.34 rows/s]
Computing cells: 100%|██████████| 20/20 [00:00<00:00, 24.93 cells/s]
Inserted 10 rows with 0 errors.
UpdateStatus(num_rows=10, num_computed_values=20, num_excs=0, updated_cols=[], cols_with_excs=[])
When we now re-run the initial similarity query, we get a different result:
sim = imgs.img.similarity(sample_img)
res = imgs.order_by(sim, asc=False).limit(2).select(imgs.img, imgs.bucket, imgs.color, sim=sim).collect()
res
img | bucket | color | sim |
---|---|---|---|
[image] | 0 | red | 1. |
[image] | 1 | red | 0.617 |
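The automatic maintenance described in this section can be pictured with a toy in-memory table that recomputes embeddings whenever rows change. This is only a sketch of the idea under simplifying assumptions; ToyIndexedTable and toy_embed are invented for illustration and do not reflect Pixeltable's internals.

```python
class ToyIndexedTable:
    """Toy table that keeps one embedding per row in sync with the row data."""

    def __init__(self, embed):
        self.embed = embed   # embedding function, supplied once when the index is declared
        self.rows = {}       # row id -> value
        self.index = {}      # row id -> embedding of that value

    def insert(self, row_id, value):
        self.rows[row_id] = value
        self.index[row_id] = self.embed(value)

    def update(self, row_id, value):
        # recomputing the embedding keeps the index entry current
        self.insert(row_id, value)

    def delete(self, row_id):
        del self.rows[row_id]
        del self.index[row_id]

def toy_embed(text):
    """Hypothetical embedding: [length, number of 'a' characters]."""
    return [float(len(text)), float(text.count('a'))]

tbl = ToyIndexedTable(toy_embed)
tbl.insert(1, 'banana')
tbl.insert(2, 'kiwi')
tbl.update(2, 'papaya')
tbl.delete(1)
print(tbl.index)  # {2: [6.0, 3.0]}
```

The caller never touches the index directly: declaring the embedding function once is enough for every later insert, update, and delete to keep it consistent.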
Creating multiple indexes on a single column
We can create multiple embedding indexes on the same column, utilizing different embedding models. In order to use a specific index in a query, we need to assign it a name and then use that name in the query.
To illustrate this, let's create a table with text (taken from the Wikipedia article on Pablo Picasso):
pxt.drop_table('indices_demo.text_tbl', ignore_errors=True)
txts = pxt.create_table('indices_demo.text_tbl', {'text': pxt.StringType()})
sentences = [
    "Pablo Ruiz Picasso (25 October 1881 – 8 April 1973) was a Spanish painter, sculptor, printmaker, ceramicist, and theatre designer who spent most of his adult life in France.",
    "One of the most influential artists of the 20th century, he is known for co-founding the Cubist movement, the invention of constructed sculpture,[8][9] the co-invention of collage, and for the wide variety of styles that he helped develop and explore.",
    "Among his most famous works are the proto-Cubist Les Demoiselles d'Avignon (1907) and the anti-war painting Guernica (1937), a dramatic portrayal of the bombing of Guernica by German and Italian air forces during the Spanish Civil War.",
    "Picasso demonstrated extraordinary artistic talent in his early years, painting in a naturalistic manner through his childhood and adolescence.",
    "During the first decade of the 20th century, his style changed as he experimented with different theories, techniques, and ideas.",
    "After 1906, the Fauvist work of the older artist Henri Matisse motivated Picasso to explore more radical styles, beginning a fruitful rivalry between the two artists, who subsequently were often paired by critics as the leaders of modern art.",
    "Picasso's output, especially in his early career, is often periodized.",
    "While the names of many of his later periods are debated, the most commonly accepted periods in his work are the Blue Period (1901–1904), the Rose Period (1904–1906), the African-influenced Period (1907–1909), Analytic Cubism (1909–1912), and Synthetic Cubism (1912–1919), also referred to as the Crystal period.",
    "Much of Picasso's work of the late 1910s and early 1920s is in a neoclassical style, and his work in the mid-1920s often has characteristics of Surrealism.",
    "His later work often combines elements of his earlier styles.",
]
txts.insert({'text': s} for s in sentences)
Created table `text_tbl`.
Computing cells:   0%|          | 0/10 [00:00<?, ? cells/s]
Inserting rows into `text_tbl`: 10 rows [00:00, 4925.78 rows/s]
Computing cells: 100%|██████████| 10/10 [00:00<00:00, 2843.79 cells/s]
Inserted 10 rows with 0 errors.
UpdateStatus(num_rows=10, num_computed_values=10, num_excs=0, updated_cols=[], cols_with_excs=[])
When calling add_embedding_index(), we now specify the index name (idx_name) directly. If it is not specified, Pixeltable will assign a name (such as idx0).
from pixeltable.functions.huggingface import sentence_transformer

@pxt.expr_udf
def minilm_embed(s: str):
    return sentence_transformer(s, model_id='sentence-transformers/all-MiniLM-L12-v2')

@pxt.expr_udf
def e5_embed(s: str):
    return sentence_transformer(s, model_id='intfloat/e5-large-v2')

txts.add_embedding_index('text', idx_name='minilm_idx', string_embed=minilm_embed)
txts.add_embedding_index('text', idx_name='e5_idx', string_embed=e5_embed)
Computing cells: 100%|██████████| 10/10 [00:00<00:00, 60.65 cells/s]
Computing cells: 100%|██████████| 10/10 [00:00<00:00, 134.07 cells/s]
To do a similarity query, we now call similarity() with the idx parameter:
sim = txts.text.similarity('cubism', idx='minilm_idx')
res = txts.order_by(sim, asc=False).limit(2).select(txts.text, sim).collect()
res
text | col_1 |
---|---|
One of the most influential artists of the 20th century, he is known for co-founding the Cubist movement, the invention of constructed sculpture,[8][9] the co-invention of collage, and for the wide variety of styles that he helped develop and explore. | 0.443 |
While the names of many of his later periods are debated, the most commonly accepted periods in his work are the Blue Period (1901–1904), the Rose Period (1904–1906), the African-influenced Period (1907–1909), Analytic Cubism (1909–1912), and Synthetic Cubism (1912–1919), also referred to as the Crystal period. | 0.426 |
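Why the index name matters can be shown with a toy example: two indexes on the same values, built with different embedding functions, can disagree about the nearest neighbor of the same query. Everything below (ToyMultiIndex, length_embed, vowel_embed, and the index names) is hypothetical and stands in for real embedding models only in spirit.

```python
def length_embed(text):
    """Hypothetical embedding: the length of the string."""
    return [float(len(text))]

def vowel_embed(text):
    """Hypothetical embedding: the number of vowels in the string."""
    return [float(sum(text.count(v) for v in 'aeiou'))]

class ToyMultiIndex:
    """Toy column that holds several named indexes over the same values."""

    def __init__(self, values):
        self.values = list(values)
        self.indexes = {}  # index name -> (embedding function, stored embeddings)

    def add_index(self, name, embed):
        self.indexes[name] = (embed, [embed(v) for v in self.values])

    def nearest(self, query, idx):
        embed, vectors = self.indexes[idx]
        q = embed(query)
        # toy score: negative L1 distance, so a larger score means a closer match
        scores = [-sum(abs(a - b) for a, b in zip(vec, q)) for vec in vectors]
        best = max(range(len(self.values)), key=lambda i: scores[i])
        return self.values[best]

col = ToyMultiIndex(['sky', 'ocean', 'tree'])
col.add_index('length_idx', length_embed)
col.add_index('vowel_idx', vowel_embed)
print(col.nearest('bee', idx='length_idx'))  # sky
print(col.nearest('bee', idx='vowel_idx'))   # tree
```

The same query returns different rows depending on which named index is consulted, which is why a query against a column with several indexes has to say which one it means.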
Deleting an index
To delete an index, call Table.drop_embedding_index():
- specify the idx_name parameter if you have multiple indices
- otherwise the column_name parameter is sufficient
Given that we have two embedding indices, we'll specify which index to drop:
txts.drop_embedding_index(idx_name='minilm_idx')
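The rule for choosing between idx_name and column_name can be sketched as a small resolution function. This is an illustration of the rule as stated above, not Pixeltable's implementation: resolve_index_to_drop is a hypothetical helper, and raising an error for an ambiguous column name is an assumption.

```python
def resolve_index_to_drop(indexes, column_name=None, idx_name=None):
    """Return the name of the index to drop.

    indexes maps index name -> name of the indexed column.
    """
    if idx_name is not None:
        if idx_name not in indexes:
            raise KeyError(f'no index named {idx_name!r}')
        return idx_name
    matches = [name for name, col in indexes.items() if col == column_name]
    if len(matches) != 1:
        # assumption: an ambiguous or unindexed column name is rejected
        raise ValueError(f'column {column_name!r} has {len(matches)} indexes; specify idx_name')
    return matches[0]

indexes = {'minilm_idx': 'text', 'e5_idx': 'text'}
dropped = resolve_index_to_drop(indexes, idx_name='minilm_idx')
del indexes[dropped]
# With a single index left on the column, the column name alone is unambiguous.
print(resolve_index_to_drop(indexes, column_name='text'))  # e5_idx
```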