Pixeltable unifies data and computation into a table interface. In this tutorial, we'll go into more depth on the Hugging Face integration: how Hugging Face datasets can be loaded into Pixeltable tables, and how Hugging Face models can be incorporated into Pixeltable workflows to run locally.
%pip install -qU pixeltable datasets torch transformers tiktoken spacy
Now let’s load the Hugging Face dataset, as described in the Hugging Face documentation.
import datasets

padoru = (
    datasets.load_dataset("not-lain/padoru", split='train')
    .select_columns(['Image', 'ImageSize', 'Name', 'ImageSource'])
)
The loaded dataset retains Hugging Face's metadata about which split (train, test, or validation) the data belongs to.
padoru
Dataset({
    features: ['Image', 'ImageSize', 'Name', 'ImageSource'],
    num_rows: 382
})

Create a Pixeltable Table from a Hugging Face Dataset

Now we create a table; Pixeltable maps the dataset's column types to its own types as needed. Check out other ways to bring data into Pixeltable with the pixeltable.io module, which supports CSV, Parquet, pandas, JSON, and other formats.
import pixeltable as pxt

pxt.drop_dir('hf_demo', force=True)
pxt.create_dir('hf_demo')
t = pxt.create_table('hf_demo.padoru', source=padoru)
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `hf_demo`.
Created table `padoru_tmp_8951741`.
Computing cells: 100%|████████████████████████████████████████| 504/504 [00:08<00:00, 60.68 cells/s]
Inserted 126 rows with 0 errors.
Computing cells: 100%|████████████████████████████████████████| 504/504 [00:07<00:00, 69.45 cells/s]
Inserted 126 rows with 0 errors.
Computing cells: 100%|████████████████████████████████████████| 504/504 [00:07<00:00, 70.04 cells/s]
Inserted 126 rows with 0 errors.
Computing cells: 100%|██████████████████████████████████████████| 16/16 [00:00<00:00, 66.64 cells/s]
Inserted 4 rows with 0 errors.
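To make the type mapping concrete, here is an illustrative sketch (an assumption for illustration only, not Pixeltable's actual implementation) of how the four Hugging Face features selected above correspond to Pixeltable column types:

```python
# Hypothetical mapping from the Hugging Face features of this dataset to
# Pixeltable column types; the names on the right are illustrative labels,
# not output of Pixeltable itself.
feature_to_pxt_type = {
    "Image": "pxt.Image",         # datasets.Image feature -> image column
    "ImageSize": "pxt.Int",       # integer feature -> int column
    "Name": "pxt.String",         # string feature -> string column
    "ImageSource": "pxt.String",  # string feature (a URL) -> string column
}

for col, pxt_type in feature_to_pxt_type.items():
    print(f"{col}: {pxt_type}")
```

In practice you don't write this mapping yourself; passing the dataset as `source=` lets Pixeltable infer it.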
t.head(3)
Image    ImageSize  Name           ImageSource
<image>  240993     AI-Chan        https://knowyourmeme.com/photos/1439336-padoru
<image>  993097     Platelet       https://knowyourmeme.com/photos/1438687-padoru
<image>  255549     Nezuko Kamado  https://knowyourmeme.com/photos/1568913-padoru

Leveraging Hugging Face Models with Pixeltable’s Embedding Functionality

Pixeltable contains built-in adapters for certain model families, so all we have to do is call the corresponding Pixeltable function for Hugging Face. A nice property of Hugging Face models is that they run locally, so you don't need an account with a service provider to use them. Pixeltable can also create and populate an index with table.add_embedding_index() for string and image embeddings. The index definition is persisted as part of the table's metadata, which allows Pixeltable to maintain the index in response to updates to the table. In this example we use CLIP, but you can use any embedding function you like via Pixeltable's UDF mechanism (described in detail in our guide to user-defined functions).
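To make the idea concrete, an embedding function is simply a callable that maps an input to a fixed-length vector. The toy function below (a hypothetical character-frequency "embedding", not a real model and not Pixeltable's API) illustrates the shape of such a function; in Pixeltable you would register a real one as a UDF as described in the user-defined-functions guide:

```python
def toy_text_embedding(text: str, dim: int = 26) -> list[float]:
    """Hypothetical embedding: normalized letter frequencies (illustration only)."""
    counts = [0] * dim
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            counts[ord(ch) - ord('a')] += 1
    total = sum(counts) or 1  # avoid division by zero for empty input
    return [c / total for c in counts]

vec = toy_text_embedding("Padoru")
print(len(vec))  # every input maps to the same dimensionality, as an index requires
```

The key property an embedding index relies on is that every input produces a vector of the same dimensionality, which is what CLIP provides for both text and images below.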
from pixeltable.functions.huggingface import clip
import PIL.Image

# create embedding index on the 'Image' column
t.add_embedding_index(
    'Image',
    embedding=clip.using(model_id='openai/clip-vit-base-patch32')
)
Computing cells: 100%|████████████████████████████████████████| 382/382 [00:16<00:00, 22.63 cells/s]
sample_img = t.select(t.Image).head(1)[0]['Image']

sim = t.Image.similarity(sample_img)

# use 'similarity()' in the order_by() clause and apply a limit in order to utilize the index
t.order_by(sim, asc=False).limit(3).select(t.Image, sim=sim).collect()
Image    sim
<image>  1.0
<image>  0.963
<image>  0.961
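Under the hood, similarity() compares embedding vectors; for CLIP-style embeddings this is typically cosine similarity. A minimal pure-Python sketch of the metric (an assumption about what the index computes, not Pixeltable's internal code):

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# A vector compared with itself scores ~1.0, which is why the sample image
# ranks first against its own query above.
print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

Ordering by this score descending and applying a limit, as in the query above, is what lets Pixeltable serve the lookup from the vector index instead of scanning every row.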
You can learn more about leveraging indexes in our tutorial: Working with Embedding and Vector Indexes.