Working with Hugging Face

Pixeltable unifies data and computation into a table interface. This tutorial takes a closer look at the Hugging Face integration: how to import Hugging Face datasets into Pixeltable, and how to incorporate Hugging Face models into Pixeltable workflows so that they run locally.

%pip install pixeltable datasets -qU
import pixeltable as pxt
import datasets

Now let's load the Hugging Face dataset. The Hugging Face datasets documentation describes the other available load methods.

Padoru = datasets.load_dataset("not-lain/padoru", split='train').select_columns(['Image', 'ImageSize', 'Name', 'ImageSource'])

When a dataset is imported, Pixeltable can preserve the Hugging Face information about whether each row belongs to the train, test, or validation split. Here we loaded only the train split:

Padoru
Dataset({
    features: ['Image', 'ImageSize', 'Name', 'ImageSource'],
    num_rows: 382
})
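To make the idea of preserving split membership concrete, here is a minimal pure-Python sketch. It is not Pixeltable's import code; the helper `flatten_splits`, the column name `split`, and the toy rows are all assumptions made for illustration. It merges several splits into one list of rows and tags each row with the split it came from.

```python
from typing import Dict, List


def flatten_splits(splits: Dict[str, List[dict]], split_column: str = "split") -> List[dict]:
    """Merge per-split row lists into one list, tagging each row with its split name.

    Hypothetical helper for illustration only; not part of Pixeltable or datasets.
    """
    rows = []
    for split_name, split_rows in splits.items():
        for row in split_rows:
            tagged = dict(row)  # copy so the input rows stay untouched
            tagged[split_column] = split_name
            rows.append(tagged)
    return rows


# Toy data standing in for a dataset with two splits.
toy_splits = {
    "train": [{"Name": "a"}, {"Name": "b"}],
    "test": [{"Name": "c"}],
}

rows = flatten_splits(toy_splits)

# With the split stored as an ordinary column, filtering by split is a plain predicate.
train_rows = [r for r in rows if r["split"] == "train"]
print(len(rows), len(train_rows))
```

Keeping the split as a regular column means a single table can hold every split while still allowing queries restricted to one of them.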

Create a Pixeltable Table from a Hugging Face Dataset

Now we create a table, and Pixeltable maps the column types as needed. The pixeltable.io module offers other ways to bring data into Pixeltable, including CSV, Parquet, pandas, and JSON, among others.

t = pxt.io.import_huggingface_dataset('padoru', Padoru)
Creating a Pixeltable instance at: /root/.pixeltable
Connected to Pixeltable database at:
postgresql+psycopg://postgres:@/pixeltable?host=/root/.pixeltable/pgdata
Created table `padoru_tmp_48836957`.
Inserting rows into `padoru_tmp_48836957`: 126 rows [00:00, 4905.16 rows/s]
Inserted 126 rows with 0 errors.
Inserting rows into `padoru_tmp_48836957`: 126 rows [00:00, 5605.04 rows/s]
Inserted 126 rows with 0 errors.
Inserting rows into `padoru_tmp_48836957`: 126 rows [00:00, 5182.47 rows/s]
Inserted 126 rows with 0 errors.
Inserting rows into `padoru_tmp_48836957`: 4 rows [00:00, 735.65 rows/s]
Inserted 4 rows with 0 errors.
t.show(3)

Leveraging Hugging Face Models with Pixeltable's Embedding Functionality

Pixeltable contains a built-in adapter for certain model families, so all we have to do is call the corresponding Pixeltable function for Hugging Face. A nice thing about Hugging Face models is that they run locally, so you don't need an account with a service provider to use them.

Pixeltable can also create and populate an index for string and image embeddings with table.add_embedding_index(). The index definition is persisted as part of the table's metadata, which allows Pixeltable to maintain the index in response to updates to the table.

In this example we are using CLIP. You can use any embedding function you like via Pixeltable's UDF mechanism, which is described in detail in our guide to user-defined functions.

from pixeltable.functions.huggingface import clip_image, clip_text
import PIL.Image

# create a udf that takes a single string, to use as an embedding function
@pxt.expr_udf
def str_embed(s: str):
    return clip_text(s, model_id='openai/clip-vit-base-patch32')

# create a udf that takes a single image, to use as an embedding function
@pxt.expr_udf
def img_embed(Image: PIL.Image.Image):
    return clip_image(Image, model_id='openai/clip-vit-base-patch32')

# create embedding index on the 'Image' column
t.add_embedding_index('Image', string_embed=str_embed, image_embed=img_embed)
Computing cells: 100%|████████████████████████████████████████| 382/382 [00:36<00:00, 10.40 cells/s]
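The index maintenance described earlier, where the stored index definition lets the table keep the index current as rows change, can be pictured with a toy model. The sketch below is a simplified illustration and not Pixeltable's implementation: `ToyIndexedTable` and `toy_embed` are hypothetical names, and the "embedding" is a trivial character count rather than a CLIP vector.

```python
from typing import Callable, List, Optional


def toy_embed(text: str) -> List[float]:
    """Hypothetical embedding: counts of vowels and of all other characters."""
    vowels = sum(ch in "aeiou" for ch in text.lower())
    return [float(vowels), float(len(text) - vowels)]


class ToyIndexedTable:
    """Toy stand-in for a table that stores its embedding function alongside its rows."""

    def __init__(self) -> None:
        self.rows: List[str] = []
        self.index: List[List[float]] = []  # one vector per row, in row order
        self.embed_fn: Optional[Callable[[str], List[float]]] = None

    def add_embedding_index(self, embed_fn: Callable[[str], List[float]]) -> None:
        # Remember the function with the table and backfill vectors for existing rows.
        self.embed_fn = embed_fn
        self.index = [embed_fn(row) for row in self.rows]

    def insert(self, row: str) -> None:
        self.rows.append(row)
        if self.embed_fn is not None:
            # Because the function is stored, new rows are embedded automatically.
            self.index.append(self.embed_fn(row))


table = ToyIndexedTable()
table.insert("padoru")                # inserted before the index exists
table.add_embedding_index(toy_embed)  # backfills the existing row
table.insert("santa hat")             # indexed on insert
print(len(table.rows), len(table.index))
```

The point of the sketch is that the caller never recomputes embeddings by hand: once the index is defined, every later insert keeps rows and vectors in step.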
sample_img = t.select(t.Image).collect()[6]['Image']

sim = t.Image.similarity(sample_img)

# use 'similarity()' in the order_by() clause and apply a limit in order to utilize the index
res = t.order_by(sim, asc=False).limit(2).select(t.Image, sim=sim).collect()
res
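Conceptually, the query above ranks rows by how similar their embedding is to the embedding of the sample image, then keeps the top results. The following sketch shows that ranking as a brute-force scan, assuming cosine similarity as the metric. The names `cosine_similarity` and `top_k` and the two-dimensional vectors are illustrative assumptions; an actual vector index exists precisely to avoid scanning every row like this.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    """Return (row position, similarity) for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:k]


# Toy embeddings standing in for the stored image vectors.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Query with a vector that is itself in the collection, as the sample image is.
results = top_k(vectors[0], vectors, k=2)
print(results)
```

Because the query vector comes from a row already in the collection, that row ranks first with a similarity of 1.0. For the same reason, the sample image is expected to appear as its own best match in the query result.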

You can learn more about how to leverage indexes in our tutorial, Working with Embedding and Vector Indexes.