This documentation page is also available as an interactive notebook. You can launch the notebook in
Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the
above links.
Pixeltable provides seamless integration with Hugging Face datasets and
models. This tutorial covers:
- Importing datasets directly into Pixeltable tables
- Working with dataset splits (train/test/validation)
- Streaming large datasets with IterableDataset
- Type mappings from Hugging Face to Pixeltable
- Using Hugging Face models for embeddings
Setup
%pip install -qU pixeltable datasets torch transformers sentence-transformers
Import a Hugging Face Dataset
Use pxt.create_table() with the source= parameter to import a
Hugging Face dataset directly. Pixeltable automatically maps Hugging
Face feature types to Pixeltable column types.
import pixeltable as pxt
import datasets
pxt.drop_dir('hf_demo', force=True)
pxt.create_dir('hf_demo')
# Load a dataset with images
padoru = (
    datasets.load_dataset("not-lain/padoru", split='train')
    .select_columns(['Image', 'ImageSize', 'Name', 'ImageSource'])
)
# Import into Pixeltable
images = pxt.create_table('hf_demo.images', source=padoru)
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘hf_demo’.
Created table ‘images’.
Inserting rows into `images`: 100 rows [00:00, 310.24 rows/s]
Inserting rows into `images`: 100 rows [00:00, 353.22 rows/s]
Inserting rows into `images`: 100 rows [00:00, 368.40 rows/s]
Inserting rows into `images`: 82 rows [00:00, 567.89 rows/s]
Inserted 382 rows with 0 errors.
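A quick way to confirm the import and see the inferred columns is to read a few rows back out. This is a minimal sketch using the columns selected above:
# Peek at the imported table
images.select(images.Name, images.ImageSize, images.Image).limit(3).collect()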
Working with Dataset Splits
When importing a DatasetDict (which contains multiple splits like
train/test), use extra_args={'column_name_for_split': 'split'} to
preserve split information in a column.
# Load a dataset with multiple splits
imdb = datasets.load_dataset('stanfordnlp/imdb')
# Import all splits, storing split info in 'split' column
reviews = pxt.create_table(
    'hf_demo.reviews',
    source=imdb,
    extra_args={'column_name_for_split': 'split'}
)
# Query by split
reviews.where(reviews.split == 'train').limit(3).select(reviews.text, reviews.label, reviews.split).collect()
# Count rows per split
reviews.group_by(reviews.split).select(reviews.split, count=pxt.functions.count(reviews.text)).collect()
Using schema_overrides for Embeddings
When importing datasets with pre-computed embeddings (common in RAG),
use schema_overrides to specify the exact array shape:
# Wikipedia with pre-computed embeddings - specify array shape
wiki_ds = (
    datasets.load_dataset('Cohere/wikipedia-2023-11-embed-multilingual-v3', 'simple', split='train', streaming=True)
    .select_columns(['url', 'title', 'text', 'emb'])
    .take(50)
)
wiki = pxt.create_table(
    'hf_demo.wiki_embeddings',
    source=wiki_ds,
    schema_overrides={'emb': pxt.Array[(1024,), pxt.Float]}
)
wiki.select(wiki.title, wiki.emb).limit(2).collect()
Streaming Large Datasets
For very large datasets, use streaming=True to filter and sample
before importing:
# Stream, filter, and sample before importing
streaming_ds = datasets.load_dataset('stanfordnlp/imdb', split='train', streaming=True)
positive_stream = streaming_ds.filter(lambda x: x['label'] == 1).take(50)
positive_samples = pxt.create_table('hf_demo.positive_samples', source=positive_stream)
positive_samples.select(positive_samples.text, positive_samples.label).limit(2).collect()
Importing Audio Datasets
Audio datasets work seamlessly - Pixeltable stores audio files locally:
# Import a small audio dataset
audio_ds = datasets.load_dataset(
    'hf-internal-testing/librispeech_asr_dummy', 'clean', split='validation'
)
audio_table = pxt.create_table('hf_demo.audio_samples', source=audio_ds)
audio_table.select(audio_table.audio, audio_table.text).limit(2).collect()
Created table ‘audio_samples’.
Inserting rows into `audio_samples`: 73 rows [00:00, 3960.27 rows/s]
Inserted 73 rows with 0 errors.
Inserting More Data
Use table.insert() to add more data from a Hugging Face dataset to an
existing table:
# Insert more data from the same or similar dataset
more_audio = datasets.load_dataset(
    'hf-internal-testing/librispeech_asr_dummy', 'clean', split='validation'
).select(range(5))
audio_table.insert(more_audio)
audio_table.count()
Inserting rows into `audio_samples`: 5 rows [00:00, 3186.68 rows/s]
Inserted 5 rows with 0 errors.
78
Type Mappings Reference
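Pixeltable infers each column's type from the dataset's Hugging Face feature type when it creates the table. The feature types encountered in this notebook map as follows (not an exhaustive list; see the SDK reference for the complete set):
- Image -> pxt.Image
- Audio -> pxt.Audio
- Value('string') -> pxt.String
- ClassLabel -> pxt.Int (the integer label index)
- Fixed-length float sequences -> pxt.Array (use schema_overrides to pin the exact shape, as shown above)
You can inspect the Hugging Face side of the mapping directly. For example, for the IMDB dataset imported earlier:
# Hugging Face feature types for the IMDB dataset
imdb['train'].features
# e.g. {'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}
# In the resulting 'reviews' table, 'text' is a String column and 'label' an Int column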
Using Hugging Face Models
Pixeltable integrates with Hugging Face models for embeddings and
inference, running locally without API keys.
Image Embeddings with CLIP
from pixeltable.functions.huggingface import clip
# Add CLIP embedding index for cross-modal image search
images.add_embedding_index(
    'Image',
    embedding=clip.using(model_id='openai/clip-vit-base-patch32')
)
# Search images using text
sim = images.Image.similarity(string='anime character with red clothes')
images.order_by(sim, asc=False).limit(3).select(images.Image, images.Name, sim=sim).collect()
Text Embeddings with Sentence Transformers
from pixeltable.functions.huggingface import sentence_transformer
# Create table with text embedding index
sample_reviews = pxt.create_table(
    'hf_demo.sample_reviews',
    source=datasets.load_dataset('stanfordnlp/imdb', split='test').select(range(100))
)
sample_reviews.add_embedding_index('text', string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2'))
# Semantic search
query = "great acting and cinematography"
sim = sample_reviews.text.similarity(string=query)
sample_reviews.order_by(sim, asc=False).limit(3).select(sample_reviews.text, sim=sim).collect()
Created table ‘sample_reviews’.
Inserting rows into `sample_reviews`: 100 rows [00:00, 21625.70 rows/s]
Inserted 100 rows with 0 errors.
More Hugging Face Models
Pixeltable supports many more Hugging Face models, including:
- ASR: automatic_speech_recognition() - transcribe audio
- Translation: translation() - translate between languages
- Text Generation: text_generation() - generate text completions
- Image Classification: vit_for_image_classification() - classify images (see the sketch after this list)
- Object Detection: detr_for_object_detection() - detect objects in images
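For example, one of these models can be applied to every row as a computed column. The following is a minimal sketch, assuming vit_for_image_classification() accepts an image expression plus a model_id keyword (check the SDK reference for the exact signature and return format); the model name used here is an illustrative choice:
from pixeltable.functions.huggingface import vit_for_image_classification
# Sketch: classify every image in the 'images' table with a ViT model
images.add_computed_column(
    classification=vit_for_image_classification(
        images.Image, model_id='google/vit-base-patch16-224'
    )
)
images.select(images.Name, images.classification).limit(3).collect()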
See the SDK reference below for the complete list.
See Also