Open in Kaggle  Open in Colab  Download Notebook
This documentation page is also available as an interactive notebook. You can launch the notebook in Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the above links.
Pixeltable provides seamless integration with Hugging Face datasets and models. This tutorial covers:
  • Importing datasets directly into Pixeltable tables
  • Working with dataset splits (train/test/validation)
  • Streaming large datasets with IterableDataset
  • Type mappings from Hugging Face to Pixeltable
  • Using Hugging Face models for embeddings

Setup

%pip install -qU pixeltable datasets torch transformers sentence-transformers

Import a Hugging Face Dataset

Use pxt.create_table() with the source= parameter to import a Hugging Face dataset directly. Pixeltable automatically maps Hugging Face feature types to Pixeltable column types.
import pixeltable as pxt
import datasets

pxt.drop_dir('hf_demo', force=True)
pxt.create_dir('hf_demo')

# Load a dataset with images
padoru = (
    datasets.load_dataset("not-lain/padoru", split='train')
    .select_columns(['Image', 'ImageSize', 'Name', 'ImageSource'])
)

# Import into Pixeltable
images = pxt.create_table('hf_demo.images', source=padoru)
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory 'hf_demo'.
Created table 'images'.
Inserted 382 rows with 0 errors.
images.head(3)
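
The imported columns keep their source names and can be queried immediately. As a quick sanity check (a sketch, assuming ImageSize was inferred as an integer column):

# Largest images first; ImageSize was mapped to a Pixeltable Int on import
images.order_by(images.ImageSize, asc=False).select(images.Name, images.ImageSize).limit(5).collect()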

Working with Dataset Splits

When importing a DatasetDict (which contains multiple splits like train/test), use extra_args={'column_name_for_split': 'split'} to preserve split information in a column.
# Load a dataset with multiple splits
imdb = datasets.load_dataset('stanfordnlp/imdb')

# Import all splits, storing split info in 'split' column
reviews = pxt.create_table(
    'hf_demo.reviews',
    source=imdb,
    extra_args={'column_name_for_split': 'split'}
)
# Query by split
reviews.where(reviews.split == 'train').limit(3).select(reviews.text, reviews.label, reviews.split).collect()

# Count rows per split
reviews.group_by(reviews.split).select(reviews.split, count=pxt.functions.count(reviews.text)).collect()
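
If you mostly work with a single split, you can also materialize it as a view (a sketch; it assumes pxt.create_view() accepts a filtered table, and the view name is hypothetical):

# Create a view over the training split only
train_reviews = pxt.create_view('hf_demo.train_reviews', reviews.where(reviews.split == 'train'))
train_reviews.count()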

Using schema_overrides for Embeddings

When importing datasets with pre-computed embeddings (common in RAG), use schema_overrides to specify the exact array shape:
# Wikipedia with pre-computed embeddings - specify array shape
wiki_ds = (
    datasets.load_dataset('Cohere/wikipedia-2023-11-embed-multilingual-v3', 'simple', split='train', streaming=True)
    .select_columns(['url', 'title', 'text', 'emb'])
    .take(50)
)

wiki = pxt.create_table(
    'hf_demo.wiki_embeddings',
    source=wiki_ds,
    schema_overrides={'emb': pxt.Array[(1024,), pxt.Float]}
)
wiki.select(wiki.title, wiki.emb).limit(2).collect()
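
The stored embeddings come back as NumPy arrays, so you can rank rows against a query vector client-side. A minimal sketch (it assumes the result set converts with to_pandas(), and reuses a stored row as a stand-in query vector; in practice you would embed the query with the same Cohere model):

import numpy as np

rows = wiki.select(wiki.title, wiki.emb).collect().to_pandas()
embs = np.stack(rows['emb'].to_numpy())
query_vec = embs[0]  # stand-in; embed a real query with the same model instead

# Cosine similarity of every stored embedding against the query vector
sims = embs @ query_vec / (np.linalg.norm(embs, axis=1) * np.linalg.norm(query_vec))
print(rows['title'].iloc[np.argsort(sims)[::-1][:3]].tolist())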

Streaming Large Datasets

For very large datasets, use streaming=True to filter and sample before importing:
# Stream, filter, and sample before importing
streaming_ds = datasets.load_dataset('stanfordnlp/imdb', split='train', streaming=True)
positive_stream = streaming_ds.filter(lambda x: x['label'] == 1).take(50)
positive_samples = pxt.create_table('hf_demo.positive_samples', source=positive_stream)
positive_samples.select(positive_samples.text, positive_samples.label).limit(2).collect()
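
Streaming datasets support further lazy transformations before anything is downloaded in full. For example, shuffling with a bounded buffer (standard datasets streaming API; the table name is illustrative):

# Shuffle the stream with a fixed-size buffer, then take a small sample
shuffled = (
    datasets.load_dataset('stanfordnlp/imdb', split='train', streaming=True)
    .shuffle(seed=42, buffer_size=1_000)
    .take(20)
)
shuffled_samples = pxt.create_table('hf_demo.shuffled_samples', source=shuffled)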

Importing Audio Datasets

Audio datasets work seamlessly; Pixeltable stores the audio files locally:
# Import a small audio dataset
audio_ds = datasets.load_dataset(
    'hf-internal-testing/librispeech_asr_dummy', 'clean', split='validation'
)

audio_table = pxt.create_table('hf_demo.audio_samples', source=audio_ds)
audio_table.select(audio_table.audio, audio_table.text).limit(2).collect()
Created table 'audio_samples'.
Inserting rows into `audio_samples`: 73 rows [00:00, 3960.27 rows/s]
Inserted 73 rows with 0 errors.
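
A natural next step is transcribing the audio in place with one of the Hugging Face functions covered later in this tutorial. A sketch (the exact signature of automatic_speech_recognition() is assumed here, and 'openai/whisper-tiny' is just an illustrative checkpoint):

from pixeltable.functions.huggingface import automatic_speech_recognition

# Computed column that transcribes each audio file (signature assumed)
audio_table.add_computed_column(
    transcript=automatic_speech_recognition(audio_table.audio, model_id='openai/whisper-tiny')
)
audio_table.select(audio_table.text, audio_table.transcript).limit(2).collect()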

Inserting More Data

Use table.insert() to add more data from a Hugging Face dataset to an existing table:
# Insert more data from the same or similar dataset
more_audio = datasets.load_dataset(
    'hf-internal-testing/librispeech_asr_dummy', 'clean', split='validation'
).select(range(5))

audio_table.insert(more_audio)
audio_table.count()
Inserting rows into `audio_samples`: 5 rows [00:00, 3186.68 rows/s]
Inserted 5 rows with 0 errors.
78

Type Mappings Reference
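
Pixeltable infers a column type for each Hugging Face feature type (strings, integers, floats, booleans, images, audio, and nested features, as seen in the imports above). One way to see the mapping in practice is to import a tiny synthetic dataset and inspect the resulting schema (a sketch; it assumes describe() prints the table's column types):

# Tiny in-memory dataset covering several common feature types
tiny = datasets.Dataset.from_dict({
    'text': ['a', 'b'],      # Value('string')
    'count': [1, 2],         # Value('int64')
    'score': [0.5, 1.5],     # Value('float64')
    'flag': [True, False],   # Value('bool')
})

type_demo = pxt.create_table('hf_demo.type_demo', source=tiny)
type_demo.describe()  # assumed to display the inferred Pixeltable column types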

Using Hugging Face Models

Pixeltable integrates with Hugging Face models for embeddings and inference, running locally without API keys.

Image Embeddings with CLIP

from pixeltable.functions.huggingface import clip

# Add CLIP embedding index for cross-modal image search
images.add_embedding_index(
    'Image',
    embedding=clip.using(model_id='openai/clip-vit-base-patch32')
)

# Search images using text
sim = images.Image.similarity(string='anime character with red clothes')
images.order_by(sim, asc=False).limit(3).select(images.Image, images.Name, sim=sim).collect()

Text Embeddings with Sentence Transformers

from pixeltable.functions.huggingface import sentence_transformer

# Create table with text embedding index
sample_reviews = pxt.create_table(
    'hf_demo.sample_reviews',
    source=datasets.load_dataset('stanfordnlp/imdb', split='test').select(range(100))
)
sample_reviews.add_embedding_index('text', string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2'))

# Semantic search
query = "great acting and cinematography"
sim = sample_reviews.text.similarity(string=query)
sample_reviews.order_by(sim, asc=False).limit(3).select(sample_reviews.text, sim=sim).collect()
Created table 'sample_reviews'.
Inserting rows into `sample_reviews`: 100 rows [00:00, 21625.70 rows/s]
Inserted 100 rows with 0 errors.

More Hugging Face Models

Pixeltable supports many more Hugging Face models, including:
  • ASR: automatic_speech_recognition() - transcribe audio
  • Translation: translation() - translate between languages
  • Text Generation: text_generation() - generate text completions
  • Image Classification: vit_for_image_classification() - classify images
  • Object Detection: detr_for_object_detection() - detect objects in images
See the SDK reference for the complete list.
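
These follow the same computed-column pattern as the ASR sketch earlier. For instance, classifying the padoru images (a sketch; the model_id parameter name and the google/vit-base-patch16-224 checkpoint are assumptions):

from pixeltable.functions.huggingface import vit_for_image_classification

# Computed column that classifies each image with a ViT checkpoint (signature assumed)
images.add_computed_column(
    classification=vit_for_image_classification(images.Image, model_id='google/vit-base-patch16-224')
)
images.select(images.Name, images.classification).limit(3).collect()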
