This documentation page is also available as an interactive notebook. You can launch the notebook in
Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the
above links.
Pixeltable provides seamless integration with Hugging Face datasets and
models. This tutorial covers:
- Importing datasets directly into Pixeltable tables
- Working with dataset splits (train/test/validation)
- Streaming large datasets with IterableDataset
- Type mappings from Hugging Face to Pixeltable
- Using Hugging Face models for embeddings
Setup
%pip install -qU pixeltable datasets torch transformers sentence-transformers
Import a Hugging Face Dataset
Use pxt.create_table() with the source= parameter to import a
Hugging Face dataset directly. Pixeltable automatically maps Hugging
Face feature types to Pixeltable column types.
import pixeltable as pxt
import datasets
pxt.drop_dir('hf_demo', force=True)
pxt.create_dir('hf_demo')
# Load a dataset with images
padoru = (
    datasets.load_dataset("not-lain/padoru", split='train')
    .select_columns(['Image', 'ImageSize', 'Name', 'ImageSource'])
)
# Import into Pixeltable
images = pxt.create_table('hf_demo.images', source=padoru)
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘hf_demo’.
Created table ‘images’.
Inserting rows into `images`: 100 rows [00:00, 310.24 rows/s]
Inserting rows into `images`: 100 rows [00:00, 353.22 rows/s]
Inserting rows into `images`: 100 rows [00:00, 368.40 rows/s]
Inserting rows into `images`: 82 rows [00:00, 567.89 rows/s]
Inserted 382 rows with 0 errors.
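A quick way to confirm the import and see the inferred columns is to read a few rows back out. This is a minimal sketch using the columns selected above:
# Peek at the imported table
images.select(images.Name, images.ImageSize, images.Image).limit(3).collect()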
Working with Dataset Splits
When importing a DatasetDict (which contains multiple splits like
train/test), use extra_args={'column_name_for_split': 'split'} to
preserve split information in a column.
# Load a dataset with multiple splits
imdb = datasets.load_dataset('stanfordnlp/imdb')
# Import all splits, storing split info in 'split' column
reviews = pxt.create_table(
    'hf_demo.reviews',
    source=imdb,
    extra_args={'column_name_for_split': 'split'}
)
# Query by split
reviews.where(reviews.split == 'train').limit(3).select(reviews.text, reviews.label, reviews.split).collect()
# Count rows per split
reviews.group_by(reviews.split).select(reviews.split, count=pxt.functions.count(reviews.text)).collect()
Using schema_overrides for Embeddings
When importing datasets with pre-computed embeddings (common in RAG),
use schema_overrides to specify the exact array shape:
# Wikipedia with pre-computed embeddings - specify array shape
wiki_ds = (
    datasets.load_dataset('Cohere/wikipedia-2023-11-embed-multilingual-v3', 'simple', split='train', streaming=True)
    .select_columns(['url', 'title', 'text', 'emb'])
    .take(50)
)
wiki = pxt.create_table(
    'hf_demo.wiki_embeddings',
    source=wiki_ds,
    schema_overrides={'emb': pxt.Array[(1024,), pxt.Float]}
)
wiki.select(wiki.title, wiki.emb).limit(2).collect()
Streaming Large Datasets
For very large datasets, use streaming=True to filter and sample
before importing:
# Stream, filter, and sample before importing
streaming_ds = datasets.load_dataset('stanfordnlp/imdb', split='train', streaming=True)
positive_stream = streaming_ds.filter(lambda x: x['label'] == 1).take(50)
positive_samples = pxt.create_table('hf_demo.positive_samples', source=positive_stream)
positive_samples.select(positive_samples.text, positive_samples.label).limit(2).collect()
Importing Audio Datasets
Audio datasets work seamlessly - Pixeltable stores audio files locally:
# Import a small audio dataset
audio_ds = datasets.load_dataset(
    'hf-internal-testing/librispeech_asr_dummy', 'clean', split='validation'
)
audio_table = pxt.create_table('hf_demo.audio_samples', source=audio_ds)
audio_table.select(audio_table.audio, audio_table.text).limit(2).collect()
Created table ‘audio_samples’.
Inserting rows into `audio_samples`: 73 rows [00:00, 3960.27 rows/s]
Inserted 73 rows with 0 errors.
Inserting More Data
Use table.insert() to add more data from a Hugging Face dataset to an
existing table:
# Insert more data from the same or similar dataset
more_audio = datasets.load_dataset(
    'hf-internal-testing/librispeech_asr_dummy', 'clean', split='validation'
).select(range(5))
audio_table.insert(more_audio)
audio_table.count()
Inserting rows into `audio_samples`: 5 rows [00:00, 3186.68 rows/s]
Inserted 5 rows with 0 errors.
78
Type Mappings Reference
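Pixeltable infers each column's type from the dataset's Hugging Face feature type when it creates the table. The feature types encountered in this notebook map as follows (not an exhaustive list; see the SDK reference for the complete set):
- Image -> pxt.Image
- Audio -> pxt.Audio
- Value('string') -> pxt.String
- ClassLabel -> pxt.Int (the integer label index)
- Fixed-length float sequences -> pxt.Array (use schema_overrides to pin the exact shape, as shown above)
You can inspect the Hugging Face side of the mapping directly. For example, for the IMDB dataset imported earlier:
# Hugging Face feature types for the IMDB dataset
imdb['train'].features
# e.g. {'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}
# In the resulting 'reviews' table, 'text' is a String column and 'label' an Int column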
Using Hugging Face Models
Pixeltable integrates with Hugging Face models for embeddings and
inference, running locally without API keys.
Image Embeddings with CLIP
from pixeltable.functions.huggingface import clip
# Add CLIP embedding index for cross-modal image search
images.add_embedding_index(
    'Image',
    embedding=clip.using(model_id='openai/clip-vit-base-patch32')
)
# Search images using text
sim = images.Image.similarity(string='anime character with red clothes')
images.order_by(sim, asc=False).limit(3).select(images.Image, images.Name, sim=sim).collect()
Text Embeddings with Sentence Transformers
from pixeltable.functions.huggingface import sentence_transformer
# Create table with text embedding index
sample_reviews = pxt.create_table(
    'hf_demo.sample_reviews',
    source=datasets.load_dataset('stanfordnlp/imdb', split='test').select(range(100))
)
sample_reviews.add_embedding_index('text', string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2'))
# Semantic search
query = "great acting and cinematography"
sim = sample_reviews.text.similarity(string=query)
sample_reviews.order_by(sim, asc=False).limit(3).select(sample_reviews.text, sim=sim).collect()
Created table ‘sample_reviews’.
Inserting rows into `sample_reviews`: 100 rows [00:00, 21625.70 rows/s]
Inserted 100 rows with 0 errors.
More Hugging Face Models
Pixeltable supports many more Hugging Face models, including:
- ASR: automatic_speech_recognition() - transcribe audio
- Translation: translation() - translate between languages
- Text Generation: text_generation() - generate text completions
- Image Classification: vit_for_image_classification() - classify images (see the sketch after this list)
- Object Detection: detr_for_object_detection() - detect objects in images
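For example, one of these models can be applied to every row as a computed column. The following is a minimal sketch, assuming vit_for_image_classification() accepts an image expression plus a model_id keyword (check the SDK reference for the exact signature and return format); the model name used here is an illustrative choice:
from pixeltable.functions.huggingface import vit_for_image_classification
# Sketch: classify every image in the 'images' table with a ViT model
images.add_computed_column(
    classification=vit_for_image_classification(
        images.Image, model_id='google/vit-base-patch16-224'
    )
)
images.select(images.Name, images.classification).limit(3).collect()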
See the SDK reference below for the complete list.
See Also