Load datasets from Hugging Face Hub into Pixeltable tables for
processing with AI models.
Problem
You want to use a dataset from Hugging Face Hub—for fine-tuning,
evaluation, or analysis. You need to load it into a format where you can
add computed columns, embeddings, or AI transformations.
Solution
What’s in this recipe:
- Import Hugging Face datasets directly into tables
- Handle datasets with multiple splits (train/test/validation)
- Work with image datasets
You use pxt.create_table() with a Hugging Face dataset as the source
parameter. Pixeltable automatically maps HF types to Pixeltable column
types.
Setup
%pip install -qU pixeltable datasets
import pixeltable as pxt
from datasets import load_dataset
# Create a fresh directory
pxt.drop_dir('hf_demo', force=True)
pxt.create_dir('hf_demo')
Created directory 'hf_demo'.
<pixeltable.catalog.dir.Dir at 0x31e39d8d0>
Import a single split
Load a specific split from a dataset:
# Load a small subset for demo (first 100 rows of rotten_tomatoes)
hf_dataset = load_dataset('rotten_tomatoes', split='train[:100]')
# Import into Pixeltable
reviews = pxt.create_table(
    'hf_demo.reviews',
    source=hf_dataset
)
Created table 'reviews'.
Inserting rows into `reviews`: 100 rows [00:00, 14781.69 rows/s]
Inserted 100 rows with 0 errors.
# View imported data
reviews.head(5)
Import multiple splits
Load a DatasetDict with multiple splits and track which split each row
came from:
# Load dataset with multiple splits (small subset for demo)
hf_dataset_dict = load_dataset(
    'rotten_tomatoes',
    split={'train': 'train[:50]', 'test': 'test[:50]'}
)
# Import each split separately for clarity
train_data = pxt.create_table(
    'hf_demo.reviews_train',
    source=hf_dataset_dict['train']
)
test_data = pxt.create_table(
    'hf_demo.reviews_test',
    source=hf_dataset_dict['test']
)
Created table 'reviews_train'.
Inserting rows into `reviews_train`: 50 rows [00:00, 10150.29 rows/s]
Inserted 50 rows with 0 errors.
Created table 'reviews_test'.
Inserting rows into `reviews_test`: 50 rows [00:00, 9883.37 rows/s]
Inserted 50 rows with 0 errors.
# View training data
train_data.head(5)
# View test data
test_data.head(3)
Add AI-powered computed columns
Enrich the dataset with computed columns. The example below uses a plain Python function for simplicity; the same mechanism works for AI model calls:
# Add a computed column for text length
reviews.add_computed_column(text_length=reviews.text.apply(len, col_type=pxt.Int))
Added 100 column values with 0 errors.
100 rows updated, 200 values computed.
# View with computed column
reviews.select(reviews.text, reviews.label, reviews.text_length).head(5)
Type mapping
Pixeltable automatically maps Hugging Face feature types to Pixeltable column types. Use the schema_overrides parameter to customize the mapping when needed.
Explanation
Why import Hugging Face datasets into Pixeltable:
- Add computed columns - Enrich data with embeddings, AI analysis,
or transformations
- Incremental processing - Add new rows without reprocessing
existing data
- Persistent storage - Keep processed results across sessions
- Query capabilities - Filter, aggregate, and join with other
tables
Working with large datasets:
For very large datasets, consider loading in batches or using streaming
mode in the datasets library before importing.
See also