Load datasets from Hugging Face Hub into Pixeltable tables for processing with AI models.

Problem

You want to use a dataset from the Hugging Face Hub for fine-tuning, evaluation, or analysis. You need to load it into a format where you can add computed columns, embeddings, or AI transformations.

Solution

What’s in this recipe:
  • Import Hugging Face datasets directly into tables
  • Handle datasets with multiple splits (train/test/validation)
  • Work with image datasets
You use pxt.create_table() with a Hugging Face dataset as the source parameter. Pixeltable automatically maps HF types to Pixeltable column types.

Setup

%pip install -qU pixeltable datasets
import pixeltable as pxt
from datasets import load_dataset
# Create a fresh directory
pxt.drop_dir('hf_demo', force=True)
pxt.create_dir('hf_demo')
Created directory `hf_demo`.
<pixeltable.catalog.dir.Dir at 0x31e39d8d0>

Import a single split

Load a specific split from a dataset:
# Load a small subset for demo (first 100 rows of rotten_tomatoes)
hf_dataset = load_dataset('rotten_tomatoes', split='train[:100]')
# Import into Pixeltable
reviews = pxt.create_table(
    'hf_demo.reviews',
    source=hf_dataset
)
Created table `reviews`.
Inserting rows into `reviews`: 100 rows [00:00, 14781.69 rows/s]
Inserted 100 rows with 0 errors.
# View imported data
reviews.head(5)

Import multiple splits

Load a DatasetDict with multiple splits and import each split into its own table:
# Load dataset with multiple splits (small subset for demo)
hf_dataset_dict = load_dataset(
    'rotten_tomatoes',
    split={'train': 'train[:50]', 'test': 'test[:50]'}
)
# Import each split separately for clarity
train_data = pxt.create_table(
    'hf_demo.reviews_train',
    source=hf_dataset_dict['train']
)
test_data = pxt.create_table(
    'hf_demo.reviews_test',
    source=hf_dataset_dict['test']
)
Created table `reviews_train`.
Inserting rows into `reviews_train`: 50 rows [00:00, 10150.29 rows/s]
Inserted 50 rows with 0 errors.
Created table `reviews_test`.
Inserting rows into `reviews_test`: 50 rows [00:00, 9883.37 rows/s]
Inserted 50 rows with 0 errors.
# View training data
train_data.head(5)
# View test data
test_data.head(3)
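
Work with image datasets

Image features import as Pixeltable image columns, so the same create_table() call works for vision datasets. A minimal sketch, assuming the 'beans' image-classification dataset on the Hub; its 'image' and 'labels' column names are assumptions about that dataset:
# Load a small slice of an image dataset (illustrative choice of dataset)
hf_images = load_dataset('beans', split='train[:10]')
# Import into Pixeltable; the image feature becomes an image column
leaves = pxt.create_table('hf_demo.leaves', source=hf_images)
# Preview images alongside their labels
leaves.select(leaves.image, leaves.labels).head(3)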

Add AI-powered computed columns

Enrich the dataset with computed columns, from simple transformations to AI model outputs. Start with a simple transformation:
# Add a computed column for text length
reviews.add_computed_column(text_length=reviews.text.apply(len, col_type=pxt.Int))
Added 100 column values with 0 errors.
100 rows updated, 200 values computed.
# View with computed column
reviews.select(reviews.text, reviews.label, reviews.text_length).head(5)
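For an AI-powered column, you can compute embeddings the same way. A sketch using Pixeltable's sentence-transformers integration (assumes the sentence-transformers package is installed; the model choice is illustrative):
from pixeltable.functions.huggingface import sentence_transformer
# Embed each review with a local sentence-transformers model
reviews.add_computed_column(
    embedding=sentence_transformer(reviews.text, model_id='all-MiniLM-L6-v2')
)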

Type mapping

Pixeltable automatically maps Hugging Face feature types to Pixeltable column types: scalar values (strings, integers, floats, booleans) become the corresponding Pixeltable scalar types, and image features become image columns.
Use schema_overrides to customize the type mapping when needed.
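For example, to store the integer label as a float instead of the inferred type (a sketch, assuming schema_overrides takes a mapping from column name to Pixeltable type and that the override is compatible with the source values):
# Override the inferred type for one column (hypothetical widening to Float)
reviews_v2 = pxt.create_table(
    'hf_demo.reviews_v2',
    source=hf_dataset,
    schema_overrides={'label': pxt.Float}
)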

Explanation

Why import Hugging Face datasets into Pixeltable:
  1. Add computed columns - Enrich data with embeddings, AI analysis, or transformations
  2. Incremental processing - Add new rows without reprocessing existing data (sketched below)
  3. Persistent storage - Keep processed results across sessions
  4. Query capabilities - Filter, aggregate, and join with other tables (also sketched below)
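For example, inserting new rows evaluates computed columns such as text_length only for those rows, and standard query operations work on the result (a sketch; the sample review text is made up):
# Insert a new row; computed columns run only for this row
reviews.insert([{'text': 'A charming, if uneven, debut.', 'label': 1}])
# Filter and count positive reviews
reviews.where(reviews.label == 1).count()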
Working with large datasets: For very large datasets, consider loading in batches or using streaming mode in the datasets library before importing.
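A sketch of the streaming approach, materializing a bounded slice before import (itertools.islice and Dataset.from_list do the slicing; the 1,000-row cutoff is illustrative):
import itertools
from datasets import Dataset
# Stream rows instead of downloading the full dataset
stream = load_dataset('rotten_tomatoes', split='train', streaming=True)
# Materialize only the first 1,000 rows
subset = Dataset.from_list(list(itertools.islice(stream, 1000)))
big_reviews = pxt.create_table('hf_demo.reviews_big', source=subset)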
