| Dataset | Size | Use case |
|---|---|---|
| imdb | 50K reviews | Sentiment analysis |
| squad | 100K Q&A | RAG evaluation |
| coco | 330K images | Vision model training |

| text | label |
|---|---|
| the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . | pos |
| the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . | pos |
| effective but too-tepid biopic | pos |
| if you sometimes like to go to the movies to have fun , wasabi is a good place to start . | pos |
| emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . | pos |

| text | label |
|---|---|
| lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness . | pos |
| consistently clever and suspenseful . | pos |
| it's like a " big chill " reunion of the baader-meinhof gang , only these guys are more harmless pranksters than political activists . | pos |

| text | label | text_length |
|---|---|---|
| the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . | pos | 177 |
| the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . | pos | 226 |
| effective but too-tepid biopic | pos | 30 |
| if you sometimes like to go to the movies to have fun , wasabi is a good place to start . | pos | 89 |
| emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . | pos | 111 |

| Hugging Face Type | Pixeltable Type |
|---|---|
| `Value('string')` | `pxt.String` |
| `Value('int64')` | `pxt.Int` |
| `Value('float32')` | `pxt.Float` |
| `ClassLabel` | `pxt.String` |
| `Image` | `pxt.Image` |
| `Sequence` | `pxt.Array` or `pxt.Json` |

## Solution

**What’s in this recipe:**

* Import Hugging Face datasets directly into tables
* Handle datasets with multiple splits (train/test/validation)
* Work with image datasets

You use `pxt.create_table()` with a Hugging Face dataset as the `source` parameter. Pixeltable automatically maps HF types to Pixeltable column types.

### Setup

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
%pip install -qU pixeltable datasets
```

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
import pixeltable as pxt
from datasets import load_dataset

# Create a fresh directory
pxt.drop_dir('hf_demo', force=True)
pxt.create_dir('hf_demo')
```

  Created directory 'hf\_demo'.

### Import a single split

Load a specific split from a dataset:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Load a small subset for demo (first 100 rows of rotten_tomatoes)
hf_dataset = load_dataset(
    'cornell-movie-review-data/rotten_tomatoes', split='train[:100]'
)
```

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Import into Pixeltable
reviews = pxt.create_table('hf_demo/reviews', source=hf_dataset)
```

  Created table 'reviews'.
  Inserting rows into \`reviews\`: 100 rows \[00:00, 14781.69 rows/s]
  Inserted 100 rows with 0 errors.

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# View imported data
reviews.head(5)
```

### Import multiple splits

Load a `DatasetDict` with multiple splits and import each split into its own table:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Load dataset with multiple splits (small subset for demo)
hf_dataset_dict = load_dataset(
    'cornell-movie-review-data/rotten_tomatoes',
    split={'train': 'train[:50]', 'test': 'test[:50]'},
)
```

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Import each split separately for clarity
train_data = pxt.create_table(
    'hf_demo/reviews_train', source=hf_dataset_dict['train']
)
test_data = pxt.create_table(
    'hf_demo/reviews_test', source=hf_dataset_dict['test']
)
```

  Created table 'reviews\_train'.
  Inserting rows into \`reviews\_train\`: 50 rows \[00:00, 10150.29 rows/s]
  Inserted 50 rows with 0 errors.
  Created table 'reviews\_test'.
  Inserting rows into \`reviews\_test\`: 50 rows \[00:00, 9883.37 rows/s]
  Inserted 50 rows with 0 errors.

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# View training data
train_data.head(5)
```

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# View test data
test_data.head(3)
```
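The imports above keep each split in its own table. If you would rather have a single table that records which split each row came from, one option is to tag every row with its split name before importing. The sketch below uses plain dictionaries as a stand-in for a `DatasetDict`; the helper `tag_rows_with_split` and the `split` field name are hypothetical and not part of Pixeltable or `datasets`.

```python
# Hypothetical stand-in for a DatasetDict: split name -> list of row dicts.
# The sample texts are copied from the preview outputs in this recipe.
splits = {
    'train': [{'text': 'effective but too-tepid biopic', 'label': 'pos'}],
    'test': [{'text': 'consistently clever and suspenseful .', 'label': 'pos'}],
}


def tag_rows_with_split(splits):
    """Flatten {split: rows} into one list, adding a 'split' field to each row."""
    tagged = []
    for split_name, rows in splits.items():
        for row in rows:
            # Copy the row so the original data is left untouched.
            tagged.append({**row, 'split': split_name})
    return tagged


combined = tag_rows_with_split(splits)
for row in combined:
    print(row['split'], '|', row['text'])
```

The tagged rows could then be loaded into a single table, where the `split` column can be used for filtering.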

### Add AI-powered computed columns

Enrich the dataset with computed columns. This example derives a simple text-length column; the same mechanism applies to model-based columns such as embeddings:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Add a computed column for text length
reviews.add_computed_column(
    text_length=reviews.text.apply(len, col_type=pxt.Int)
)
```

  Added 100 column values with 0 errors.
  100 rows updated, 200 values computed.

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# View with computed column
reviews.select(reviews.text, reviews.label, reviews.text_length).head(5)
```
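To make the computed column concrete, here is a minimal plain-Python sketch of the per-row calculation: `len` applied to each row's `text`. The helper `with_text_length` is hypothetical and only illustrates the value that ends up in `text_length`; it does not show how Pixeltable evaluates the column internally.

```python
# Sample rows copied from the preview outputs in this recipe.
rows = [
    {'text': 'effective but too-tepid biopic', 'label': 'pos'},
    {'text': 'consistently clever and suspenseful .', 'label': 'pos'},
]


def with_text_length(row):
    """Return a copy of the row with a text_length field set to len(text)."""
    return {**row, 'text_length': len(row['text'])}


enriched = [with_text_length(row) for row in rows]
print(enriched[0]['text_length'])  # 30, matching the value shown for this review
```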

### Type mapping

Pixeltable automatically maps Hugging Face types to Pixeltable types:

Use `schema_overrides` to customize type mapping when needed.

## Explanation

**Why import Hugging Face datasets into Pixeltable:**

1. **Add computed columns** - Enrich data with embeddings, AI analysis, or transformations
2. **Incremental processing** - Add new rows without reprocessing existing data
3. **Persistent storage** - Keep processed results across sessions
4. **Query capabilities** - Filter, aggregate, and join with other tables

**Working with large datasets:**

For very large datasets, consider loading in batches or using streaming mode in the `datasets` library before importing.

## See also

* [Import CSV files](/howto/cookbooks/data/data-import-csv) - For CSV and Excel imports
* [Semantic text search](/howto/cookbooks/search/search-semantic-text) - Add embeddings to text data
* [Hugging Face integration notebook](/howto/providers/working-with-hugging-face) - Full integration guide
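
A note on the batching advice under "Working with large datasets": the sketch below shows one way to group a row stream into fixed-size batches. It uses only the standard library, and a generator stands in for the dataset. The helper `iter_batches` is hypothetical; with the `datasets` library the stream could come from `load_dataset(..., streaming=True)`, and passing each batch to a table's `insert()` is an assumption about how you would wire it up rather than something this recipe demonstrates.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of up to batch_size rows from any iterable."""
    if batch_size <= 0:
        raise ValueError('batch_size must be positive')
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a streamed dataset that yields one row dict at a time.
stream = ({'text': f'review {i}', 'label': 'pos'} for i in range(250))

# Each batch would be handed to the import step instead of being measured.
batch_sizes = [len(batch) for batch in iter_batches(stream, 100)]
print(batch_sizes)  # [100, 100, 50]
```

Because the helper only pulls `batch_size` rows at a time, the full dataset never has to be held in memory.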