# Import data from Hugging Face datasets

> Import Hugging Face datasets directly into Pixeltable tables for vision, text, and multimodal ML training and evaluation workflows.

<a href="https://kaggle.com/kernels/welcome?src=https://github.com/pixeltable/pixeltable/blob/release/docs/release/howto/cookbooks/data/data-import-huggingface.ipynb" id="openKaggle" target="_blank" rel="noopener noreferrer"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" alt="Open in Kaggle" style={{ display: 'inline', margin: '0px' }} noZoom /></a>  <a href="https://colab.research.google.com/github/pixeltable/pixeltable/blob/release/docs/release/howto/cookbooks/data/data-import-huggingface.ipynb" id="openColab" target="_blank" rel="noopener noreferrer"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" style={{ display: 'inline', margin: '0px' }} noZoom /></a>  <a href="https://raw.githubusercontent.com/pixeltable/pixeltable/refs/tags/release/docs/release/howto/cookbooks/data/data-import-huggingface.ipynb" id="downloadNotebook" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/%E2%AC%87-Download%20Notebook-blue" alt="Download Notebook" style={{ display: 'inline', margin: '0px' }} noZoom /></a>

<Tip>This documentation page is also available as an interactive notebook. You can launch the notebook in
Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the
above links.</Tip>

export const quartoRawHtml = [`
<table>
<thead>
<tr>
<th>Dataset</th>
<th>Size</th>
<th>Use case</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">imdb</td>
<td style="vertical-align: middle;">50K reviews</td>
<td style="vertical-align: middle;">Sentiment analysis</td>
</tr>
<tr>
<td style="vertical-align: middle;">squad</td>
<td style="vertical-align: middle;">100K Q&amp;A</td>
<td style="vertical-align: middle;">RAG evaluation</td>
</tr>
<tr>
<td style="vertical-align: middle;">coco</td>
<td style="vertical-align: middle;">330K images</td>
<td style="vertical-align: middle;">Vision model training</td>
</tr>
</tbody>
</table>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">text</th>
<th data-quarto-table-cell-role="th">label</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">the rock is destined to be the 21st century's new " conan " and that
he's going to make a splash even greater than arnold schwarzenegger ,
jean-claud van damme or steven segal .</td>
<td style="vertical-align: middle;">pos</td>
</tr>
<tr>
<td style="vertical-align: middle;">the gorgeously elaborate continuation of " the lord of the rings "
trilogy is so huge that a column of words cannot adequately describe
co-writer/director peter jackson's expanded vision of j . r . r .
tolkien's middle-earth .</td>
<td style="vertical-align: middle;">pos</td>
</tr>
<tr>
<td style="vertical-align: middle;">effective but too-tepid biopic</td>
<td style="vertical-align: middle;">pos</td>
</tr>
<tr>
<td style="vertical-align: middle;">if you sometimes like to go to the movies to have fun , wasabi is a
good place to start .</td>
<td style="vertical-align: middle;">pos</td>
</tr>
<tr>
<td style="vertical-align: middle;">emerges as something rare , an issue movie that's so honest and
keenly observed that it doesn't feel like one .</td>
<td style="vertical-align: middle;">pos</td>
</tr>
</tbody>
</table>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">text</th>
<th data-quarto-table-cell-role="th">label</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">the rock is destined to be the 21st century's new " conan " and that
he's going to make a splash even greater than arnold schwarzenegger ,
jean-claud van damme or steven segal .</td>
<td style="vertical-align: middle;">pos</td>
</tr>
<tr>
<td style="vertical-align: middle;">the gorgeously elaborate continuation of " the lord of the rings "
trilogy is so huge that a column of words cannot adequately describe
co-writer/director peter jackson's expanded vision of j . r . r .
tolkien's middle-earth .</td>
<td style="vertical-align: middle;">pos</td>
</tr>
<tr>
<td style="vertical-align: middle;">effective but too-tepid biopic</td>
<td style="vertical-align: middle;">pos</td>
</tr>
<tr>
<td style="vertical-align: middle;">if you sometimes like to go to the movies to have fun , wasabi is a
good place to start .</td>
<td style="vertical-align: middle;">pos</td>
</tr>
<tr>
<td style="vertical-align: middle;">emerges as something rare , an issue movie that's so honest and
keenly observed that it doesn't feel like one .</td>
<td style="vertical-align: middle;">pos</td>
</tr>
</tbody>
</table>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">text</th>
<th data-quarto-table-cell-role="th">label</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">lovingly photographed in the manner of a golden book sprung to life
, stuart little 2 manages sweetness largely without stickiness .</td>
<td style="vertical-align: middle;">pos</td>
</tr>
<tr>
<td style="vertical-align: middle;">consistently clever and suspenseful .</td>
<td style="vertical-align: middle;">pos</td>
</tr>
<tr>
<td style="vertical-align: middle;">it's like a " big chill " reunion of the baader-meinhof gang , only
these guys are more harmless pranksters than political activists .</td>
<td style="vertical-align: middle;">pos</td>
</tr>
</tbody>
</table>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">text</th>
<th data-quarto-table-cell-role="th">label</th>
<th data-quarto-table-cell-role="th">text_length</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">the rock is destined to be the 21st century's new " conan " and that
he's going to make a splash even greater than arnold schwarzenegger ,
jean-claud van damme or steven segal .</td>
<td style="vertical-align: middle;">pos</td>
<td style="vertical-align: middle;">177</td>
</tr>
<tr>
<td style="vertical-align: middle;">the gorgeously elaborate continuation of " the lord of the rings "
trilogy is so huge that a column of words cannot adequately describe
co-writer/director peter jackson's expanded vision of j . r . r .
tolkien's middle-earth .</td>
<td style="vertical-align: middle;">pos</td>
<td style="vertical-align: middle;">226</td>
</tr>
<tr>
<td style="vertical-align: middle;">effective but too-tepid biopic</td>
<td style="vertical-align: middle;">pos</td>
<td style="vertical-align: middle;">30</td>
</tr>
<tr>
<td style="vertical-align: middle;">if you sometimes like to go to the movies to have fun , wasabi is a
good place to start .</td>
<td style="vertical-align: middle;">pos</td>
<td style="vertical-align: middle;">89</td>
</tr>
<tr>
<td style="vertical-align: middle;">emerges as something rare , an issue movie that's so honest and
keenly observed that it doesn't feel like one .</td>
<td style="vertical-align: middle;">pos</td>
<td style="vertical-align: middle;">111</td>
</tr>
</tbody>
</table>
`, `
<table>
<thead>
<tr>
<th>Hugging Face Type</th>
<th>Pixeltable Type</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;"><code>Value('string')</code></td>
<td style="vertical-align: middle;"><code>pxt.String</code></td>
</tr>
<tr>
<td style="vertical-align: middle;"><code>Value('int64')</code></td>
<td style="vertical-align: middle;"><code>pxt.Int</code></td>
</tr>
<tr>
<td style="vertical-align: middle;"><code>Value('float32')</code></td>
<td style="vertical-align: middle;"><code>pxt.Float</code></td>
</tr>
<tr>
<td style="vertical-align: middle;"><code>ClassLabel</code></td>
<td style="vertical-align: middle;"><code>pxt.String</code></td>
</tr>
<tr>
<td style="vertical-align: middle;"><code>Image</code></td>
<td style="vertical-align: middle;"><code>pxt.Image</code></td>
</tr>
<tr>
<td style="vertical-align: middle;"><code>Sequence</code></td>
<td style="vertical-align: middle;"><code>pxt.Array</code> or <code>pxt.Json</code></td>
</tr>
</tbody>
</table>
`];

Load datasets from Hugging Face Hub into Pixeltable tables for
processing with AI models.

## Problem

You want to use a dataset from the Hugging Face Hub for fine-tuning,
evaluation, or analysis, and you need it in a format where you can
add computed columns, embeddings, or AI transformations. Popular
examples include:

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[0] }} />

## Solution

**What’s in this recipe:**

* Import Hugging Face datasets directly into tables
* Handle datasets with multiple splits (train/test/validation)
* Add computed columns to imported data
* See how Hugging Face types (including images) map to Pixeltable types

Pass a Hugging Face dataset as the `source` argument of
`pxt.create_table()`. Pixeltable infers the table schema by mapping
Hugging Face feature types to Pixeltable column types.

### Setup

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
%pip install -qU pixeltable datasets
```

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
import pixeltable as pxt
from datasets import load_dataset

# Create a fresh directory
pxt.drop_dir('hf_demo', force=True)
pxt.create_dir('hf_demo')
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Created directory 'hf\_demo'.
  \<pixeltable.catalog.dir.Dir at 0x31e39d8d0>
</pre>

### Import a single split

Load a specific split from a dataset:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Load a small subset for demo (first 100 rows of rotten_tomatoes)
hf_dataset = load_dataset(
    'cornell-movie-review-data/rotten_tomatoes', split='train[:100]'
)
```

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Import into Pixeltable
reviews = pxt.create_table('hf_demo/reviews', source=hf_dataset)
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Created table 'reviews'.
  Inserting rows into \`reviews\`: 100 rows \[00:00, 14781.69 rows/s]
  Inserted 100 rows with 0 errors.
</pre>

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# View imported data
reviews.head(5)
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[1] }} />

### Import multiple splits

Load a `DatasetDict` with multiple splits and import each split into its
own table, so the table name records which split a row came from:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Load dataset with multiple splits (small subset for demo)
hf_dataset_dict = load_dataset(
    'cornell-movie-review-data/rotten_tomatoes',
    split={'train': 'train[:50]', 'test': 'test[:50]'},
)
```

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Import each split separately for clarity
train_data = pxt.create_table(
    'hf_demo/reviews_train', source=hf_dataset_dict['train']
)
test_data = pxt.create_table(
    'hf_demo/reviews_test', source=hf_dataset_dict['test']
)
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Created table 'reviews\_train'.
  Inserting rows into \`reviews\_train\`: 50 rows \[00:00, 10150.29 rows/s]
  Inserted 50 rows with 0 errors.
  Created table 'reviews\_test'.
  Inserting rows into \`reviews\_test\`: 50 rows \[00:00, 9883.37 rows/s]
  Inserted 50 rows with 0 errors.
</pre>

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# View training data
train_data.head(5)
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[2] }} />

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# View test data
test_data.head(3)
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[3] }} />
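If you would rather keep all splits in a single table, one option is to tag each row with its split name before inserting it. The sketch below is not part of the original notebook: the helper names and the `split` column name are hypothetical, and it assumes the two `rotten_tomatoes` columns used above. Note that iterating a `datasets.Dataset` directly yields the `label` as an integer class id, so the schema declares it as `pxt.Int` rather than the string labels shown in the tables above.

```python
from typing import Any, Iterable, Iterator


def tag_with_split(
    rows: Iterable[dict[str, Any]], split_name: str
) -> Iterator[dict[str, Any]]:
    """Yield a copy of each row with an extra 'split' key."""
    for row in rows:
        yield {**row, 'split': split_name}


def import_splits_into_one_table(hf_dataset_dict, table_path: str):
    """Combine every split of a DatasetDict in one Pixeltable table."""
    # Imported here so tag_with_split() stays usable without Pixeltable.
    import pixeltable as pxt

    # Explicit schema: the two dataset columns plus the split marker.
    table = pxt.create_table(
        table_path,
        {'text': pxt.String, 'label': pxt.Int, 'split': pxt.String},
    )
    for split_name, split_data in hf_dataset_dict.items():
        # Iterating a datasets.Dataset yields one dict per row.
        table.insert(list(tag_with_split(split_data, split_name)))
    return table
```

Calling `import_splits_into_one_table(hf_dataset_dict, 'hf_demo/reviews_all')` (the table name is a placeholder) would then let you filter on the `split` column instead of switching between tables.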

### Add AI-powered computed columns

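Computed columns can hold the output of AI models, embeddings, or
ordinary Python functions. This minimal example uses a plain function
(`len`) to show the mechanism: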

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Add a computed column for text length
reviews.add_computed_column(
    text_length=reviews.text.apply(len, col_type=pxt.Int)
)
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Added 100 column values with 0 errors.
  100 rows updated, 200 values computed.
</pre>

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# View with computed column
reviews.select(reviews.text, reviews.label, reviews.text_length).head(5)
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[4] }} />
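The same `.apply(..., col_type=...)` pattern should extend to your own Python functions. As a hedged sketch, the hypothetical `count_words` helper and `word_count` column below are illustrations rather than part of the original notebook; the function is kept separate from the Pixeltable call so you can check it on plain strings first.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens in a review."""
    return len(text.split())


def add_word_count_column(table) -> None:
    """Add a 'word_count' computed column to a table that has a 'text' column."""
    # Imported here so count_words() stays usable without Pixeltable.
    import pixeltable as pxt

    # Same pattern as the text_length column, with a custom function.
    table.add_computed_column(
        word_count=table.text.apply(count_words, col_type=pxt.Int)
    )
```

With the table from this recipe, the call would be `add_word_count_column(reviews)`.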

### Type mapping

Pixeltable automatically maps Hugging Face types to Pixeltable types:

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[5] }} />

Use the `schema_overrides` parameter of `pxt.create_table()` to
customize the type mapping when needed.

## Explanation

**Why import Hugging Face datasets into Pixeltable:**

1. **Add computed columns** - Enrich data with embeddings, AI analysis,
   or transformations
2. **Incremental processing** - Add new rows without reprocessing
   existing data
3. **Persistent storage** - Keep processed results across sessions
4. **Query capabilities** - Filter, aggregate, and join with other
   tables

**Working with large datasets:**

For very large datasets, consider loading in batches or using streaming
mode in the `datasets` library before importing.
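One way to combine both suggestions is to stream rows from the Hub and insert them in fixed-size batches. This is a minimal sketch under stated assumptions, not a recommended implementation: `in_batches` and `import_streamed` are hypothetical helpers, the dataset is the one used in this recipe, and the default limits are arbitrary.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def in_batches(rows: Iterable[Any], batch_size: int) -> Iterator[list[Any]]:
    """Yield successive lists of at most batch_size items from rows."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    iterator = iter(rows)
    # islice pulls the next chunk; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


def import_streamed(table, max_rows: int = 1000, batch_size: int = 100) -> None:
    """Stream rows from the Hub and insert them into an existing table."""
    # Imported here so in_batches() stays usable without the datasets library.
    from datasets import load_dataset

    stream = load_dataset(
        'cornell-movie-review-data/rotten_tomatoes',
        split='train',
        streaming=True,
    )
    # Streaming yields plain dicts; 'label' arrives as an integer class id.
    for batch in in_batches(islice(stream, max_rows), batch_size):
        table.insert(batch)
```

The target table needs a matching schema, for example one created with `pxt.create_table('hf_demo/reviews_streamed', {'text': pxt.String, 'label': pxt.Int})`; the table name is a placeholder. Only one batch is held in memory at a time, which is the point of the approach.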

## See also

* [Import CSV
  files](/howto/cookbooks/data/data-import-csv) -
  For CSV and Excel imports
* [Semantic text
  search](/howto/cookbooks/search/search-semantic-text) -
  Add embeddings to text data
* [Hugging Face integration
  notebook](/howto/providers/working-with-hugging-face) -
  Full integration guide
