# Import data from Parquet files

> Ingest Apache Parquet files into Pixeltable tables for fast columnar loading of large analytics, ML training, and feature-store datasets.

<a href="https://kaggle.com/kernels/welcome?src=https://github.com/pixeltable/pixeltable/blob/release/docs/release/howto/cookbooks/data/data-import-parquet.ipynb" id="openKaggle" target="_blank" rel="noopener noreferrer"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" alt="Open in Kaggle" style={{ display: 'inline', margin: '0px' }} noZoom /></a>  <a href="https://colab.research.google.com/github/pixeltable/pixeltable/blob/release/docs/release/howto/cookbooks/data/data-import-parquet.ipynb" id="openColab" target="_blank" rel="noopener noreferrer"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" style={{ display: 'inline', margin: '0px' }} noZoom /></a>  <a href="https://raw.githubusercontent.com/pixeltable/pixeltable/refs/tags/release/docs/release/howto/cookbooks/data/data-import-parquet.ipynb" id="downloadNotebook" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/%E2%AC%87-Download%20Notebook-blue" alt="Download Notebook" style={{ display: 'inline', margin: '0px' }} noZoom /></a>

<Tip>This documentation page is also available as an interactive notebook. You can launch the notebook in
Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the
above links.</Tip>

export const quartoRawHtml = [`
<table>
<thead>
<tr>
<th>Source</th>
<th>Size</th>
<th>Use case</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">embeddings.parquet</td>
<td style="vertical-align: middle;">1M vectors</td>
<td style="vertical-align: middle;">Add to similarity search</td>
</tr>
<tr>
<td style="vertical-align: middle;">transactions.parquet</td>
<td style="vertical-align: middle;">10M rows</td>
<td style="vertical-align: middle;">Analyze with computed columns</td>
</tr>
<tr>
<td style="vertical-align: middle;">features.parquet</td>
<td style="vertical-align: middle;">500K rows</td>
<td style="vertical-align: middle;">Combine with media data</td>
</tr>
</tbody>
</table>
`, `<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }
    .dataframe thead th {
        text-align: right;
    }
</style>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">product_id</th>
<th data-quarto-table-cell-role="th">name</th>
<th data-quarto-table-cell-role="th">price</th>
<th data-quarto-table-cell-role="th">category</th>
<th data-quarto-table-cell-role="th">in_stock</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">0</td>
<td style="vertical-align: middle;">1</td>
<td style="vertical-align: middle;">Widget A</td>
<td style="vertical-align: middle;">29.99</td>
<td style="vertical-align: middle;">widgets</td>
<td style="vertical-align: middle;">True</td>
</tr>
<tr>
<td style="vertical-align: middle;">1</td>
<td style="vertical-align: middle;">2</td>
<td style="vertical-align: middle;">Widget B</td>
<td style="vertical-align: middle;">39.99</td>
<td style="vertical-align: middle;">widgets</td>
<td style="vertical-align: middle;">False</td>
</tr>
<tr>
<td style="vertical-align: middle;">2</td>
<td style="vertical-align: middle;">3</td>
<td style="vertical-align: middle;">Gadget X</td>
<td style="vertical-align: middle;">149.99</td>
<td style="vertical-align: middle;">gadgets</td>
<td style="vertical-align: middle;">True</td>
</tr>
<tr>
<td style="vertical-align: middle;">3</td>
<td style="vertical-align: middle;">4</td>
<td style="vertical-align: middle;">Gadget Y</td>
<td style="vertical-align: middle;">199.99</td>
<td style="vertical-align: middle;">gadgets</td>
<td style="vertical-align: middle;">True</td>
</tr>
<tr>
<td style="vertical-align: middle;">4</td>
<td style="vertical-align: middle;">5</td>
<td style="vertical-align: middle;">Tool Z</td>
<td style="vertical-align: middle;">79.99</td>
<td style="vertical-align: middle;">tools</td>
<td style="vertical-align: middle;">False</td>
</tr>
</tbody>
</table>
`, `
</div>`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">product_id</th>
<th data-quarto-table-cell-role="th">name</th>
<th data-quarto-table-cell-role="th">price</th>
<th data-quarto-table-cell-role="th">category</th>
<th data-quarto-table-cell-role="th">in_stock</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">1</td>
<td style="vertical-align: middle;">Widget A</td>
<td style="vertical-align: middle;">29.99</td>
<td style="vertical-align: middle;">widgets</td>
<td style="vertical-align: middle;">True</td>
</tr>
<tr>
<td style="vertical-align: middle;">2</td>
<td style="vertical-align: middle;">Widget B</td>
<td style="vertical-align: middle;">39.99</td>
<td style="vertical-align: middle;">widgets</td>
<td style="vertical-align: middle;">False</td>
</tr>
<tr>
<td style="vertical-align: middle;">3</td>
<td style="vertical-align: middle;">Gadget X</td>
<td style="vertical-align: middle;">149.99</td>
<td style="vertical-align: middle;">gadgets</td>
<td style="vertical-align: middle;">True</td>
</tr>
<tr>
<td style="vertical-align: middle;">4</td>
<td style="vertical-align: middle;">Gadget Y</td>
<td style="vertical-align: middle;">199.99</td>
<td style="vertical-align: middle;">gadgets</td>
<td style="vertical-align: middle;">True</td>
</tr>
<tr>
<td style="vertical-align: middle;">5</td>
<td style="vertical-align: middle;">Tool Z</td>
<td style="vertical-align: middle;">79.99</td>
<td style="vertical-align: middle;">tools</td>
<td style="vertical-align: middle;">False</td>
</tr>
</tbody>
</table>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">name</th>
<th data-quarto-table-cell-role="th">price</th>
<th data-quarto-table-cell-role="th">sale_price</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">Widget A</td>
<td style="vertical-align: middle;">29.99</td>
<td style="vertical-align: middle;">26.991</td>
</tr>
<tr>
<td style="vertical-align: middle;">Widget B</td>
<td style="vertical-align: middle;">39.99</td>
<td style="vertical-align: middle;">35.991</td>
</tr>
<tr>
<td style="vertical-align: middle;">Gadget X</td>
<td style="vertical-align: middle;">149.99</td>
<td style="vertical-align: middle;">134.991</td>
</tr>
<tr>
<td style="vertical-align: middle;">Gadget Y</td>
<td style="vertical-align: middle;">199.99</td>
<td style="vertical-align: middle;">179.991</td>
</tr>
<tr>
<td style="vertical-align: middle;">Tool Z</td>
<td style="vertical-align: middle;">79.99</td>
<td style="vertical-align: middle;">71.991</td>
</tr>
</tbody>
</table>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th">product_id</th>
<th data-quarto-table-cell-role="th">name</th>
<th data-quarto-table-cell-role="th">price</th>
<th data-quarto-table-cell-role="th">category</th>
<th data-quarto-table-cell-role="th">in_stock</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">1</td>
<td style="vertical-align: middle;">Widget A</td>
<td style="vertical-align: middle;">29.99</td>
<td style="vertical-align: middle;">widgets</td>
<td style="vertical-align: middle;">True</td>
</tr>
<tr>
<td style="vertical-align: middle;">2</td>
<td style="vertical-align: middle;">Widget B</td>
<td style="vertical-align: middle;">39.99</td>
<td style="vertical-align: middle;">widgets</td>
<td style="vertical-align: middle;">False</td>
</tr>
<tr>
<td style="vertical-align: middle;">3</td>
<td style="vertical-align: middle;">Gadget X</td>
<td style="vertical-align: middle;">149.99</td>
<td style="vertical-align: middle;">gadgets</td>
<td style="vertical-align: middle;">True</td>
</tr>
<tr>
<td style="vertical-align: middle;">4</td>
<td style="vertical-align: middle;">Gadget Y</td>
<td style="vertical-align: middle;">199.99</td>
<td style="vertical-align: middle;">gadgets</td>
<td style="vertical-align: middle;">True</td>
</tr>
<tr>
<td style="vertical-align: middle;">5</td>
<td style="vertical-align: middle;">Tool Z</td>
<td style="vertical-align: middle;">79.99</td>
<td style="vertical-align: middle;">tools</td>
<td style="vertical-align: middle;">False</td>
</tr>
</tbody>
</table>
`, `<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }
    .dataframe thead th {
        text-align: right;
    }
</style>
`, `
<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">name</th>
<th data-quarto-table-cell-role="th">price</th>
<th data-quarto-table-cell-role="th">sale_price</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">0</td>
<td style="vertical-align: middle;">Widget A</td>
<td style="vertical-align: middle;">29.990000</td>
<td style="vertical-align: middle;">26.990999</td>
</tr>
<tr>
<td style="vertical-align: middle;">1</td>
<td style="vertical-align: middle;">Widget B</td>
<td style="vertical-align: middle;">39.990002</td>
<td style="vertical-align: middle;">35.991001</td>
</tr>
<tr>
<td style="vertical-align: middle;">2</td>
<td style="vertical-align: middle;">Gadget X</td>
<td style="vertical-align: middle;">149.990005</td>
<td style="vertical-align: middle;">134.990997</td>
</tr>
<tr>
<td style="vertical-align: middle;">3</td>
<td style="vertical-align: middle;">Gadget Y</td>
<td style="vertical-align: middle;">199.990005</td>
<td style="vertical-align: middle;">179.990997</td>
</tr>
<tr>
<td style="vertical-align: middle;">4</td>
<td style="vertical-align: middle;">Tool Z</td>
<td style="vertical-align: middle;">79.989998</td>
<td style="vertical-align: middle;">71.990997</td>
</tr>
</tbody>
</table>
`, `
</div>`, `
<table>
<thead>
<tr>
<th>Scenario</th>
<th>Recommendation</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: middle;">Data lake / analytics data</td>
<td style="vertical-align: middle;">Use <code>create_table(source=path)</code></td>
</tr>
<tr>
<td style="vertical-align: middle;">ML feature stores</td>
<td style="vertical-align: middle;">Use <code>create_table</code> with <code>primary_key</code></td>
</tr>
<tr>
<td style="vertical-align: middle;">Small datasets</td>
<td style="vertical-align: middle;">Consider CSV for simplicity</td>
</tr>
<tr>
<td style="vertical-align: middle;">Streaming data</td>
<td style="vertical-align: middle;">Use direct <code>insert()</code> instead</td>
</tr>
</tbody>
</table>
`];

Load columnar data from Parquet files into Pixeltable tables for
processing and analysis.

## Problem

You have data stored in Parquet, a format common in analytics,
data lakes, and ML pipelines. You need to load this data so you can
process it with AI models or combine it with other data sources.

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[0] }} />

## Solution

**What’s in this recipe:**

* Import Parquet files directly into tables
* Export tables to Parquet for external tools
* Add computed columns and a primary key to imported tables

Pass the path of a Parquet file as the `source` parameter of
`pxt.create_table()` to create a table from it. Pixeltable infers the
column types from the Parquet schema automatically.

### Setup

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
%pip install -qU pixeltable pyarrow pandas
```

### Create sample Parquet file

First, create a sample Parquet file to demonstrate the import process:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
import pandas as pd
import pixeltable as pxt
import tempfile
from pathlib import Path

# Create sample data
sample_data = pd.DataFrame(
    {
        'product_id': [1, 2, 3, 4, 5],
        'name': [
            'Widget A',
            'Widget B',
            'Gadget X',
            'Gadget Y',
            'Tool Z',
        ],
        'price': [29.99, 39.99, 149.99, 199.99, 79.99],
        'category': ['widgets', 'widgets', 'gadgets', 'gadgets', 'tools'],
        'in_stock': [True, False, True, True, False],
    }
)

# Save to temporary Parquet file
temp_dir = tempfile.mkdtemp()
parquet_path = Path(temp_dir) / 'products.parquet'
sample_data.to_parquet(parquet_path, index=False)
sample_data
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[1] }} />

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[2] }} />

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[3] }} />

### Import Parquet file

Use `create_table` with the `source` parameter to create a table
directly from the Parquet file:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Create a fresh directory
pxt.drop_dir('parquet_demo', force=True)
pxt.create_dir('parquet_demo')
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
  Created directory 'parquet\_demo'.
  \<pixeltable.catalog.dir.Dir at 0x17f0ca920>
</pre>

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Import Parquet file into a new table
products = pxt.create_table(
    'parquet_demo/products', source=str(parquet_path)
)
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Created table 'products'.

  Inserting rows into \`products\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`products\`: 5 rows \[00:00, 653.18 rows/s]
  Inserted 5 rows with 0 errors.
</pre>

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# View imported data
products.collect()
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[4] }} />

### Add computed columns

Once the data is imported, you can add computed columns just as you
would to any other Pixeltable table:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Add a computed column for discounted price
products.add_computed_column(sale_price=products.price * 0.9)
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Added 5 column values with 0 errors.
  5 rows updated, 10 values computed.
</pre>

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# View with computed column
products.select(
    products.name, products.price, products.sale_price
).collect()
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[5] }} />

### Import with primary key

Specify a primary key when you need upsert behavior or unique
constraints:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Import with a primary key
products_pk = pxt.create_table(
    'parquet_demo/products_with_pk',
    source=str(parquet_path),
    primary_key='product_id',
)
```

<pre style={{ 'margin': '-20px 20px 0px 20px', 'padding': '0px', 'background-color': 'transparent', 'color': 'black' }}>
  Created table 'products\_with\_pk'.

  Inserting rows into \`products\_with\_pk\`: 0 rows \[00:00, ? rows/s]
  Inserting rows into \`products\_with\_pk\`: 5 rows \[00:00, 1548.97 rows/s]
  Inserted 5 rows with 0 errors.
</pre>

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# View the table
products_pk.collect()
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[6] }} />

### Export table to Parquet

Export your processed data back to Parquet for use with other tools:

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Export to Parquet (note: image columns require inline_images=True)
export_path = Path(temp_dir) / 'exported_products'

pxt.io.export_parquet(
    products.select(products.name, products.price, products.sale_price),
    parquet_path=export_path,
)
```

```python theme={"theme":{"light":"light-plus","dark":"dark-plus"}}
# Verify export by reading back
import pyarrow.parquet as pq

exported_table = pq.read_table(export_path)
exported_table.to_pandas()
```

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[7] }} />

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[8] }} />

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[9] }} />
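The values read back differ slightly from the originals (for example, `39.99` comes back as `39.990002`). That is the pattern you would expect if the floats were written at 32-bit precision. This is an inference from the output above, not something the recipe states. The sketch below uses only the standard library to round the sample prices to 32-bit floats for comparison; `to_float32` is a helper defined here for illustration.

```python
import struct


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack('f', struct.pack('f', value))[0]


prices = [29.99, 39.99, 149.99, 199.99, 79.99]
rounded = [f'{to_float32(price):.6f}' for price in prices]
for price, as_float32 in zip(prices, rounded):
    print(f'{price} -> {as_float32}')
```

If downstream tools need the original precision, compare the exported values with a tolerance rather than by exact equality.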

## Explanation

**When to use Parquet import:**

<div style={{ 'margin': '0px 20px 0px 20px' }} dangerouslySetInnerHTML={{ __html: quartoRawHtml[10] }} />

**Key features:**

* Automatic schema inference from Parquet metadata
* Support for partitioned datasets (directory of files)
* Export with `pxt.io.export_parquet` for interoperability
* Primary key support for upsert workflows

## See also

* [Import CSV files](/howto/cookbooks/data/data-import-csv) -
  For CSV and Excel imports
* [Import JSON files](/howto/cookbooks/data/data-import-json) -
  For JSON data
