This documentation page is also available as an interactive notebook. You can launch the notebook in
Kaggle or Colab, or download it for use with an IDE or local Jupyter installation, by clicking one of the
above links.
Load columnar data from Parquet files into Pixeltable tables for
processing and analysis.
Problem
You have data stored in Parquet format—a common format for analytics,
data lakes, and ML pipelines. You need to load this data for processing
with AI models or combining with other data sources.
Solution
What’s in this recipe:
- Import Parquet files directly into tables
- Export tables to Parquet for external tools
- Handle schema type overrides
You use pxt.create_table() with a source parameter to create a table
from a Parquet file. Pixeltable infers column types from the Parquet
schema automatically.
Setup
%pip install -qU pixeltable pyarrow pandas
import pixeltable as pxt
import pandas as pd
import tempfile
from pathlib import Path
Create sample Parquet file
First, create a sample Parquet file to demonstrate the import process:
# Create sample data
sample_data = pd.DataFrame({
'product_id': [1, 2, 3, 4, 5],
'name': ['Widget A', 'Widget B', 'Gadget X', 'Gadget Y', 'Tool Z'],
'price': [29.99, 39.99, 149.99, 199.99, 79.99],
'category': ['widgets', 'widgets', 'gadgets', 'gadgets', 'tools'],
'in_stock': [True, False, True, True, False]
})
# Save to temporary Parquet file
temp_dir = tempfile.mkdtemp()
parquet_path = Path(temp_dir) / 'products.parquet'
sample_data.to_parquet(parquet_path, index=False)
sample_data
Import Parquet file
Use create_table with the source parameter to create a table
directly from the Parquet file:
# Create a fresh directory
pxt.drop_dir('parquet_demo', force=True)
pxt.create_dir('parquet_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘parquet_demo’.
<pixeltable.catalog.dir.Dir at 0x17f0ca920>
# Import Parquet file into a new table
products = pxt.create_table(
'parquet_demo.products',
source=str(parquet_path)
)
Created table ‘products’.
Inserting rows into `products`: 0 rows [00:00, ? rows/s]
Inserting rows into `products`: 5 rows [00:00, 653.18 rows/s]
Inserted 5 rows with 0 errors.
# View imported data
products.collect()
Add computed columns
Once imported, you can add computed columns like any other Pixeltable
table:
# Add a computed column for discounted price
products.add_computed_column(sale_price=products.price * 0.9)
Added 5 column values with 0 errors.
5 rows updated, 10 values computed.
# View with computed column
products.select(products.name, products.price, products.sale_price).collect()
Import with primary key
Specify a primary key when you need upsert behavior or unique
constraints:
# Import with a primary key
products_pk = pxt.create_table(
'parquet_demo.products_with_pk',
source=str(parquet_path),
primary_key='product_id'
)
Created table ‘products_with_pk’.
Inserting rows into `products_with_pk`: 0 rows [00:00, ? rows/s]
Inserting rows into `products_with_pk`: 5 rows [00:00, 1548.97 rows/s]
Inserted 5 rows with 0 errors.
# View the table
products_pk.collect()
Export table to Parquet
Export your processed data back to Parquet for use with other tools:
# Export to Parquet (note: image columns require inline_images=True)
export_path = Path(temp_dir) / 'exported_products'
pxt.io.export_parquet(
products.select(products.name, products.price, products.sale_price),
parquet_path=export_path
)
# Verify export by reading back
import pyarrow.parquet as pq
exported_table = pq.read_table(export_path)
exported_table.to_pandas()
Explanation
When to use Parquet import: whenever your data already lives in Parquet-based analytics stores, data lakes, or ML pipelines and you want to bring it into Pixeltable for processing with AI models or for combining with other data sources.
Key features:
- Automatic schema inference from Parquet metadata
- Support for partitioned datasets (directory of files)
- Export with pxt.io.export_parquet for interoperability
- Primary key support for upsert workflows
See also