Data Types and Formats

A guide to working with Pixeltable's type system and supported data formats. Pixeltable provides a unified interface for handling various data types common in AI/ML workflows.

Data Types

| Pixeltable Type | Python Type | Description | Example |
|---|---|---|---|
| pxt.String | str | Text data | {'text': pxt.String} |
| pxt.Int | int | Integer numbers | {'count': pxt.Int} |
| pxt.Float | float | Floating-point numbers | {'score': pxt.Float} |
| pxt.Bool | bool | Boolean values | {'valid': pxt.Bool} |
| pxt.Timestamp | datetime.datetime | Date and time values | {'date': pxt.Timestamp} |
| pxt.Json | list, dict | Structured data like lists or dictionaries | {'metadata': pxt.Json} |
| pxt.Array | np.ndarray | N-dimensional arrays (requires shape and dtype) | {'embedding': pxt.Array[(512,), pxt.Float]} |
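
As a rough illustration of the mapping in the table, the sketch below picks a Pixeltable type name for a plain Python value. The helper `pixeltable_type_name` is hypothetical and not part of Pixeltable. It returns type names as strings so that it runs without Pixeltable installed, and it leaves out `pxt.Array` to avoid a NumPy dependency.

```python
from datetime import datetime


def pixeltable_type_name(value):
    """Return the name of the type from the table above that fits `value`.

    Hypothetical helper for illustration only; not a Pixeltable API.
    """
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return 'pxt.Bool'
    if isinstance(value, int):
        return 'pxt.Int'
    if isinstance(value, float):
        return 'pxt.Float'
    if isinstance(value, str):
        return 'pxt.String'
    if isinstance(value, datetime):
        return 'pxt.Timestamp'
    if isinstance(value, (list, dict)):
        return 'pxt.Json'
    raise TypeError(f'no matching type for {type(value).__name__}')


# Derive type names for every field of a sample row.
row = {'text': 'hello', 'count': 3, 'score': 0.5, 'valid': True}
schema_names = {name: pixeltable_type_name(v) for name, v in row.items()}
```

The order of the checks matters: testing `int` before `bool` would label `True` as an integer.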

Media Types

| Pixeltable Type | Python Type | Supported Formats | Common Use Cases |
|---|---|---|---|
| pxt.Image | PIL.Image.Image | Image File Formats | Computer vision, image processing |
| pxt.Video | str (path) | MP4 | Video analysis, frame extraction |
| pxt.Audio | str (path) | MP3 | Speech recognition, audio analysis |
| pxt.Document | str (path) | PDF, MD, HTML, TXT | Text extraction, RAG applications |

For practical examples of working with these types, see the sections below.

Creating Tables with Media Types

import pixeltable as pxt

# Example from Object Detection Tutorial
videos = pxt.create_table('video_analysis', {
    'video': pxt.Video,
    'thumbnail': pxt.Image[(224, 224), 'RGB'],  # With constraints
    'transcript': pxt.Document,
    'audio': pxt.Audio
})

Specialized Types

Array Type (Essential for ML Models)

# Example from RAG Tutorial
embeddings = pxt.create_table('embeddings', {
    'text': pxt.String,
    'vector': pxt.Array[(768,), pxt.Float]  # For embeddings
})

# For variable-sized tensors
features = pxt.create_table('features', {
    'tensor': pxt.Array[(None, 512), pxt.Float]  # Dynamic first dimension
})
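
To make the meaning of a `None` dimension concrete, here is a minimal sketch of how a declared shape such as `(None, 512)` could be compared against a concrete array shape. The function `shape_matches` is hypothetical and only reflects a reading of "dynamic first dimension"; it is not Pixeltable's own validation code.

```python
def shape_matches(spec, actual):
    """Check a concrete shape against a declared one; None in `spec` matches any size."""
    # The number of dimensions must agree even when some sizes are dynamic.
    if len(spec) != len(actual):
        return False
    return all(s is None or s == a for s, a in zip(spec, actual))


declared = (None, 512)
ok_batch = shape_matches(declared, (32, 512))   # first dimension is free
bad_width = shape_matches(declared, (32, 256))  # second dimension is fixed at 512
bad_rank = shape_matches(declared, (512,))      # wrong number of dimensions
```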

JSON Type (For Structured Data)

# Example from Object Detection Tutorial
detections = pxt.create_table('detections', {
    'model_output': pxt.Json,  # Stores bounding boxes, labels, scores
    'metadata': pxt.Json
})
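
For a sense of what such a column might hold, the sketch below builds a nested detection result and reads it back. The field names (`boxes`, `labels`, `scores`) and the values are made up for illustration and are not a schema that Pixeltable prescribes. The round trip through the standard `json` module is only a quick way to confirm the value is JSON-serializable.

```python
import json

# Hypothetical detector output; field names and values are illustrative only.
model_output = {
    'boxes': [[12.0, 30.5, 110.0, 220.0], [200.0, 40.0, 260.0, 180.0]],
    'labels': ['person', 'dog'],
    'scores': [0.97, 0.81],
}

# Serialize and parse again to confirm the structure is plain JSON.
round_tripped = json.loads(json.dumps(model_output))

# Pair each label with its score and keep only the confident detections.
confident = [
    label
    for label, score in zip(model_output['labels'], model_output['scores'])
    if score >= 0.9
]
```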

Working with Types

Image Type Constraints

# From Computer Vision Examples
images = pxt.create_table('image_processing', {
    'original': pxt.Image,  # Any valid image
    'thumbnail': pxt.Image[(224, 224)],  # ResNet input size
    'rgb_image': pxt.Image['RGB'],
    'grayscale': pxt.Image['L']
})

Media File Handling

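# `t` is an existing table with 'image', 'document', and 'video' columns
# Rows are passed as a list of dicts; various input formats are supported
t.insert([{'image': 'https://example.com/image.jpg'}])  # URLs
t.insert([{'document': '/path/to/doc.pdf'}])            # Local files
t.insert([{'video': 's3://my-bucket/video.mp4'}])       # Cloud storage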
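
The three kinds of location above can be told apart by their URL scheme. The sketch below does this with the standard library. The function `classify_source` and its labels are hypothetical, and Pixeltable's own handling of media locations may differ.

```python
from urllib.parse import urlparse


def classify_source(location):
    """Label a media location as 'url', 'cloud', or 'local' based on its scheme."""
    scheme = urlparse(location).scheme.lower()
    if scheme in ('http', 'https'):
        return 'url'
    if scheme == 's3':
        return 'cloud'
    # No scheme (or an explicit file:// scheme) is treated as a local path.
    if scheme in ('', 'file'):
        return 'local'
    raise ValueError(f'unsupported scheme: {scheme!r}')


sources = [
    'https://example.com/image.jpg',
    '/path/to/doc.pdf',
    's3://my-bucket/video.mp4',
]
kinds = [classify_source(s) for s in sources]
```

Windows paths with a drive letter (for example `C:\data\img.jpg`) would need extra handling, since `urlparse` reads the drive letter as a scheme.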

Data Import/Export

Import Operations

import pixeltable as pxt

# CSV Import with type overrides
customers = pxt.io.import_csv(
    'customers',
    'path/to/data.csv',
    schema_overrides={
        'balance': pxt.Float,
        'status': pxt.String
    }
)

# Excel Import
sales = pxt.io.import_excel(
    'sales_data',                  # Table name
    'sales_2024.xlsx',            # File path or URL
    schema_overrides={            # Optional type overrides
        'revenue': pxt.Float,
        'category': pxt.String
    }
)

# Parquet Import
transactions = pxt.io.import_parquet(
    'transactions',                        # Table name
    parquet_path='transactions.parquet',   # File path or URL (passed by keyword)
    schema_overrides={
        'amount': pxt.Float,
        'timestamp': pxt.Timestamp
    }
)

# Hugging Face Dataset Import
from datasets import load_dataset
dataset = load_dataset('mnist', split='train[:1000]')
mnist = pxt.io.import_huggingface_dataset(
    'mnist_samples',
    dataset,
    schema_overrides={
        'image': pxt.Image,
        'label': pxt.Int
    }
)

For more details on integrating with Hugging Face, see our Hugging Face Integration Guide.
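
The role of `schema_overrides` in the import calls above can be pictured as "infer a type per column, then let explicit overrides win". The sketch below shows that idea with the standard `csv` module. The helpers `infer_type_name` and `build_schema` are hypothetical, type names are plain strings so the code runs without Pixeltable, and Pixeltable's real inference rules may differ.

```python
import csv
import io


def infer_type_name(values):
    """Guess a column type name from string cells: Int, then Float, else String."""
    for type_name, caster in (('pxt.Int', int), ('pxt.Float', float)):
        try:
            for v in values:
                caster(v)
            return type_name
        except ValueError:
            continue
    return 'pxt.String'


def build_schema(csv_text, schema_overrides=None):
    """Infer a type name per column, then apply explicit overrides on top."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = rows[0].keys() if rows else []
    schema = {col: infer_type_name([r[col] for r in rows]) for col in columns}
    schema.update(schema_overrides or {})
    return schema


sample = 'name,balance,status\nAda,100,1\nLin,250,0\n'
inferred = build_schema(sample)
overridden = build_schema(sample, {'balance': 'pxt.Float', 'status': 'pxt.String'})
```

In this sample, `balance` and `status` contain only whole numbers, so plain inference picks an integer type for both. The overrides are what turn them into a float and a string column.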

Export Formats

# To pandas DataFrame
df = t.collect().to_pandas()

# To PyTorch Dataset (from Computer Vision Tutorial)
# Select the columns to export; image_format='pt' returns images as tensors
dataset = t.select(t.image, t.label).to_pytorch_dataset(image_format='pt')

# To COCO format (for object detection)
coco_data = t.to_coco_dataset()

Important Notes for Users

  1. Media Files:

    • Pixeltable handles downloading, caching, and format validation
    • Local paths are managed automatically
    • Supports URLs, local paths, and cloud storage (S3)
    • See Working with External Files
  2. Arrays:

    • Must specify shape and dtype
    • Use None for dynamic dimensions
    • Common in ML workflows for embeddings and features
    • See examples in RAG Operations Tutorial
  3. JSON Data:

    • Flexible storage for structured data
    • Commonly used for model outputs and metadata
    • Supports nested structures
    • See examples in Object Detection Tutorial
