Import images, videos, and audio files from S3, GCS, HTTP URLs, or local paths into Pixeltable tables.

Problem

You have media files stored in cloud storage (S3, GCS) or accessible via HTTP URLs. You need to process these files with AI models without downloading them all upfront.

Solution

What’s in this recipe:
  • Reference media files by URL (S3, GCS, HTTP, local paths)
  • Automatic caching of remote files on access
  • Process files lazily without bulk downloads
You insert media URLs as references. Pixeltable stores only the URLs and automatically downloads and caches the files when you access them through queries or computed columns.

Setup

%pip install -qU pixeltable boto3
import pixeltable as pxt
# Create a fresh directory
pxt.drop_dir('cloud_demo', force=True)
pxt.create_dir('cloud_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory 'cloud_demo'.
<pixeltable.catalog.dir.Dir at 0x302d8bd10>

Load images from HTTP URLs

Reference images by URL—Pixeltable downloads them on demand:
# Create a table with image column
images = pxt.create_table('cloud_demo.images', {'image': pxt.Image})
Created table 'images'.
# Insert images by URL (HTTP)
image_urls = [
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000090.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000106.jpg',
]

images.insert([{'image': url} for url in image_urls])
Inserting rows into `images`: 3 rows [00:00, 608.28 rows/s]
Inserted 3 rows with 0 errors.
3 rows inserted, 6 values computed.
# View images - files are downloaded and cached on access
images.collect()
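Local file paths work the same way. A minimal sketch with a hypothetical path (substitute a file that exists on your machine):
# Insert a local file by absolute path (hypothetical path)
images.insert([{'image': '/path/to/local/photo.jpg'}])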

Load videos from S3

Reference videos in S3 buckets (using public Multimedia Commons bucket):
# Create a table with video column
videos = pxt.create_table('cloud_demo.videos', {'video': pxt.Video})
Created table 'videos'.
# Insert videos by S3 URL (public bucket, no credentials needed)
s3_prefix = 's3://multimedia-commons/'
video_paths = [
    'data/videos/mp4/ffe/ffb/ffeffbef41bbc269810b2a1a888de.mp4',
    'data/videos/mp4/ffe/feb/ffefebb41485539f964760e6115fbc44.mp4',
]

videos.insert([{'video': s3_prefix + path} for path in video_paths])
Inserting rows into `videos`: 2 rows [00:00, 917.09 rows/s]
Inserted 2 rows with 0 errors.
2 rows inserted, 4 values computed.
# View videos - downloaded and cached on access
videos.collect()
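Lazy fetching carries over to derived data. As a sketch, Pixeltable's FrameIterator can build a frame view over the remote videos; fps=1 samples one frame per second:
from pixeltable.iterators import FrameIterator

# Each video is downloaded and cached once; frames are then extracted locally
frames = pxt.create_view(
    'cloud_demo.frames',
    videos,
    iterator=FrameIterator.create(video=videos.video, fps=1),
)
frames.select(frames.frame).limit(3).collect()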

Add computed columns on remote media

Process remote media with computed columns—files are fetched automatically:
# Add computed columns for image properties
images.add_computed_column(width=images.image.width)
images.add_computed_column(height=images.image.height)
Added 3 column values with 0 errors.
Added 3 column values with 0 errors.
3 rows updated, 6 values computed.
# View with computed properties
images.select(images.image, images.width, images.height).collect()
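Computed columns can also transform the media itself. A sketch assuming Pixeltable's PIL-style image methods, where resize mirrors PIL.Image.resize:
# Store a thumbnail as a computed column - the remote file is fetched and cached first
images.add_computed_column(thumbnail=images.image.resize((64, 64)))
images.select(images.image, images.thumbnail).collect()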

Supported URL formats

Pixeltable supports multiple URL schemes for media files:
  • s3://bucket/key - Amazon S3 objects*
  • gs://bucket/key - Google Cloud Storage objects*
  • http:// and https:// - files served over HTTP(S)
  • Absolute local paths or file:// URLs - files on the local filesystem
*Configure AWS/GCP credentials via environment variables or config files.
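For example, a GCS reference uses the same insert call as S3. The bucket and object below are hypothetical; private buckets require GCP credentials:
# gs:// URLs follow the same pattern as s3:// (hypothetical bucket and object)
videos.insert([{'video': 'gs://my-bucket/clips/intro.mp4'}])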

Explanation

How caching works:
  1. URLs are stored as references in the table
  2. Files are downloaded on first access (query or computed column); see the timing sketch after this list
  3. Downloaded files are cached in ~/.pixeltable/file_cache/
  4. Cache uses LRU eviction when space is needed
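A quick way to observe the cache, using only the standard library and the images table created above (if the files are already cached, both timings will be similarly fast):
import time

# The first collect() downloads any uncached files; the repeat is served locally
start = time.time()
images.collect()
first_access = time.time() - start

start = time.time()
images.collect()
cached_access = time.time() - start

print(f'first: {first_access:.2f}s, cached: {cached_access:.2f}s')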
Benefits of URL-based storage:
  • Lazy loading - Only download files when needed
  • Deduplication - Same URL is cached once
  • Incremental processing - Add files without bulk downloads
  • Cloud-native - Works directly with object storage
For private S3 buckets: Configure AWS credentials using any of the standard methods (a sketch follows the list):
  • Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  • AWS credentials file (~/.aws/credentials)
  • IAM roles (when running on EC2/ECS)
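A minimal sketch of the environment-variable route, with placeholder values. Set these before the first S3 access so boto3 picks them up:
import os

# Placeholder credentials - substitute your own; never commit real keys
os.environ['AWS_ACCESS_KEY_ID'] = 'YOUR_ACCESS_KEY_ID'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'YOUR_SECRET_ACCESS_KEY'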

See also