Import images, videos, and audio files from S3, GCS, HTTP URLs, or local
paths into Pixeltable tables.
Problem
You have media files stored in cloud storage (S3, GCS) or accessible via
HTTP URLs. You need to process these files with AI models without
downloading them all upfront.
Solution
What’s in this recipe:
- Reference media files by URL (S3, HTTP, local paths)
- Automatic caching of remote files on access
- Process files lazily without bulk downloads
You insert media URLs as references. Pixeltable stores the URLs and
automatically downloads/caches files when you access them through
queries or computed columns.
Setup
%pip install -qU pixeltable boto3
import pixeltable as pxt

# Create a fresh directory
pxt.drop_dir('cloud_demo', force=True)
pxt.create_dir('cloud_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘cloud_demo’.
<pixeltable.catalog.dir.Dir at 0x302d8bd10>
Load images from HTTP URLs
Reference images by URL—Pixeltable downloads them on demand:
# Create a table with image column
images = pxt.create_table('cloud_demo.images', {'image': pxt.Image})
Created table ‘images’.
# Insert images by URL (HTTP)
image_urls = [
'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg',
'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000090.jpg',
'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000106.jpg',
]
images.insert([{'image': url} for url in image_urls])
Inserting rows into `images`: 3 rows [00:00, 608.28 rows/s]
Inserted 3 rows with 0 errors.
3 rows inserted, 6 values computed.
# View images - files are downloaded and cached on access
images.collect()
Load videos from S3
Reference videos in S3 buckets (using the public Multimedia Commons bucket):
# Create a table with video column
videos = pxt.create_table('cloud_demo.videos', {'video': pxt.Video})
Created table ‘videos’.
# Insert videos by S3 URL (public bucket, no credentials needed)
s3_prefix = 's3://multimedia-commons/'
video_paths = [
'data/videos/mp4/ffe/ffb/ffeffbef41bbc269810b2a1a888de.mp4',
'data/videos/mp4/ffe/feb/ffefebb41485539f964760e6115fbc44.mp4',
]
videos.insert([{'video': s3_prefix + path} for path in video_paths])
Inserting rows into `videos`: 2 rows [00:00, 917.09 rows/s]
Inserted 2 rows with 0 errors.
2 rows inserted, 4 values computed.
# View videos - downloaded and cached on access
videos.collect()
Process remote media with computed columns—files are fetched
automatically:
# Add computed columns for image properties
images.add_computed_column(width=images.image.width)
images.add_computed_column(height=images.image.height)
Added 3 column values with 0 errors.
Added 3 column values with 0 errors.
3 rows updated, 6 values computed.
# View with computed properties
images.select(images.image, images.width, images.height).collect()
Pixeltable supports multiple URL schemes for media files: S3 and GCS object URLs, HTTP URLs, and local file paths.*
*Configure AWS/GCP credentials via environment variables or config files.
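For example, local files can be referenced the same way as remote URLs; the path below is a placeholder, not a file shipped with this recipe:
# Insert an image from a local path (placeholder -- point this at a real file)
images.insert([{'image': '/path/to/local/photo.jpg'}])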
Explanation
How caching works:
- URLs are stored as references in the table
- Files are downloaded on first access (query or computed column)
- Downloaded files are cached in ~/.pixeltable/file_cache/
- Cache uses LRU eviction when space is needed
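A quick way to observe the cache (timings are illustrative and will vary): the first collect() fetches any files not yet cached, while a repeat collect() reads them from the local cache:
import time

start = time.time()
images.collect()  # downloads any files not yet in the cache
print(f'first access: {time.time() - start:.2f}s')

start = time.time()
images.collect()  # repeat access is served from ~/.pixeltable/file_cache/
print(f'cached access: {time.time() - start:.2f}s')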
Benefits of URL-based storage:
- Lazy loading - Only download files when needed
- Deduplication - Same URL is cached once
- Incremental processing - Add files without bulk downloads
- Cloud-native - Works directly with object storage
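As a sketch of incremental processing: appending a row later only fetches and processes that row, and its computed columns are evaluated automatically. Here we reuse one of the sample image URLs from above, so the cached copy is reused rather than downloaded again:
# Append one more image; only this row's computed columns are evaluated
images.insert([{'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000090.jpg'}])
images.select(images.image, images.width, images.height).collect()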
For private S3 buckets:
Configure AWS credentials using standard methods:
- Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
- AWS credentials file (~/.aws/credentials)
- IAM roles (when running on EC2/ECS)
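A minimal sketch of the environment-variable approach; the key values and bucket path are placeholders (in practice, prefer a credentials file or IAM role over hard-coding secrets):
import os

# Placeholder credentials -- set these securely outside your code in practice
os.environ['AWS_ACCESS_KEY_ID'] = '<your-access-key-id>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<your-secret-access-key>'

# Reference objects in a private bucket as usual (hypothetical bucket and key)
videos.insert([{'video': 's3://my-private-bucket/clips/example.mp4'}])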
See also