Import images, videos, and audio files from S3, GCS, HTTP URLs, or local
paths into Pixeltable tables.
Problem
You have media files stored in cloud storage (S3, GCS) or accessible via
HTTP URLs. You need to process these files with AI models without
downloading them all upfront.
Solution
What’s in this recipe:
- Reference media files by URL (S3, HTTP, local paths)
- Automatic caching of remote files on access
- Process files lazily without bulk downloads
You insert media URLs as references. Pixeltable stores the URLs and
automatically downloads/caches files when you access them through
queries or computed columns.
Setup
%pip install -qU pixeltable boto3
import pixeltable as pxt

# Create a fresh directory
pxt.drop_dir('cloud_demo', force=True)
pxt.create_dir('cloud_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory ‘cloud_demo’.
<pixeltable.catalog.dir.Dir at 0x302d8bd10>
Load images from HTTP URLs
Reference images by URL—Pixeltable downloads them on demand:
# Create a table with image column
images = pxt.create_table('cloud_demo.images', {'image': pxt.Image})
Created table ‘images’.
# Insert images by URL (HTTP)
image_urls = [
'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg',
'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000090.jpg',
'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000106.jpg',
]
images.insert([{'image': url} for url in image_urls])
Inserting rows into `images`: 3 rows [00:00, 608.28 rows/s]
Inserted 3 rows with 0 errors.
3 rows inserted, 6 values computed.
# View images - files are downloaded and cached on access
images.collect()
Load videos from S3
Reference videos in S3 buckets (using the public Multimedia Commons bucket):
# Create a table with video column
videos = pxt.create_table('cloud_demo.videos', {'video': pxt.Video})
Created table ‘videos’.
# Insert videos by S3 URL (public bucket, no credentials needed)
s3_prefix = 's3://multimedia-commons/'
video_paths = [
'data/videos/mp4/ffe/ffb/ffeffbef41bbc269810b2a1a888de.mp4',
'data/videos/mp4/ffe/feb/ffefebb41485539f964760e6115fbc44.mp4',
]
videos.insert([{'video': s3_prefix + path} for path in video_paths])
Inserting rows into `videos`: 2 rows [00:00, 917.09 rows/s]
Inserted 2 rows with 0 errors.
2 rows inserted, 4 values computed.
# View videos - downloaded and cached on access
videos.collect()
Process remote media with computed columns—files are fetched
automatically:
# Add computed columns for image properties
images.add_computed_column(width=images.image.width)
images.add_computed_column(height=images.image.height)
Added 3 column values with 0 errors.
Added 3 column values with 0 errors.
3 rows updated, 6 values computed.
# View with computed properties
images.select(images.image, images.width, images.height).collect()
Pixeltable supports multiple URL schemes for media files: S3 and GCS object URLs, HTTP URLs, and local file paths.*
*Configure AWS/GCP credentials via environment variables or config files.
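For example, local files can be referenced the same way as remote URLs; the path below is a placeholder, not a file shipped with this recipe:
# Insert an image from a local path (placeholder -- point this at a real file)
images.insert([{'image': '/path/to/local/photo.jpg'}])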
Explanation
How caching works:
- URLs are stored as references in the table
- Files are downloaded on first access (query or computed column)
- Downloaded files are cached in ~/.pixeltable/file_cache/
- Cache uses LRU eviction when space is needed
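A quick way to observe the cache (timings are illustrative and will vary): the first collect() fetches any files not yet cached, while a repeat collect() reads them from the local cache:
import time

start = time.time()
images.collect()  # downloads any files not yet in the cache
print(f'first access: {time.time() - start:.2f}s')

start = time.time()
images.collect()  # repeat access is served from ~/.pixeltable/file_cache/
print(f'cached access: {time.time() - start:.2f}s')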
Benefits of URL-based storage:
- Lazy loading - Only download files when needed
- Deduplication - Same URL is cached once
- Incremental processing - Add files without bulk downloads
- Cloud-native - Works directly with object storage
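As a sketch of incremental processing: appending a row later only fetches and processes that row, and its computed columns are evaluated automatically. Here we reuse one of the sample image URLs from above, so the cached copy is reused rather than downloaded again:
# Append one more image; only this row's computed columns are evaluated
images.insert([{'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000090.jpg'}])
images.select(images.image, images.width, images.height).collect()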
For private S3 buckets:
Configure AWS credentials using standard methods:
- Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
- AWS credentials file (~/.aws/credentials)
- IAM roles (when running on EC2/ECS)
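A minimal sketch of the environment-variable approach; the key values and bucket path are placeholders (in practice, prefer a credentials file or IAM role over hard-coding secrets):
import os

# Placeholder credentials -- set these securely outside your code in practice
os.environ['AWS_ACCESS_KEY_ID'] = '<your-access-key-id>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<your-secret-access-key>'

# Reference objects in a private bucket as usual (hypothetical bucket and key)
videos.insert([{'video': 's3://my-private-bucket/clips/example.mp4'}])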
See also