Section 1: Tables and Data Operations
Welcome to Section 1 of the Pixeltable Fundamentals tutorial, Tables and Data Operations. In this section, we’ll learn how to:- Create and manage tables: Understand Pixeltable’s table structure, create and modify tables, and work with table schemas
- Manipulate data: Insert, update, and delete data within tables, and retrieve data from tables into Python variables
- Filter and select data: Use
where(),select(), andorder_by()to query for specific rows and columns - Import data from CSV files and other file types
Tables
All data in Pixeltable is stored in tables. At a high level, a Pixeltable table behaves similarly to an ordinary SQL database table, but with many additional capabilities to support complex AI workflows. We’ll introduce those advanced capabilities gradually throughout this tutorial; in this section, the focus is on basic table and data operations. Tables in Pixeltable are grouped into directories, which are simply user-defined namespaces. The following command creates a new directory,fundamentals, which we’ll use to store the tables in our tutorial.
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory ‘fundamentals’.
Now let’s create our first table. To create a table, we must give it a
name and a schema that describes the table structure. Note that
prefacing the name with fundamentals causes it to be placed in our
newly-created directory.
Created table ‘films’.
To insert data into a table, we use the insert() method, passing it a
list of Python dicts.
Inserting rows into `films`: 3 rows [00:00, 572.84 rows/s]
Inserted 3 rows with 0 errors.
3 rows inserted, 3 values computed.
If you’re inserting just a single row, you can use an alternate syntax
that is sometimes more convenient.
Inserting rows into `films`: 1 rows [00:00, 318.76 rows/s]
Inserted 1 row with 0 errors.
1 row inserted, 1 value computed.
We can peek at the data in our table with the collect() method, which
retrieves all the rows in the table.
update() and delete() methods for modifying
and removing data from a table; we’ll see examples of them shortly.
Filtering and Selecting Data
Often you want to select only certain rows and/or certain columns in a table. You can do this with thewhere() and select() methods.
where() and
select(), such as films_t.year. These are column references that
point to specific columns within a table. In place of films_t.year,
you can also use dictionary syntax and type films_t['year'], which
means exactly the same thing but is sometimes more convenient.
revenue numbers
are given in millions of dollars. Let’s say we wanted to select revenue
in thousands of dollars instead; we could do that as follows:
col_1. You can
assign it a more informative name with Python keyword syntax:
Tables are Persistent
This is a good time to mention a few key differences between Pixeltable tables and other familiar datastructures, such as Python dicts or Pandas dataframes. First, Pixeltable is persistent. Unlike in-memory Python libraries such as Pandas, Pixeltable is a database. When you reset a notebook kernel or start a new Python session, you’ll have access to all the data you’ve stored previously in Pixeltable. Let’s demonstrate this by using the IPython%reset -f command to clear out all our notebook variables,
so that films_t is no longer defined.
NameError: name ‘films_t’ is not defined [0;31m---------------------------------------------------------------------------[0m [0;31mNameError[0m Traceback (most recent call last) Cell [0;32mIn[11], line 2[0m [1;32m 1[0m get_ipython()[38;5;241m.[39mrun_line_magic([38;5;124m’[39m[38;5;124mreset[39m[38;5;124m’[39m, [38;5;124m’[39m[38;5;124m-f[39m[38;5;124m’[39m) [0;32m----> 2[0m [43mfilms_t[49m[38;5;241m.[39mcollect() [38;5;66;03m# Throws an exception now[39;00m[0;31mNameError[0m: name ‘films_t’ is not definedThe
films_t variable (along with all other variables in our Python
session) has been cleared out - but that’s ok, because it wasn’t the
source of record for our data. The films_t variable is just a
reference to the underlying database table. We can recover it with the
get_table command, referencing the films table by name.
pxt.ls() command. Let’s use it to see the contents of the
fundamentals directory.
Note that if you’re running Pixeltable on colab
or kaggle, the database will persist only for as long as your
colab/kaggle session remains active. If you’re running it locally or on
your own server, then your database will persist indefinitely (until you
actively delete it).
Tables are Typed
The second major difference is that Pixeltable is strongly typed. Because Pixeltable is a database, every column has a data type: that’s why we specifiedString, Int, and Float for the three columns when
we created the table. These type specifiers are mandatory when
creating tables, and they become part of the table schema. You can
always see the table schema with the describe() method.
films_t to see the schema; its
output is identical to films_t.describe().
In addition to The media types
We’ll see examples of each of these types later in the tutorial.
String,
Int, and Float, Pixeltable provides several
additional data types:Bool, whose values are True or
False;Array for
numerical arrays;Json, for
lists or dicts that correspond to valid JSON structures;
andImage, Video, Audio, and
Document.Computed With. We’ll explain what this means in the next
section of the tutorial, Computed
Columns.
All of the methods we’ve discussed so far, such as insert() and
get_table(), are documented in the Pixeltable
SDK Documentation. The
following pages are particularly relevant to this section of the
tutorial:
- pixeltable package reference
- pxt.Table class reference
A Real-World Example: Earthquake Data
Now let’s dive a little deeper into Pixeltable’s data operations. To showcase all the features, it’ll be helpful to have a real-world dataset, rather than our toy dataset with four movies. The dataset we’ll be using consists of Earthquake data drawn from the US Geological Survey: all recorded Earthquakes that occurred within 100 km of Seattle, Washington, between January 1, 2023 and June 30, 2024. The dataset is in CSV format, and we can load it into Pixeltable by usingcreate_table() with the source parameter, which creates a new
Pixeltable table from the contents of a CSV file.
Created table ‘earthquakes’.
Inserting rows into `earthquakes`: 1823 rows [00:00, 19554.24 rows/s]
Inserted 1823 rows with 0 errors.
In Pixeltable, you can always import external
data by giving a URL instead of a local file path. This applies to CSV
datasets, media files (such images and video), and other types of
content. The URL will often be an
http:// URL, but it can
also be an s3:// URL referencing an S3 bucket.Pixeltable’s
create_table() function
with the source parameter can import data from various
formats including CSV, Excel, and Hugging Face datasets. You can also
use source to import from a Pandas dataframe. For more
details, see the
pixeltable.io
package reference.limit() method.
head()
and tail() methods. Pixeltable keeps track of the insertion order of
all its data, and head() and tail() will always return the earliest
inserted and most recently inserted rows in a table, respectively.
head(n) and
limit(n).collect() appear similar in this example. But
head() always returns the earliest rows in a table,
whereas limit() makes no promises about the ordering of its
results (unless you specify an order_by() clause - more on
this below).schema_overrides parameter of import_csv().
The following examples showcase some common data operations.
1823
isin operator to select just those values
that appear within a particular list:
>= and isin, a Pixeltable
where clause can also contain more complex operations. For example,
the location column in our dataset is a string that contains a lot of
information, but in a relatively unstructured way. Suppose we wanted to
see all Earthquakes in the vicinity of Rainier, Washington; one way to
do this is with the contains() method:
max() and min():
Extracting Data from Tables into Python/Pandas
Sometimes it’s handy to pull out data from a table into a Python object. We’ve actually already done this; the call tocollect() returns an
in-memory result set, which we can then dereference in various ways. For
example:
{‘id’: 0,
‘magnitude’: 1.15,
‘location’: ‘10 km NW of Belfair, Washington’,
‘timestamp’: datetime.datetime(2023, 1, 1, 8, 10, 37, 50000, tzinfo=zoneinfo.ZoneInfo(key=‘America/Los_Angeles’)),
‘longitude’: -122.93,
‘latitude’: 47.51}
[datetime.datetime(2023, 1, 1, 8, 10, 37, 50000, tzinfo=zoneinfo.ZoneInfo(key=‘America/Los_Angeles’)),
datetime.datetime(2023, 1, 2, 1, 2, 43, 950000, tzinfo=zoneinfo.ZoneInfo(key=‘America/Los_Angeles’)),
datetime.datetime(2023, 1, 2, 12, 5, 1, 420000, tzinfo=zoneinfo.ZoneInfo(key=‘America/Los_Angeles’)),
datetime.datetime(2023, 1, 2, 12, 45, 14, 220000, tzinfo=zoneinfo.ZoneInfo(key=‘America/Los_Angeles’)),
datetime.datetime(2023, 1, 2, 13, 19, 27, 200000, tzinfo=zoneinfo.ZoneInfo(key=‘America/Los_Angeles’))]
count 5.000000
mean 0.744000
std 0.587988
min 0.200000
25% 0.290000
50% 0.520000
75% 1.150000
max 1.560000
Name: magnitude, dtype: float64
collect() without a preceding limit() returns the entire contents of
a query or table. Be careful! For very large tables, this could result
in out-of-memory errors. In this example, the 1823 rows in the table fit
comfortably into a dataframe.
count 1823.000000
mean 0.900378
std 0.625492
min -0.830000
25% 0.420000
50% 0.850000
75% 1.310000
max 4.300000
Name: magnitude, dtype: float64
Adding Columns
Like other database tables, Pixeltable tables aren’t fixed entities: they’re meant to evolve over time. Suppose we want to add a new column to hold user-specified comments about particular earthquake events. We can do this with theadd_column() method:
Added 1823 column values with 0 errors.
1823 rows updated, 1823 values computed.
Here, note is the column name, and pxt.String specifies the type of
the new column.
Added 1823 column values with 0 errors.
1823 rows updated, 1823 values computed.
Let’s have a look at the revised schema.
Updating Rows in a Table
Table rows can be modified and deleted with the SQL-likeupdate() and
delete() commands.
Inserting rows into `earthquakes`: 2 rows [00:00, 366.84 rows/s]
2 rows updated, 4 values computed.
update() can also accept an expression, rather than a constant value.
For example, suppose we wanted to shorten the location strings by
replacing every occurrence of Washington with WA. One way to do this
is with an update() clause, using a Pixeltable expression with the
replace() method.
Inserting rows into `earthquakes`: 1823 rows [00:00, 21494.07 rows/s]
1823 rows updated, 1823 values computed.
update() clause takes a Python
dictionary, but its values can be either constants such as
'[email protected]', or more complex expressions such as
eq_t.location.replace('Washington', 'WA'). Also notice that if
update() appears without a where() clause, then every row in the
table will be updated, as in the preceding example.
Batch Updates
Thebatch_update() method provides an alternative way to update
multiple rows with different values. With a batch_update(), the
contents of each row are specified by individual dicts, rather than
according to a formula. Here’s a toy example that shows batch_update()
in action.
Inserting rows into `earthquakes`: 3 rows [00:00, 984.58 rows/s]
3 rows updated, 3 values computed.
Deleting Rows
To delete rows from a table, use thedelete() method.
587 rows deleted.
1236
Don’t forget to specify a where() clause when using delete()! If you
run delete() without a where() clause, the entire contents of the
table will be deleted.
1236 rows deleted.
0
Table Versioning
Every table in Pixeltable is versioned: some or all of its modification history is preserved. We’ve seen a reference to this already;pxt.ls()
will show the most recent version along with each table it lists.
revert() to undo the
most recent change to a table and roll back to the previous version.
Let’s try it out: we’ll use it to revert the successive delete() calls
that we just executed.
1236
1823
Multimodal Data
In addition to the structured data we’ve been exploring so far in this tutorial, Pixeltable has native support for media types: images, video, audio, and unstructured documents such as pdfs. Media support is one of Pixeltable’s core capabilities, and we’ll have much more to say about it the upcoming Unstructured Data section of this tutorial. For now, we’ll just give one example to show how media data lives side-by-side with structured data in Pixeltable.Added 1823 column values with 0 errors.
Inserting rows into `earthquakes`: 1 rows [00:00, 192.79 rows/s]
1 row updated, 1 value computed.
Note that in Pixeltable, you can always insert
images into a table by giving the file path or URL of the image (as a
string). It’s not necessary to load the image first; Pixeltable will
manage the loading and caching of images in the background. The same
applies to other media data such as documents and videos.
Directory Hierarchies
So far we’ve only seen an example of a single directory with a table inside it, but one can also put directories inside other directories, in whatever fashion makes the most sense for a given application.Created directory ‘fundamentals.subdir’.
Created directory ‘fundamentals.subdir.subsubdir’.
Created table ‘my_table’.
Deleting Columns, Tables, and Directories
drop_column(), drop_table(), and drop_dir() are used to delete
columns, tables, and directories, respectively.