Section 1: Tables and Data Operations
Welcome to Section 1 of the Pixeltable Fundamentals tutorial, Tables and Data Operations. In this section, we’ll learn how to:- Create and manage tables: Understand Pixeltable’s table structure, create and modify tables, and work with table schemas
- Manipulate data: Insert, update, and delete data within tables, and retrieve data from tables into Python variables
- Filter and select data: Use
where(),select(), andorder_by()to query for specific rows and columns - Import data from CSV files and other file types
Tables
All data in Pixeltable is stored in tables. At a high level, a Pixeltable table behaves similarly to an ordinary SQL database table, but with many additional capabilities to support complex AI workflows. We’ll introduce those advanced capabilities gradually throughout this tutorial; in this section, the focus is on basic table and data operations. Tables in Pixeltable are grouped into directories, which are simply user-defined namespaces. The following command creates a new directory,fundamentals, which we’ll use to store the tables in our tutorial.
fundamentals causes it to be placed in our
newly-created directory.
insert() method, passing it a
list of Python dicts.
collect() method, which
retrieves all the rows in the table.
| film_name | year | revenue |
|---|---|---|
| Jurassic Park | 1993 | 1037.5 |
| Titanic | 1997 | 2257.8 |
| Avengers: Endgame | 2019 | 2797.5 |
| Inside Out 2 | 2024 | 1462.7 |
update() and delete() methods for modifying
and removing data from a table; we’ll see examples of them shortly.
Filtering and Selecting Data
Often you want to select only certain rows and/or certain columns in a table. You can do this with thewhere() and select() methods.
| film_name | year | revenue |
|---|---|---|
| Titanic | 1997 | 2257.8 |
| Avengers: Endgame | 2019 | 2797.5 |
| film_name | year |
|---|---|
| Jurassic Park | 1993 |
| Titanic | 1997 |
| Avengers: Endgame | 2019 |
| Inside Out 2 | 2024 |
where() and
select(), such as films_t.year. These are column references that
point to specific columns within a table. In place of films_t.year,
you can also use dictionary syntax and type films_t['year'], which
means exactly the same thing but is sometimes more convenient.
| film_name | year |
|---|---|
| Jurassic Park | 1993 |
| Titanic | 1997 |
| Avengers: Endgame | 2019 |
| Inside Out 2 | 2024 |
revenue numbers
are given in millions of dollars. Let’s say we wanted to select revenue
in thousands of dollars instead; we could do that as follows:
| film_name | col_1 |
|---|---|
| Jurassic Park | 1037500. |
| Titanic | 2257800. |
| Avengers: Endgame | 2797500. |
| Inside Out 2 | 1462700. |
col_1. You can
assign it a more informative name with Python keyword syntax:
| film_name | revenue_thousands |
|---|---|
| Jurassic Park | 1037500. |
| Titanic | 2257800. |
| Avengers: Endgame | 2797500. |
| Inside Out 2 | 1462700. |
Tables are Persistent
This is a good time to mention a few key differences between Pixeltable tables and other familiar datastructures, such as Python dicts or Pandas dataframes. First, Pixeltable is persistent. Unlike in-memory Python libraries such as Pandas, Pixeltable is a database. When you reset a notebook kernel or start a new Python session, you’ll have access to all the data you’ve stored previously in Pixeltable. Let’s demonstrate this by using the IPython%reset -f command to clear out all our notebook variables,
so that films_t is no longer defined.
films_t variable (along with all other variables in our Python
session) has been cleared out - but that’s ok, because it wasn’t the
source of record for our data. The films_t variable is just a
reference to the underlying database table. We can recover it with the
get_table command, referencing the films table by name.
| film_name | year | revenue |
|---|---|---|
| Jurassic Park | 1993 | 1037.5 |
| Titanic | 1997 | 2257.8 |
| Avengers: Endgame | 2019 | 2797.5 |
| Inside Out 2 | 2024 | 1462.7 |
pxt.ls() command. Let’s use it to see the contents of the
fundamentals directory.
| Name | Kind | Version | Base | |
|---|---|---|---|---|
| films | table | 2 |
Tables are Typed
The second major difference is that Pixeltable is strongly typed. Because Pixeltable is a database, every column has a data type: that’s why we specifiedString, Int, and Float for the three columns when
we created the table. These type specifiers are mandatory when
creating tables, and they become part of the table schema. You can
always see the table schema with the describe() method.
| table ‘fundamentals.films’ |
| Column Name | Type | Computed With |
|---|---|---|
| film_name | String | |
| year | Int | |
| revenue | Float |
films_t to see the schema; its
output is identical to films_t.describe().
| table 'fundamentals.films' |
| Column Name | Type | Computed With |
|---|---|---|
| film_name | String | |
| year | Int | |
| revenue | Float |
String, Int, and
Float, Pixeltable provides several additional data types:
Bool, whose values are True or
False;
Computed With. We’ll explain what this means in the next
section of the tutorial, Computed Columns.
All of the methods we’ve discussed so far, such as insert() and
get_table(), are documented in the Pixeltable
API Documentation. The
following pages are particularly relevant to this section of the
tutorial: -
pixeltable
package reference -
pxt.Table class
reference - API Cheat
Sheet
A Real-World Example: Earthquake Data
Now let’s dive a little deeper into Pixeltable’s data operations. To showcase all the features, it’ll be helpful to have a real-world dataset, rather than our toy dataset with four movies. The dataset we’ll be using consists of Earthquake data drawn from the US Geological Survey: all recorded Earthquakes that occurred within 100 km of Seattle, Washington, between January 1, 2023 and June 30, 2024. The dataset is in CSV format, and we can load it into Pixeltable by usingcreate_table() with the source parameter, which creates a new
Pixeltable table from the contents of a CSV file.
http:// URL, but it can also be an s3://
URL referencing an S3 bucket.
Pixeltable’s create_table() function with the
source parameter can import data from various formats
including CSV, Excel, and Hugging Face datasets. You can also use
source to import from a Pandas dataframe. For more details,
see the
pixeltable.io
package reference.
Let’s have a peek at our new dataset. The dataset contains 1823 rows,
and we probably don’t want to display them all at once. We can limit our
query to fewer rows with the limit() method.
| id | magnitude | location | timestamp | longitude | latitude |
|---|---|---|---|---|---|
| 0 | 1.15 | 10 km NW of Belfair, Washington | 2023-01-01 08:10:37.050000-08:00 | -122.93 | 47.51 |
| 1 | 0.29 | 23 km ENE of Ashford, Washington | 2023-01-02 01:02:43.950000-08:00 | -121.76 | 46.85 |
| 2 | 0.2 | 23 km ENE of Ashford, Washington | 2023-01-02 12:05:01.420000-08:00 | -121.75 | 46.86 |
| 3 | 0.52 | 15 km NNE of Ashford, Washington | 2023-01-02 12:45:14.220000-08:00 | -121.95 | 46.89 |
| 4 | 1.56 | 0 km WSW of Esperance, Washington | 2023-01-02 13:19:27.200000-08:00 | -122.36 | 47.79 |
head()
and tail() methods. Pixeltable keeps track of the insertion order of
all its data, and head() and tail() will always return the earliest
inserted and most recently inserted rows in a table, respectively.
| id | magnitude | location | timestamp | longitude | latitude |
|---|---|---|---|---|---|
| 0 | 1.15 | 10 km NW of Belfair, Washington | 2023-01-01 08:10:37.050000-08:00 | -122.93 | 47.51 |
| 1 | 0.29 | 23 km ENE of Ashford, Washington | 2023-01-02 01:02:43.950000-08:00 | -121.76 | 46.85 |
| 2 | 0.2 | 23 km ENE of Ashford, Washington | 2023-01-02 12:05:01.420000-08:00 | -121.75 | 46.86 |
| 3 | 0.52 | 15 km NNE of Ashford, Washington | 2023-01-02 12:45:14.220000-08:00 | -121.95 | 46.89 |
| 4 | 1.56 | 0 km WSW of Esperance, Washington | 2023-01-02 13:19:27.200000-08:00 | -122.36 | 47.79 |
| id | magnitude | location | timestamp | longitude | latitude |
|---|---|---|---|---|---|
| 1818 | 1.7 | 14 km W of Skokomish, Washington | 2024-06-29 08:55:50.030000-07:00 | -123.35 | 47.32 |
| 1819 | 1.06 | 7 km E of Lake McMurray, Washington | 2024-06-29 12:15:19.130000-07:00 | -122.13 | 48.31 |
| 1820 | 0.48 | 4 km E of Duvall, Washington | 2024-06-30 09:15:43.020000-07:00 | -121.93 | 47.75 |
| 1821 | 0.46 | 12 km NE of Ashford, Washington | 2024-06-30 10:05:15.410000-07:00 | -121.93 | 46.84 |
| 1822 | 0.72 | 6 km ENE of Oso, Washington | 2024-06-30 11:12:41.900000-07:00 | -121.84 | 48.28 |
head(n) and limit(n).collect() appear similar
in this example. But head() always returns the
earliest rows in a table, whereas limit() makes no
promises about the ordering of its results (unless you specify an
order_by() clause - more on this below).
Let’s also peek at the schema:
| table ‘fundamentals.earthquakes’ |
| Column Name | Type | Computed With |
|---|---|---|
| id | Required[Int] | |
| magnitude | Float | |
| location | String | |
| timestamp | Timestamp | |
| longitude | Float | |
| latitude | Float |
schema_overrides parameter of import_csv().
The following examples showcase some common data operations.
| id | magnitude | location | timestamp | longitude | latitude |
|---|---|---|---|---|---|
| 1002 | 4.3 | Port Townsend, WA | 2023-10-09 02:21:08.960000-07:00 | -122.73 | 48.04 |
| 1226 | 4.04 | 6 km W of Quilcene, Washington | 2023-12-24 15:14:04.220000-08:00 | -122.96 | 47.82 |
| 699 | 3.91 | 9 km NNE of Snoqualmie, Washington | 2023-08-08 10:17:23.910000-07:00 | -121.77 | 47.6 |
| 1281 | 3.48 | 7 km SSW of River Road, Washington | 2024-01-15 07:25:05.920000-08:00 | -123.17 | 48. |
| 1355 | 3.42 | 17 km WSW of Brinnon, Washington | 2024-02-16 16:30:18.830000-08:00 | -123.09 | 47.59 |
| id | magnitude | location | timestamp | longitude | latitude |
|---|---|---|---|---|---|
| 699 | 3.91 | 9 km NNE of Snoqualmie, Washington | 2023-08-08 10:17:23.910000-07:00 | -121.77 | 47.6 |
| 799 | 2.86 | 5 km E of Ashford, Washington | 2023-08-27 10:10:23.770000-07:00 | -121.96 | 46.77 |
| 710 | 2.84 | 8 km ENE of Fall City, Washington | 2023-08-08 11:51:12.750000-07:00 | -121.79 | 47.6 |
| 577 | 2.79 | 0 km NE of Maple Valley, Washington | 2023-07-04 15:52:54.430000-07:00 | -122.04 | 47.4 |
| 769 | 2.73 | 16 km NE of Ashford, Washington | 2023-08-22 23:44:12.250000-07:00 | -121.88 | 46.87 |
isin operator to select just those values
that appear within a particular list:
| id | magnitude | location | timestamp | longitude | latitude |
|---|---|---|---|---|---|
| 123 | 1.23 | 7 km SW of Rainier, Washington | 2023-02-17 00:28:25.460000-08:00 | -122.75 | 46.84 |
| 456 | 0.23 | Washington | 2023-05-23 08:49:02.450000-07:00 | -121.98 | 46.87 |
| 789 | 1.67 | Puget Sound region, Washington | 2023-08-26 04:04:11.200000-07:00 | -122.57 | 47.6 |
>= and isin, a Pixeltable
where clause can also contain more complex operations. For example,
the location column in our dataset is a string that contains a lot of
information, but in a relatively unstructured way. Suppose we wanted to
see all Earthquakes in the vicinity of Rainier, Washington; one way to
do this is with the contains() method:
| id | magnitude | location | timestamp | longitude | latitude |
|---|---|---|---|---|---|
| 40 | 1.22 | 11 km SSE of Rainier, Washington | 2023-01-19 21:52:29.910000-08:00 | -122.65 | 46.79 |
| 85 | 1.45 | 10 km SSE of Rainier, Washington | 2023-02-02 20:08:27.810000-08:00 | -122.65 | 46.79 |
| 123 | 1.23 | 7 km SW of Rainier, Washington | 2023-02-17 00:28:25.460000-08:00 | -122.75 | 46.84 |
| 467 | 1.09 | 10 km SSE of Rainier, Washington | 2023-05-26 19:39:44.120000-07:00 | -122.65 | 46.8 |
| 1399 | 1.08 | 5 km SW of Rainier, Washington | 2024-03-04 22:34:25.210000-08:00 | -122.74 | 46.85 |
| 1709 | 1.16 | 10 km S of Rainier, Washington | 2024-05-22 18:28:38.130000-07:00 | -122.68 | 46.79 |
| 1776 | 1.17 | 12 km S of Rainier, Washington | 2024-06-17 18:25:33.400000-07:00 | -122.66 | 46.77 |
max() and min():
| min | max |
|---|---|
| 0 | 1822 |
Extracting Data from Tables into Python/Pandas
Sometimes it’s handy to pull out data from a table into a Python object. We’ve actually already done this; the call tocollect() returns an
in-memory result set, which we can then dereference in various ways. For
example:
collect() without a preceding limit() returns the entire contents of
a query or table. Be careful! For very large tables, this could result
in out-of-memory errors. In this example, the 1823 rows in the table fit
comfortably into a dataframe.
Adding Columns
Like other database tables, Pixeltable tables aren’t fixed entities: they’re meant to evolve over time. Suppose we want to add a new column to hold user-specified comments about particular earthquake events. We can do this with theadd_column() method:
note is the column name, and pxt.String specifies the type of
the new column.
| table ‘fundamentals.earthquakes’ |
| Column Name | Type | Computed With |
|---|---|---|
| id | Required[Int] | |
| magnitude | Float | |
| location | String | |
| timestamp | Timestamp | |
| longitude | Float | |
| latitude | Float | |
| note | String | |
| contact_email | String |
Updating Rows in a Table
Table rows can be modified and deleted with the SQL-likeupdate() and
delete() commands.
| id | magnitude | note | contact_email |
|---|---|---|---|
| 120 | 1.17 | None | None |
| 121 | 1.87 | Still investigating. | [email protected] |
| 122 | 0.34 | None | None |
| 123 | 1.23 | Still investigating. | [email protected] |
| 124 | 0.13 | None | None |
update() can also accept an expression, rather than a constant value.
For example, suppose we wanted to shorten the location strings by
replacing every occurrence of Washington with WA. One way to do this
is with an update() clause, using a Pixeltable expression with the
replace() method.
| id | magnitude | location | timestamp | longitude | latitude | note | contact_email |
|---|---|---|---|---|---|---|---|
| 0 | 1.15 | 10 km NW of Belfair, WA | 2023-01-01 08:10:37.050000-08:00 | -122.93 | 47.51 | None | None |
| 1 | 0.29 | 23 km ENE of Ashford, WA | 2023-01-02 01:02:43.950000-08:00 | -121.76 | 46.85 | None | None |
| 2 | 0.2 | 23 km ENE of Ashford, WA | 2023-01-02 12:05:01.420000-08:00 | -121.75 | 46.86 | None | None |
| 3 | 0.52 | 15 km NNE of Ashford, WA | 2023-01-02 12:45:14.220000-08:00 | -121.95 | 46.89 | None | None |
| 4 | 1.56 | 0 km WSW of Esperance, WA | 2023-01-02 13:19:27.200000-08:00 | -122.36 | 47.79 | None | None |
update() clause takes a Python
dictionary, but its values can be either constants such as
'[email protected]', or more complex expressions such as
eq_t.location.replace('Washington', 'WA'). Also notice that if
update() appears without a where() clause, then every row in the
table will be updated, as in the preceding example.
Batch Updates
Thebatch_update() method provides an alternative way to update
multiple rows with different values. With a batch_update(), the
contents of each row are specified by individual dicts, rather than
according to a formula. Here’s a toy example that shows batch_update()
in action.
| id | magnitude | note | contact_email |
|---|---|---|---|
| 500 | 0.75 | This is an example note. | None |
| 501 | 0.23 | This is a different note. | None |
| 502 | 0.43 | A third note, unrelated to the others. | None |
| 503 | 0.31 | None | None |
| 504 | 0.35 | None | None |
Deleting Rows
To delete rows from a table, use thedelete() method.
where() clause when using delete()! If you
run delete() without a where() clause, the entire contents of the
table will be deleted.
Table Versioning
Every table in Pixeltable is versioned: some or all of its modification history is preserved. We’ve seen a reference to this already;pxt.ls()
will show the most recent version along with each table it lists.
| Name | Kind | Version | Base | |
|---|---|---|---|---|
| earthquakes | table | 8 | ||
| films | table | 2 |
| version | created_at | user | change_type | inserts | updates | deletes | errors | computed | schema_change | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 2025-10-10 18:18:17.357601+00:00 | None | data | 0 | 0 | 1236 | 0 | 0 | None |
| 1 | 7 | 2025-10-10 18:18:16.323425+00:00 | None | data | 0 | 0 | 587 | 0 | 0 | None |
| 2 | 6 | 2025-10-10 18:18:15.740511+00:00 | None | data | 0 | 3 | 0 | 0 | 3 | None |
| 3 | 5 | 2025-10-10 18:18:15.200479+00:00 | None | data | 0 | 1823 | 0 | 0 | 1823 | None |
| 4 | 4 | 2025-10-10 18:18:14.646669+00:00 | None | data | 0 | 2 | 0 | 0 | 4 | None |
| 5 | 3 | 2025-10-10 18:18:13.675782+00:00 | None | schema | 0 | 1823 | 0 | 0 | 1823 | Added: contact_email |
| 6 | 2 | 2025-10-10 18:18:13.336046+00:00 | None | schema | 0 | 1823 | 0 | 0 | 1823 | Added: note |
| 7 | 1 | 2025-10-10 18:18:08.559563+00:00 | None | data | 1823 | 0 | 0 | 0 | 1823 | None |
| 8 | 0 | 2025-10-10 18:18:08.511119+00:00 | None | schema | 0 | 0 | 0 | 0 | 0 | Initial Version |
revert() to undo the
most recent change to a table and roll back to the previous version.
Let’s try it out: we’ll use it to revert the successive delete() calls
that we just executed.
revert() cannot be undone!
Multimodal Data
In addition to the structured data we’ve been exploring so far in this tutorial, Pixeltable has native support for media types: images, video, audio, and unstructured documents such as pdfs. Media support is one of Pixeltable’s core capabilities, and we’ll have much more to say about it the upcoming Unstructured Data section of this tutorial. For now, we’ll just give one example to show how media data lives side-by-side with structured data in Pixeltable.| table ‘fundamentals.earthquakes’ |
| Column Name | Type | Computed With |
|---|---|---|
| id | Required[Int] | |
| magnitude | Float | |
| location | String | |
| timestamp | Timestamp | |
| longitude | Float | |
| latitude | Float | |
| note | String | |
| contact_email | String | |
| map_image | Image |
| id | magnitude | location | map_image |
|---|---|---|---|
| 1000 | -0.02 | 17 km SSE of Carbonado, WA | None |
| 1001 | 0.82 | 22 km ENE of Ashford, WA | None |
| 1002 | 4.3 | Port Townsend, WA | |
| 1003 | 1.04 | WA | None |
| 1004 | 0.79 | 24 km ENE of Ashford, WA | None |
Directory Hierarchies
So far we’ve only seen an example of a single directory with a table inside it, but one can also put directories inside other directories, in whatever fashion makes the most sense for a given application.| table 'fundamentals.subdir.subsubdir.my_table' |
| Column Name | Type | Computed With |
|---|---|---|
| my_col | String |
Deleting Columns, Tables, and Directories
drop_column(), drop_table(), and drop_dir() are used to delete
columns, tables, and directories, respectively.
| table ‘fundamentals.earthquakes’ |
| Column Name | Type | Computed With |
|---|---|---|
| id | Required[Int] | |
| magnitude | Float | |
| location | String | |
| timestamp | Timestamp | |
| longitude | Float | |
| latitude | Float | |
| note | String | |
| map_image | Image |