Set Up the Table Structure
We start by installing the necessary dependencies, creating a Pixeltable directory rag_ops_demo (if it doesn’t already exist), and setting up
the table structure for our new workflow.
Creating Tables and Views
Now we’ll create the tables that represent our workflow, starting with a table to hold references to source documents. The table contains a single column, source_doc, whose elements have type pxt.Document,
representing a general document instance. In this tutorial, we’ll be
working with PDF documents, but Pixeltable supports a range of other
document types, such as Markdown and HTML.
Created table ‘docs’.
If we take a peek at the docs table, we see its very simple structure.
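Next, we create a view called sentences that splits each document into individual sentences, using the built-in document_splitter class.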
Note that the docs table is currently empty, so creating this view
doesn’t actually do anything yet: it simply defines an operation that
we want Pixeltable to execute when it sees new data.
Now let’s take a look at the sentences view.
sentences inherits the source_doc column from docs,
together with some new fields:
- pos: The position in the source document where the sentence appears.
- text: The text of the sentence.
- title, heading, and sourceline: The metadata we requested when we set up the view.
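To make the pos and text fields concrete, here is a minimal plain-Python sketch of sentence splitting. It is not Pixeltable's implementation: the function name split_into_sentences, the regular expression, and the zero-based numbering of pos are all assumptions made for illustration.

```python
import re


def split_into_sentences(doc_text: str) -> list[dict]:
    """Return one row per sentence, tagged with its position in the document."""
    # Naive rule (an assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace. Real splitters handle many more cases.
    parts = re.split(r'(?<=[.!?])\s+', doc_text.strip())
    return [{'pos': i, 'text': s} for i, s in enumerate(parts) if s]


rows = split_into_sentences('Pixeltable stores documents. Views split them! Does it work?')
for row in rows:
    print(row)
```

Each returned dictionary plays the role of one row in a view like sentences; the metadata fields (title, heading, sourceline) are omitted because they depend on the document format.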
Data Ingestion
Ok, now it’s time to insert some data into our workflow. A document in Pixeltable is just a URL; the following command inserts a single row into the docs table with the source_doc field set to the specified
URL:
Inserting rows into `docs`: 1 rows [00:00, 292.76 rows/s]
Inserting rows into `sentences`: 217 rows [00:00, 42910.00 rows/s]
Inserted 218 rows with 0 errors.
218 rows inserted, 2 values computed.
We can see that two things happened. First, a single row was inserted
into docs, containing the URL representing our source PDF. Then, the
view sentences was incrementally updated by applying the
document_splitter according to the definition of the view. This
illustrates an important principle in Pixeltable: by default, anytime
Pixeltable sees new data, the update is incrementally propagated to any
downstream views or computed columns.
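The propagation principle can be sketched in plain Python. This is only a toy model of the idea, not how Pixeltable works internally: ToyTable, ToyView, and the word-splitting expansion are hypothetical names invented for this example.

```python
from typing import Callable


class ToyTable:
    """A list of rows plus derived views that are refreshed on insert."""

    def __init__(self) -> None:
        self.rows: list[dict] = []
        self.views: list['ToyView'] = []

    def insert(self, new_rows: list[dict]) -> int:
        self.rows.extend(new_rows)
        # Propagate only the new rows; existing view rows are left untouched.
        derived = sum(view.update(new_rows) for view in self.views)
        return len(new_rows) + derived


class ToyView:
    """Expands each base row into zero or more derived rows."""

    def __init__(self, base: ToyTable, expand: Callable[[dict], list[dict]]) -> None:
        self.rows: list[dict] = []
        self.expand = expand
        base.views.append(self)
        self.update(base.rows)  # populate from data already in the base table

    def update(self, new_rows: list[dict]) -> int:
        added = [out for row in new_rows for out in self.expand(row)]
        self.rows.extend(added)
        return len(added)


docs = ToyTable()
words = ToyView(docs, lambda row: [{'word': w} for w in row['text'].split()])
print(len(words.rows))  # the view is defined, but there is no data yet
total = docs.insert([{'text': 'one two three'}])
print(total, len(words.rows))
```

The returned total counts the base row plus every derived row, which mirrors the log above: 1 row in docs plus 217 rows in sentences gives the reported 218.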
We can see the effect of the insertion with the select command.
There’s a single row in docs:
Next, we look at sentences. The content of the PDF is
broken into individual sentences, as expected.
Experimenting with Chunking
Of course, chunking into sentences isn’t the only way to split a document. Perhaps we want to experiment with different chunking methodologies, in order to see which one performs best in a particular application. Pixeltable makes it easy to do this, by creating several views of the same source table. Here are a few examples. Notice that as each new view is created, it is initially populated from the data already in docs.
Inserting rows into `chunks`: 217 rows [00:00, 47827.85 rows/s]
Inserting rows into `short_chunks`: 219 rows [00:00, 49104.70 rows/s]
Inserting rows into `short_char_chunks`: 459 rows [00:00, 63241.10 rows/s]
Inserting rows into `docs`: 3 rows [00:00, 1969.77 rows/s]
Inserting rows into `chunks`: 742 rows [00:00, 61926.41 rows/s]
Inserting rows into `short_chunks`: 747 rows [00:00, 67743.68 rows/s]
Inserting rows into `sentences`: 742 rows [00:00, 67949.90 rows/s]
Inserting rows into `short_char_chunks`: 1165 rows [00:00, 3603.41 rows/s]
Inserted 3399 rows with 0 errors.
3399 rows inserted, 6 values computed.
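To see why different chunking settings yield different row counts for the same documents, consider this plain-Python sketch. It greedily packs sentences into chunks under a size limit, measured either in words or in characters. The pack function, the limits, and the sample sentences are assumptions for illustration; the settings actually used for chunks, short_chunks, and short_char_chunks are not reproduced here.

```python
def pack(sentences: list[str], limit: int, size=len) -> list[str]:
    """Greedily pack sentences into chunks whose size stays within `limit`.

    A single sentence that exceeds the limit becomes a chunk of its own.
    """
    chunks: list[str] = []
    current = ''
    for sentence in sentences:
        candidate = f'{current} {sentence}'.strip()
        if current and size(candidate) > limit:
            chunks.append(current)  # close the current chunk
            current = sentence      # start a new one
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def word_count(text: str) -> int:
    return len(text.split())


sentences = ['Tables hold data.', 'Views derive rows.', 'Inserts propagate.']
by_words = pack(sentences, limit=6, size=word_count)  # limit measured in words
by_chars = pack(sentences, limit=20)                  # limit measured in characters
print(len(by_words), len(by_chars))
```

The same three sentences produce a different number of chunks under each limit, which is the kind of variation visible in the per-view row counts in the logs above.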
Further Experiments
This is a good time to mention another important guiding principle of Pixeltable. The preceding examples all used the built-in document_splitter class with various configurations. That’s probably
fine as a first cut or to prototype an application quickly, and it might
be sufficient for some applications. But other applications might want
to do more sophisticated kinds of chunking, implementing their own
specialized logic or leveraging third-party tools. Pixeltable imposes no
constraints on the AI or RAG operations a workflow uses: the iterator
interface is highly general, and it’s easy to implement new operations
or adapt existing code or third-party tools into the Pixeltable
workflow.
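As a rough illustration of what custom chunking logic might look like, here is a plain-Python iterator that emits one row per paragraph. This is a sketch of the general pattern only: ParagraphChunker is a hypothetical class, and Pixeltable's actual iterator interface likely has its own requirements (such as a base class and schema declarations) that are not shown here.

```python
from collections.abc import Iterator


class ParagraphChunker:
    """Yields one {'pos', 'text'} row per blank-line-separated paragraph."""

    def __init__(self, text: str) -> None:
        # Split on blank lines and drop empty fragments.
        self._paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
        self._pos = 0

    def __iter__(self) -> Iterator[dict]:
        return self

    def __next__(self) -> dict:
        if self._pos >= len(self._paragraphs):
            raise StopIteration
        row = {'pos': self._pos, 'text': self._paragraphs[self._pos]}
        self._pos += 1
        return row


sample = 'First paragraph.\n\nSecond paragraph,\nstill second.\n\n\n\nThird.'
for row in ParagraphChunker(sample):
    print(row)
```

The same shape (take a document, yield rows) could wrap a third-party chunking tool instead of the blank-line rule used here.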
Computing Embeddings
Next, let’s look at how embedding indices can be added seamlessly to existing Pixeltable workflows. To compute our embeddings, we’ll use the Hugging Face sentence_transformer package, running it over the chunks
view that broke our documents up into sentence-based chunks. Pixeltable
has a built-in sentence_transformer adapter, and all we have to do is
add a new column that leverages it. Pixeltable takes care of the rest,
applying the new column to all existing data in the view.
Added 959 column values with 0 errors.
959 rows updated, 959 values computed.
The new column is a computed column: it is defined as a function on
top of existing data and updated incrementally as new data are added to
the workflow. Let’s have a look at how the new column affected the
chunks view.
Added 959 column values with 0 errors.
959 rows updated, 959 values computed.
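The behavior of a computed column (backfilled over existing rows, then applied to each new row) can be sketched in plain Python. Everything here is a stand-in: ToyChunks and add_derived_column are hypothetical names, and toy_embed is a hash-based placeholder rather than a real sentence-transformer model.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Placeholder embedding: hash each word into a bucket, then L2-normalize."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode('utf-8')).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


class ToyChunks:
    """Rows plus derived columns defined as functions of each row."""

    def __init__(self, rows: list[dict]) -> None:
        self.rows = rows
        self.derived: dict = {}

    def add_derived_column(self, name: str, fn) -> int:
        self.derived[name] = fn
        for row in self.rows:  # backfill all existing rows
            row[name] = fn(row)
        return len(self.rows)

    def insert(self, row: dict) -> None:
        for name, fn in self.derived.items():  # compute for the new row only
            row[name] = fn(row)
        self.rows.append(row)


chunks = ToyChunks([{'text': 'tables hold data'}, {'text': 'views derive rows'}])
filled = chunks.add_derived_column('embedding', lambda row: toy_embed(row['text']))
chunks.insert({'text': 'inserts propagate'})
print(filled, len(chunks.rows), all('embedding' in r for r in chunks.rows))
```

In this model, adding the column computes one value per existing row, which corresponds to the "959 values computed" reported for the 959 rows of chunks.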