In this tutorial, we’ll explore Pixeltable’s flexible handling of RAG
operations on unstructured text. In a traditional AI workflow, such
operations might be implemented as a Python script that runs on a
periodic schedule or in response to certain events. In Pixeltable, as
with everything else, they are implemented as persistent table
operations that update incrementally as new data becomes available. In
our tutorial workflow, we’ll chunk Wikipedia articles in various ways
with a document splitter, then apply several kinds of embeddings to the
chunks.
Set Up the Table Structure
We start by installing the necessary dependencies, creating a Pixeltable
directory rag_ops_demo (if it doesn’t already exist), and setting up
the table structure for our new workflow.
%pip install -qU pixeltable sentence-transformers spacy tiktoken
import pixeltable as pxt
# Ensure a clean slate for the demo
pxt.drop_dir('rag_ops_demo', force=True)
# Create the Pixeltable workspace
pxt.create_dir('rag_ops_demo')
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `rag_ops_demo`.
<pixeltable.catalog.dir.Dir at 0x341466e50>
Creating Tables and Views
Now we’ll create the tables that represent our workflow, starting with a
table to hold references to source documents. The table contains a
single column source_doc whose elements have type pxt.Document,
representing a general document instance. In this tutorial, we’ll be
working with HTML documents, but Pixeltable supports a range of other
document types, such as Markdown and PDF.
docs = pxt.create_table(
'rag_ops_demo.docs',
{'source_doc': pxt.Document}
)
Created table `docs`.
If we take a peek at the docs table, we see its very simple structure.
Next we create a view to represent chunks of our HTML documents. A
Pixeltable view is a virtual table, which is dynamically derived from a
source table by applying a transformation and/or selecting a subset of
data. In this case, our view represents a one-to-many transformation
from source documents into individual sentences. This is achieved using
Pixeltable’s built-in DocumentSplitter class.
Note that the docs table is currently empty, so creating this view
doesn’t actually do anything yet: it simply defines an operation that
we want Pixeltable to execute when it sees new data.
from pixeltable.iterators.document import DocumentSplitter
sentences = pxt.create_view(
'rag_ops_demo.sentences', # Name of the view
docs, # Table from which the view is derived
iterator=DocumentSplitter.create(
document=docs.source_doc,
separators='sentence', # Chunk docs into sentences
metadata='title,heading,sourceline'
)
)
Created view `sentences` with 0 rows, 0 exceptions.
Let’s take a peek at the new sentences view.
We see that sentences inherits the source_doc column from docs,
together with some new fields:

- pos: The position in the source document where the sentence appears.
- text: The text of the sentence.
- title, heading, and sourceline: The metadata we requested when we set
  up the view.
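To make the one-to-many mapping concrete, here is a rough plain-Python sketch of what a sentence-splitting transformation does. This is a toy period-based splitter standing in for Pixeltable's actual implementation; split_into_sentences is a hypothetical helper, not part of the Pixeltable API.

```python
# Conceptual sketch: one source document maps to many chunk rows, each with
# a position and its text. (Toy splitter; not Pixeltable's implementation.)

def split_into_sentences(doc_text: str) -> list[dict]:
    """Map one document string to a list of (pos, text) chunk rows."""
    sentences = [s.strip() for s in doc_text.split('.') if s.strip()]
    return [{'pos': i, 'text': s} for i, s in enumerate(sentences)]

rows = split_into_sentences(
    "Chagall was born in 1887. He moved to Paris. He painted."
)
for row in rows:
    print(row)
```

Each input row (a document) fans out into many output rows (sentences); a Pixeltable view makes that mapping persistent and keeps it up to date.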
Data Ingestion
Ok, now it’s time to insert some data into our workflow. A document in
Pixeltable is just a URL; the following command inserts a single row
into the docs table with the source_doc field set to the specified
URL:
docs.insert([{'source_doc': 'https://en.wikipedia.org/wiki/Marc_Chagall'}])
Computing cells: 100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 14.50 cells/s]
Inserting rows into `docs`: 1 rows [00:00, 739.08 rows/s]
Computing cells: 100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 14.20 cells/s]
Inserting rows into `sentences`: 1460 rows [00:00, 3529.45 rows/s]
Inserted 1461 rows with 0 errors.
UpdateStatus(num_rows=1461, num_computed_values=2, num_excs=0, updated_cols=[], cols_with_excs=[])
We can see that two things happened. First, a single row was inserted
into docs, containing the URL representing our source document. Then,
the view sentences was incrementally updated by applying the
DocumentSplitter according to the definition of the view. This
illustrates an important principle in Pixeltable: by default, anytime
Pixeltable sees new data, the update is incrementally propagated to any
downstream views or computed columns.
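The incremental-update principle can be sketched in plain Python. This is a toy model, not Pixeltable's implementation; insert_docs and splitter are hypothetical names, and Pixeltable does all of this automatically and persistently.

```python
# Sketch of incremental propagation: when new rows land in the base table,
# only those rows are run through the view's transformation, and the results
# are appended to the view's existing rows.

docs_rows = []       # stands in for the docs table
sentences_rows = []  # stands in for the sentences view

def insert_docs(new_rows, splitter):
    docs_rows.extend(new_rows)
    for row in new_rows:                 # only the NEW rows are processed
        sentences_rows.extend(splitter(row))
```

The key point is that previously processed documents are never reprocessed; each insertion touches only the new data.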
We can see the effect of the insertion with the select command.
There’s a single row in docs:
docs.select(docs.source_doc.fileurl).show()
And here are the first 20 rows in sentences. The content of the
article is broken into individual sentences, as expected.
sentences.select(sentences.text, sentences.heading).show(20)
Experimenting with Chunking
Of course, chunking into sentences isn't the only way to split a
document. Perhaps we want to experiment with different chunking
methodologies to see which one performs best in a particular
application. Pixeltable makes this easy: just create several views of
the same source table. Here are a few examples. Notice that each new
view, as it is created, is immediately populated from the data already
in docs.
chunks = pxt.create_view(
'rag_ops_demo.chunks', docs,
iterator=DocumentSplitter.create(
document=docs.source_doc,
separators='paragraph,token_limit',
limit=2048,
overlap=0,
metadata='title,heading,sourceline'
)
)
Inserting rows into `chunks`: 205 rows [00:00, 16715.90 rows/s]
Created view `chunks` with 205 rows, 0 exceptions.
short_chunks = pxt.create_view(
'rag_ops_demo.short_chunks', docs,
iterator=DocumentSplitter.create(
document=docs.source_doc,
separators='paragraph,token_limit',
limit=72,
overlap=0,
metadata='title,heading,sourceline'
)
)
Inserting rows into `short_chunks`: 531 rows [00:00, 23858.59 rows/s]
Created view `short_chunks` with 531 rows, 0 exceptions.
short_char_chunks = pxt.create_view(
'rag_ops_demo.short_char_chunks', docs,
iterator=DocumentSplitter.create(
document=docs.source_doc,
separators='paragraph,char_limit',
limit=72,
overlap=0,
metadata='title,heading,sourceline'
)
)
Inserting rows into `short_char_chunks`: 1764 rows [00:00, 17427.53 rows/s]
Created view `short_char_chunks` with 1764 rows, 0 exceptions.
chunks.select(chunks.text, chunks.heading).show(20)
short_chunks.select(short_chunks.text, short_chunks.heading).show(20)
short_char_chunks.select(short_char_chunks.text, short_char_chunks.heading).show(20)
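As a rough illustration of what a separators='paragraph,token_limit' configuration does, here is a toy chunker in plain Python: it splits on blank-line paragraph boundaries and then enforces a size limit within each paragraph. For simplicity it counts whitespace-separated words rather than tiktoken tokens, and it uses overlap=0 as in the views above; chunk_paragraphs is a hypothetical helper, not Pixeltable code.

```python
# Toy two-stage chunker: paragraph boundaries first, then a size limit.
# "Tokens" here are whitespace-split words, not real tiktoken tokens.

def chunk_paragraphs(text: str, limit: int) -> list[str]:
    chunks = []
    for para in text.split('\n\n'):            # stage 1: split on paragraphs
        words = para.split()
        for i in range(0, len(words), limit):  # stage 2: enforce the limit
            chunks.append(' '.join(words[i:i + limit]))
    return chunks
```

A small limit (like the 72 used above) yields many short chunks; a large one (like 2048) mostly preserves whole paragraphs, which is exactly the difference visible between chunks and short_chunks.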
Now let’s add a few more documents to our workflow. Notice how all of
the downstream views are updated incrementally, processing just the new
documents as they are inserted.
urls = [
'https://en.wikipedia.org/wiki/Pierre-Auguste_Renoir',
'https://en.wikipedia.org/wiki/Henri_Matisse',
'https://en.wikipedia.org/wiki/Marcel_Duchamp'
]
docs.insert({'source_doc': url} for url in urls)
Computing cells: 100%|████████████████████████████████████████████| 6/6 [00:00<00:00, 53.63 cells/s]
Inserting rows into `docs`: 3 rows [00:00, 2365.21 rows/s]
Computing cells: 100%|████████████████████████████████████████████| 6/6 [00:00<00:00, 52.39 cells/s]
Inserting rows into `sentences`: 2106 rows [00:02, 783.63 rows/s]
Inserting rows into `chunks`: 276 rows [00:00, 15888.61 rows/s]
Inserting rows into `short_chunks`: 812 rows [00:00, 22184.42 rows/s]
Inserting rows into `short_char_chunks`: 2638 rows [00:00, 13227.11 rows/s]
Inserted 5835 rows with 0 errors.
UpdateStatus(num_rows=5835, num_computed_values=6, num_excs=0, updated_cols=[], cols_with_excs=[])
Further Experiments
This is a good time to mention another important guiding principle of
Pixeltable. The preceding examples all used the built-in
DocumentSplitter class with various configurations. That’s probably
fine as a first cut or to prototype an application quickly, and it might
be sufficient for some applications. But other applications might want
to do more sophisticated kinds of chunking, implementing their own
specialized logic or leveraging third-party tools. Pixeltable imposes no
constraints on the AI or RAG operations a workflow uses: the iterator
interface is highly general, and it’s easy to implement new operations
or adapt existing code or third-party tools into the Pixeltable
workflow.
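As a sketch of what such a custom operation might look like, here is a hypothetical heading-based chunker written as a plain Python generator that yields one row dict per chunk. Logic of this shape, whether hand-written or wrapping a third-party splitter, is what a custom Pixeltable iterator would adapt into a view.

```python
# Hypothetical custom chunking logic: group lines under their most recent
# markdown-style heading, yielding one dict per chunk. The splitting strategy
# is arbitrary; the point is the generator-of-row-dicts shape.

def chunks_by_heading(lines):
    heading, buf = None, []
    for line in lines:
        if line.startswith('#'):
            if buf:
                yield {'heading': heading, 'text': ' '.join(buf)}
            heading, buf = line.lstrip('# '), []
        else:
            buf.append(line)
    if buf:
        yield {'heading': heading, 'text': ' '.join(buf)}
```

Any function with this shape, one input document in, a stream of chunk rows out, slots naturally into the kind of one-to-many view we built above.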
Computing Embeddings
Next, let’s look at how embedding indices can be added seamlessly to
existing Pixeltable workflows. To compute our embeddings, we’ll use the
Hugging Face sentence-transformers package, running it over the chunks
view, which splits our documents into paragraph-level chunks. Pixeltable
has a built-in sentence_transformer adapter, and all we have to do is
add a new column that leverages it. Pixeltable takes care of the rest,
applying the new column to all existing data in the view.
from pixeltable.functions.huggingface import sentence_transformer
chunks.add_computed_column(minilm_embed=sentence_transformer(
chunks.text,
model_id='paraphrase-MiniLM-L6-v2'
))
Computing cells: 100%|███████████████████████████████████████| 481/481 [00:02<00:00, 222.59 cells/s]
Added 481 column values with 0 errors.
UpdateStatus(num_rows=481, num_computed_values=481, num_excs=0, updated_cols=[], cols_with_excs=[])
The new column is a computed column: it is defined as a function on
top of existing data and updated incrementally as new data are added to
the workflow. Let’s have a look at how the new column affected the
chunks view.
chunks.select(chunks.text, chunks.heading, chunks.minilm_embed).head()
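Once an embedding column exists, it can power similarity-based retrieval. Here is a minimal sketch of cosine similarity between embedding vectors, using NumPy with toy low-dimensional vectors standing in for the 384-dimensional MiniLM embeddings; cosine_sim is a hypothetical helper, not a Pixeltable function.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional stand-ins for real MiniLM vectors:
query = np.array([1.0, 0.0, 1.0, 0.0])
chunk = np.array([0.9, 0.1, 0.8, 0.0])
print(cosine_sim(query, chunk))
```

In a real RAG pipeline, the query would be embedded with the same model and compared against the minilm_embed column to rank chunks for retrieval.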
Similarly, we might want to add a CLIP embedding to our workflow; once
again, it’s just another computed column:
from pixeltable.functions.huggingface import clip
chunks.add_computed_column(clip_embed=clip(
chunks.text, model_id='openai/clip-vit-base-patch32'
))
Computing cells: 100%|████████████████████████████████████████| 481/481 [00:05<00:00, 93.49 cells/s]
Added 481 column values with 0 errors.
UpdateStatus(num_rows=481, num_computed_values=481, num_excs=0, updated_cols=[], cols_with_excs=[])
chunks.select(chunks.text, chunks.heading, chunks.clip_embed).head()