Document Indexing and RAG
A hands-on guide to building a question-answering system (chatbot) on your documents.
In this tutorial, we'll demonstrate how RAG operations can be implemented in Pixeltable. In particular, we'll develop a RAG application that summarizes a collection of PDF documents and uses ChatGPT to answer questions about them.
In a traditional RAG workflow, such operations might be implemented as a Python script that runs on a periodic schedule or in response to certain events. In Pixeltable, they are implemented as persistent tables that are updated automatically and incrementally as new data becomes available.
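To make the idea of incremental updates concrete, here is a minimal plain-Python sketch. It is not Pixeltable's API or implementation: the `IncrementalTable` class and its `compute_calls` counter are hypothetical names invented for illustration. It only shows the general principle that inserting new rows triggers computation for those rows alone, rather than a full rerun over all data.

```python
from typing import Callable


class IncrementalTable:
    """Toy table with one input column and one derived column.

    Hypothetical illustration only; this is not how Pixeltable is implemented.
    """

    def __init__(self, compute: Callable[[str], int]) -> None:
        self.compute = compute
        self.rows: list[dict] = []
        self.compute_calls = 0  # how many times the derived value was computed

    def insert(self, values: list[str]) -> None:
        # Only the newly inserted rows are processed; existing rows are untouched.
        for value in values:
            self.compute_calls += 1
            self.rows.append({'text': value, 'length': self.compute(value)})


table = IncrementalTable(compute=len)
table.insert(['first document', 'second document'])
table.insert(['third document'])  # computes one new value, not three
print(table.compute_calls)
```

A periodic batch script would typically recompute the derived values for every row on each run; the sketch above does one computation per row over the table's lifetime.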
If you are running this tutorial in Colab:
To make the tutorial run a bit snappier, let's switch to a GPU-equipped instance for this Colab session. To do that, click on the Runtime -> Change runtime type menu item at the top, then select the GPU radio button and click on Save.
We first set up our OpenAI API key:
import os
import getpass
if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
Next, we install the packages needed for this tutorial and set up our environment.
%pip install -q pixeltable sentence-transformers tiktoken openai openpyxl
import numpy as np
import pixeltable as pxt
pxt.drop_dir('rag_demo', force=True) # Ensure a clean slate for the demo
pxt.create_dir('rag_demo')
Connected to Pixeltable database at: postgresql://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `rag_demo`.
<pixeltable.catalog.dir.Dir at 0x32381b520>
Next we'll create a table containing the sample questions we want to answer. The questions are stored in an Excel spreadsheet, along with a set of "ground truth" answers to help evaluate our model pipeline. We can use Pixeltable's handy import_excel() utility to load them. Note that we can pass the URL of the spreadsheet directly to the import utility.
base = 'https://github.com/pixeltable/pixeltable/raw/release/docs/source/data/rag-demo/'
qa_url = base + 'Q-A-Rag.xlsx'
queries_t = pxt.io.import_excel('rag_demo.queries', qa_url)
Created table `queries`.
Inserting rows into `queries`: 8 rows [00:00, 4485.29 rows/s]
Inserted 8 rows with 0 errors.
queries_t.head()
S__No_ | Question | correct_answer |
---|---|---|
1 | What is roughly the current mortage rate? | 0.07 |
2 | What is the current dividend yield for Alphabet Inc. (\$GOOGL)? | 0.0046 |
3 | What is the market capitalization of Alphabet? | \$2182.8 Billion |
4 | What are the latest financial metrics for Accenture PLC? | missed consensus forecasts and strong total bookings rising by 22% annually |
5 | What is the overall latest rating for Amazon.com from analysts? | SELL |
6 | What is the operating cash flow of Amazon in Q1 2024? | 18,989 Million |
7 | What is the expected EPS for Nvidia in Q1 2026? | 0.73 EPS |
8 | What are the main reasons to buy Nvidia? | Datacenter, GPUs Demands, Self-driving, and cash-flow |
Outline¶
There are two major parts to our RAG application:
- Document Indexing: Load the documents, split them into chunks, and index them using a vector embedding.
- Querying: For each question on our list, do a top-k lookup for the most relevant chunks, use them to construct a ChatGPT prompt, and send the enriched prompt to an LLM.
We'll implement both parts in Pixeltable.
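Before moving to the Pixeltable version, the two parts can be sketched in plain Python. This is a toy under stated assumptions, not the tutorial's pipeline: the `embed` function is a bag-of-words counter standing in for a real embedding model, the sample documents are made up, and all function names (`chunk`, `embed`, `cosine`, `top_k`, `build_prompt`) are hypothetical. No LLM call is made; the sketch stops at the enriched prompt.

```python
import math
from collections import Counter


def chunk(text: str, size: int = 20) -> list[str]:
    """Split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(w.strip('.,?!').lower() for w in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, passages: list[str]) -> str:
    """Combine retrieved passages and the question into one prompt string."""
    context = '\n'.join(passages)
    return f'PASSAGES:\n{context}\n\nQUESTION:\n{question}'


# Part 1: document indexing (made-up sample documents)
documents = [
    'Tables are persistent containers that store documents.',
    'A vector embedding maps each chunk of text to a list of numbers.',
    'Bananas are yellow and grow in warm climates.',
]
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]

# Part 2: querying
question = 'What does a vector embedding do?'
passages = top_k(question, index, k=1)
prompt = build_prompt(question, passages)
print(prompt)
```

In a real system the word-count vectors would be replaced by dense embeddings from a model, and the final prompt would be sent to an LLM.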
1. Document Indexing¶
All data in Pixeltable, including documents, resides in tables.
Tables are persistent containers that can serve as the store of record for your data. Since we are starting from scratch, we will start with an empty table rag_demo.documents with a single column, document.