In this tutorial, we will:
- Extract audio data from video files;
- Transcribe the audio using OpenAI Whisper;
- Build a semantic index of the transcriptions, using the Hugging Face sentence_transformers models;
- Search this index.
Create a Table for Video Data
Let’s first install the Python packages we’ll need for the demo. We’re going to use the popular Whisper library, running locally. Later in the demo, we’ll see how to use the OpenAI API endpoints as an alternative.

Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
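The setup that produces a connection message like the one above is roughly the following; a minimal sketch, assuming the usual PyPI package names for Whisper (openai-whisper) and sentence-transformers:

```python
%pip install -qU pixeltable openai-whisper sentence-transformers

# Importing and using Pixeltable initializes (or connects to) its local database,
# which prints the connection message shown above
import pixeltable as pxt
```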
Created directory ‘transcription_demo’.
Created table ‘video_table’.
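A sketch of the directory and table creation that produces the messages above; the exact spelling of the video column type (pxt.Video here) may vary slightly across Pixeltable versions:

```python
# Start from a clean slate, then create a directory (namespace) for the demo
pxt.drop_dir('transcription_demo', force=True)
pxt.create_dir('transcription_demo')

# Create a table with a single video column
video_table = pxt.create_table('transcription_demo.video_table', {'video': pxt.Video})
```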
Next, let’s insert some video files into the table. In this demo, we’ll
be using one-minute excerpts from a Lex Fridman podcast. We’ll begin by
inserting two of them into our new table. Our videos are given as
https links, but Pixeltable also accepts local files and S3 URLs as
input.
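The insert itself might look like the following sketch; the URLs shown here are placeholders, not the actual podcast excerpts used in the demo:

```python
video_urls = [
    'https://example.com/lex-fridman-excerpt-1.mp4',  # placeholder URL
    'https://example.com/lex-fridman-excerpt-2.mp4',  # placeholder URL
]
video_table.insert([{'video': url} for url in video_urls])
```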
Inserted 2 rows with 0 errors in 2.04 s (0.98 rows/s)
Now we’ll add another column to hold extracted audio from our videos.
The new column is an example of a computed column: it’s updated
automatically based on the contents of another column (or columns). In
this case, the value of the audio column is defined to be the audio
track extracted from whatever’s in the video column.
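As a sketch, using Pixeltable’s built-in video-to-audio extraction function (the audio format argument is an assumption):

```python
from pixeltable.functions.video import extract_audio

# 'audio' becomes a computed column: it is populated automatically
# from the 'video' column for every existing and future row
video_table.add_computed_column(audio=extract_audio(video_table.video, format='mp3'))
```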
Added 2 column values with 0 errors in 0.91 s (2.19 rows/s)
If we look at the structure of the video table, we see that the new
column is a computed column.
Added 2 column values with 0 errors in 0.02 s (95.47 rows/s)
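Something like the following shows the schema (assuming your Pixeltable version exposes a describe() method; displaying the table object in a notebook shows similar information):

```python
# Lists the table's columns, including which ones are computed
video_table.describe()
```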
Create Transcriptions
Now we’ll add a step to create transcriptions of our videos. As mentioned above, we’re going to use the Whisper library for this, running locally. Pixeltable has a built-in function, whisper.transcribe, that serves as an adapter for the Whisper
library’s transcription capability. All we have to do is add a computed
column that calls this function:
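A sketch of that computed column; the model name ('base.en') is an assumption, and any Whisper checkpoint should work:

```python
from pixeltable.functions import whisper

# Each row's audio track is transcribed automatically when the column is added,
# and for every row inserted later
video_table.add_computed_column(
    transcription=whisper.transcribe(audio=video_table.audio, model='base.en')
)
```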
Added 2 column values with 0 errors in 4.63 s (0.43 rows/s)
In order to index the transcriptions, we’ll first need to split them
into sentences. We can do this using Pixeltable’s built-in
StringSplitter iterator.
StringSplitter creates a new view, with the audio transcriptions
broken into individual, one-sentence chunks.
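A sketch of the view creation, splitting each transcription’s text into one row per sentence:

```python
from pixeltable.iterators import StringSplitter

sentences_view = pxt.create_view(
    'transcription_demo.sentences_view',
    video_table,
    iterator=StringSplitter.create(
        text=video_table.transcription.text,  # the field to split
        separators='sentence'                 # one output row per sentence
    )
)
```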
Add an Embedding Index
Next, let’s use the Hugging Face sentence_transformers library to
create an embedding index of our sentences, attaching it to the text
column of our sentences_view.
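A sketch of adding the index; the model id shown is only an example, and any sentence_transformers model can be substituted:

```python
from pixeltable.functions.huggingface import sentence_transformer

# Embed the 'text' column of the view; queries against the index
# will embed their input string the same way
sentences_view.add_embedding_index(
    'text',
    string_embed=sentence_transformer.using(model_id='intfloat/e5-large-v2')
)
```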
(On first use, the sentence_transformers model files are downloaded from the Hugging Face Hub.)
We can do a simple lookup to test our new index. The following snippet
returns the results of a nearest-neighbor search on the input “What is
happiness?”
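The lookup can be expressed as a similarity expression over the indexed column, roughly as follows:

```python
query_text = 'What is happiness?'
sim = sentences_view.text.similarity(query_text)

# Return the closest sentences, highest similarity first
sentences_view.order_by(sim, asc=False).select(
    text=sentences_view.text,
    similarity=sim
).limit(5).collect()
```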
Incremental Updates
Incremental updates are a key feature of Pixeltable. Whenever a new video is added to the original table, all of its downstream computed columns are updated automatically. Let’s demonstrate this by adding a third video to the table and seeing how the updates propagate through to the index.

Inserted 10 rows with 0 errors in 4.20 s (2.38 rows/s)
10 rows inserted.
Note that the reported row count includes the sentence rows that were automatically added to the downstream view, sentences_view.
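The statement that produced the output above is just another call to insert(); a sketch with a placeholder URL:

```python
# Inserting a new video automatically recomputes the audio, transcription,
# sentence view, and embedding index entries for that row
video_table.insert([{'video': 'https://example.com/lex-fridman-excerpt-3.mp4'}])  # placeholder URL
```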
Using the OpenAI API
This concludes our tutorial using the locally installed Whisper library. Sometimes, it may be preferable to use the OpenAI API rather than a locally installed library. In this section we’ll show how this can be done in Pixeltable, simply by using a different function to construct our computed columns. Since this section relies on calling out to the OpenAI API, you’ll need to have an API key, which you can enter below.

Added 3 column values with 0 errors in 6.49 s (0.46 rows/s)
3 rows updated.
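A sketch of the API-based variant; the column name transcription_openai and the model name are assumptions, and the API key is read from the environment:

```python
import getpass, os

# Pixeltable's OpenAI functions pick up the key from the environment
if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API key: ')

from pixeltable.functions import openai

video_table.add_computed_column(
    transcription_openai=openai.transcriptions(audio=video_table.audio, model='whisper-1')
)
```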
Now let’s compare the results from the local model and the API
side-by-side.
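One way to do that is a single select over both computed columns (again assuming the API results live in transcription_openai):

```python
video_table.select(
    local=video_table.transcription.text,
    openai=video_table.transcription_openai.text
).collect()
```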
Notice that we’ve been selecting video_table.transcription.text in the preceding
queries, which pulls out just the text field of the transcription
results. The actual results are a sizable JSON structure that includes a
lot of metadata. To see the full output, we can select
video_table.transcription instead, to get the full JSON struct. Here’s
what it looks like (we’ll select just one row, since it’s a lot of
output):
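For example, a sketch of a one-row query over the full transcription column:

```python
# Show the complete Whisper output (text, segments, language, and other metadata) for one row
video_table.select(video_table.transcription).limit(1).collect()
```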