Audio and Video Transcript Indexing
Extract, Transcribe, Index: Streamline Video Analysis with Pixeltable's AI Toolkit
Transcribing and Indexing Audio and Video in Pixeltable
In this tutorial, we'll build an end-to-end workflow for creating and indexing audio transcriptions of video data. We'll demonstrate how Pixeltable can be used to:
- Extract audio data from video files;
- Transcribe the audio using OpenAI Whisper;
- Build a semantic index of the transcriptions, using the Huggingface sentence_transformers models;
- Search this index.
The tutorial assumes you're already somewhat familiar with Pixeltable. If this is your first time using Pixeltable, the Pixeltable Basics tutorial is a great place to start.
If you are running this tutorial in Colab:
In order to make the tutorial run a bit snappier, let's switch to a GPU-equipped instance for this Colab session. To do that, click on the Runtime -> Change runtime type menu item at the top, then select the GPU radio button and click on Save.
Create a Table for Video Data
Let's first install the Python packages we'll need for the demo. We're going to use the popular Whisper library, running locally. Later in the demo, we'll see how to use the OpenAI API endpoints as an alternative.
%pip install -q pixeltable openai openai-whisper sentence-transformers spacy
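If you switched to a GPU runtime, you can optionally confirm that the GPU is visible before proceeding. This quick check assumes PyTorch is available, which the Whisper install above pulls in as a dependency:
# Optional: check whether a CUDA-capable GPU is visible to this runtime.
import torch
print('GPU available:', torch.cuda.is_available())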
Now we create a Pixeltable table to hold our videos.
import numpy as np
import pixeltable as pxt
pxt.drop_dir('transcription_demo', force=True) # Ensure a clean slate for the demo
pxt.create_dir('transcription_demo')
# Create a table to store our videos and workflow
video_table = pxt.create_table(
    'transcription_demo.video_table',
    {'video': pxt.Video}
)
video_table
Connected to Pixeltable database at:
postgresql://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `transcription_demo`.
Created table `video_table`.
Column Name | Type | Computed With |
---|---|---|
video | video | |
Next let's insert some video files into the table. In this demo, we'll be using one-minute excerpts from a Lex Fridman podcast; we'll begin by inserting two of them into our new table. The videos here are given as https links, but Pixeltable also accepts local files and S3 URLs as input.
videos = [
    'https://github.com/pixeltable/pixeltable/raw/release/docs/source/data/audio-transcription-demo/'
    f'Lex-Fridman-Podcast-430-Excerpt-{n}.mp4'
    for n in range(3)
]
video_table.insert({'video': video} for video in videos[:2])
video_table.show()
Inserting rows into `video_table`: 2 rows [00:00, 1928.42 rows/s]
Inserted 2 rows with 0 errors.
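As an aside, the same insert call also works with local file paths and S3 URIs. The sources below are hypothetical placeholders, shown purely for illustration (don't run this cell as-is):
# The paths below are placeholders -- substitute your own local file or S3 object.
video_table.insert([
    {'video': '/path/to/local/video.mp4'},
    {'video': 's3://my-bucket/videos/clip.mp4'},
])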
Now we'll add another column to hold extracted audio from our videos. The new column is an example of a computed column: it's updated automatically based on the contents of another column (or columns). In this case, the value of the audio column is defined to be the audio track extracted from whatever's in the video column.
from pixeltable.functions.video import extract_audio
video_table['audio'] = extract_audio(video_table.video, format='mp3')
video_table.show()
Computing cells: 100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 2.16 cells/s]
Added 2 column values with 0 errors.
If we look at the structure of the video table, we see that the new column is a computed column.
video_table
Column Name | Type | Computed With |
---|---|---|
video | video | |
audio | audio | extract_audio(video, format='mp3') |
We can also add another computed column to extract metadata from the audio streams.
from pixeltable.functions.audio import get_metadata
video_table['metadata'] = get_metadata(video_table.audio)
video_table.show()
Computing cells: 100%|███████████████████████████████████████████| 2/2 [00:00<00:00, 199.10 cells/s]
Added 2 column values with 0 errors.
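The metadata column holds a JSON structure, so individual fields can be pulled out with path expressions, just as we'll do with the transcription output later on. Here's a quick sketch; selecting the whole column always works, but the nested size field is an assumption about the metadata layout and may vary with the media file:
# Inspect the extracted audio metadata. `metadata.size` is an assumed key, shown only to
# illustrate JSON path access; the full JSON column is always available.
video_table.select(
    video_table.video,
    video_table.metadata,
    video_table.metadata.size
).show()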
Create Transcriptions
Now we'll add a step to create transcriptions of our videos. As mentioned above, we're going to use the Whisper library for this, running locally. Pixeltable has a built-in function, whisper.transcribe, that serves as an adapter for the Whisper library's transcription capability. All we have to do is add a computed column that calls this function:
from pixeltable.functions import whisper
video_table['transcription'] = whisper.transcribe(
    audio=video_table.audio, model='base.en'
)
video_table.select(
    video_table.video,
    video_table.transcription.text
).show()
Computing cells: 100%|████████████████████████████████████████████| 2/2 [00:05<00:00, 2.60s/ cells]
Added 2 column values with 0 errors.
In order to index the transcriptions, we'll first need to split them into sentences. We can do this using Pixeltable's built-in StringSplitter iterator.
from pixeltable.iterators.string import StringSplitter
sentences_view = pxt.create_view(
    'transcription_demo.sentences_view',
    video_table,
    iterator=StringSplitter.create(
        text=video_table.transcription.text,
        separators='sentence'
    )
)
Inserting rows into `sentences_view`: 25 rows [00:00, 8918.74 rows/s]
Created view `sentences_view` with 25 rows, 0 exceptions.
The StringSplitter creates a new view, with the audio transcriptions broken into individual, one-sentence chunks.
sentences_view.select(
    sentences_view.pos,
    sentences_view.text
).show(8)
pos | text |
---|---|
0 | of experiencing self versus remembering self. |
1 | I was hoping you can give a simple answer of how we should live life. |
2 | Based on the fact that our memories could be a source of happiness or could be the primary source of happiness, that an event when experienced bears its fruits the most when it's remembered over and over and over and over. |
3 | And maybe there is some wisdom in the fact that we can control to some degree how we remember how we evolve our memory of it, such that it can maximize the long-term happiness of that repeated experience. |
4 | Oh, well, first I'll say I wish I could take you on the road with me. |
5 | That was such a great description. |
6 | Can I be your opening answer? |
7 | Oh my God, no, I'm gonna open for you, dude. |
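Because sentences_view is a component view of video_table, each sentence row also carries the base table's columns, so any chunk can be traced back to the video it came from. A quick sketch:
# Each chunk in the view retains the base table's columns,
# so we can see which video each sentence belongs to.
sentences_view.select(
    sentences_view.video,
    sentences_view.pos,
    sentences_view.text
).show(3)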
Add an Embedding Index
Next, let's use the Huggingface sentence_transformers library to create an embedding index of our sentences, attaching it to the text column of our sentences_view.
from pixeltable.functions.huggingface import sentence_transformer
@pxt.expr_udf
def e5_embed(text: str) -> np.ndarray:
    return sentence_transformer(text, model_id='intfloat/e5-large-v2')
sentences_view.add_embedding_index('text', string_embed=e5_embed)
Computing cells: 100%|██████████████████████████████████████████| 25/25 [00:01<00:00, 18.67 cells/s]
We can do a simple lookup to test our new index. The following snippet returns the results of a nearest-neighbor search on the input "What is happiness?"
sim = sentences_view.text.similarity('What is happiness?')
(
    sentences_view
    .order_by(sim, asc=False)
    .limit(10)
    .select(sentences_view.text, similarity=sim)
    .collect()
)
text | similarity |
---|---|
Based on the fact that our memories could be a source of happiness or could be the primary source of happiness, that an event when experienced bears its fruits the most when it's remembered over and over and over and over. | 0.805 |
I was hoping you can give a simple answer of how we should live life. | 0.792 |
Why would we have this period of time that's so short when we're perfect, right? | 0.789 |
I want to really be | 0.788 |
Can I be your opening answer? | 0.785 |
of experiencing self versus remembering self. | 0.785 |
I need a prefrontal cortex | 0.785 |
And maybe there is some wisdom in the fact that we can control to some degree how we remember how we evolve our memory of it, such that it can maximize the long-term happiness of that repeated experience. | 0.785 |
What's the best way to do that? | 0.783 |
And it's like, I realize I have to redefine what optimal is because for most of the human condition, I think we had a series of stages of life where you have basically adults saying, okay, young adults saying, I've got a child and, you know, I'm part of this village and I have to hunt and forage and get things done. | 0.776 |
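If you expect to run this kind of lookup repeatedly, it's convenient to wrap the pattern above in a small helper. This is just a plain-Python sketch that reuses the similarity API shown above; the query string and default limit are arbitrary choices:
def search_sentences(query: str, limit: int = 5):
    # Nearest-neighbor lookup against the embedding index on sentences_view.text.
    sim = sentences_view.text.similarity(query)
    return (
        sentences_view
        .order_by(sim, asc=False)
        .limit(limit)
        .select(sentences_view.text, similarity=sim)
        .collect()
    )

search_sentences('How do memories shape happiness?')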
Incremental Updates
Incremental updates are a key feature of Pixeltable. Whenever a new video is added to the original table, all of its downstream computed columns are updated automatically. Let's demonstrate this by adding a third video to the table and seeing how the updates propagate through to the index.
video_table.insert(video=videos[2])
Computing cells: 100%|████████████████████████████████████████████| 3/3 [00:02<00:00, 1.31 cells/s]
Inserting rows into `video_table`: 1 rows [00:00, 394.94 rows/s]
Computing cells: 100%|████████████████████████████████████████████| 3/3 [00:02<00:00, 1.31 cells/s]
Inserting rows into `sentences_view`: 8 rows [00:00, 1084.89 rows/s]
Inserted 9 rows with 0 errors.
UpdateStatus(num_rows=9, num_computed_values=3, num_excs=0, updated_cols=[], cols_with_excs=[])
video_table.select(
    video_table.video,
    video_table.metadata,
    video_table.transcription.text
).show()
sim = sentences_view.text.similarity('What is happiness?')
(
    sentences_view
    .order_by(sim, asc=False)
    .limit(20)
    .select(sentences_view.text, similarity=sim)
    .collect()
)
text | similarity |
---|---|
Based on the fact that our memories could be a source of happiness or could be the primary source of happiness, that an event when experienced bears its fruits the most when it's remembered over and over and over and over. | 0.805 |
These are chemicals that are released during moments that tend to be biologically significant, prize, fear, stress, etc. | 0.798 |
I was hoping you can give a simple answer of how we should live life. | 0.792 |
Why would we have this period of time that's so short when we're perfect, right? | 0.789 |
I want to really be | 0.788 |
Can I be your opening answer? | 0.785 |
of experiencing self versus remembering self. | 0.785 |
I need a prefrontal cortex | 0.785 |
And maybe there is some wisdom in the fact that we can control to some degree how we remember how we evolve our memory of it, such that it can maximize the long-term happiness of that repeated experience. | 0.785 |
Essentially some mechanisms for which the brain can say prioritize the information that you carry with you into the future. | 0.783 |
Attention is a big factor as well, our ability to focus our attention on what's important. | 0.783 |
What's the best way to do that? | 0.783 |
And it's like, I realize I have to redefine what optimal is because for most of the human condition, I think we had a series of stages of life where you have basically adults saying, okay, young adults saying, I've got a child and, you know, I'm part of this village and I have to hunt and forage and get things done. | 0.776 |
about reusing information and making the most of what we already have. | 0.774 |
so I can stay focused on the big picture and the long haul goals. | 0.772 |
I don't want to be constrained by goals as much. | 0.767 |
Or optimal. | 0.767 |
That was such a great description. | 0.766 |
And so that's why basically again, what you see biologically is neuromodulators, for instance, these chemicals in the brain like norepinephrine, dopamine, serotonin. | 0.759 |
So one of my colleagues, Amishi Jia, she wrote a book called Peak Mind and talks about mindfulness as a method for improving attention and focus. | 0.756 |
We can see the new results showing up in sentences_view.
Using the OpenAI API
This concludes the portion of the tutorial that uses the locally installed Whisper library. Sometimes, it may be preferable to use the OpenAI API rather than a locally installed library. In this section we'll show how this can be done in Pixeltable, simply by using a different function to construct our computed columns.
Since this section relies on calling out to the OpenAI API, you'll need to have an API key, which you can enter below.
import os
import getpass
if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
OpenAI API Key: ········
from pixeltable.functions import openai
video_table['transcription_from_api'] = openai.transcriptions(
    video_table.audio, model='whisper-1'
)
Computing cells: 100%|████████████████████████████████████████████| 3/3 [00:09<00:00, 3.22s/ cells]
Added 3 column values with 0 errors.
Now let's compare the results from the local model and the API side-by-side.
video_table.select(
    video_table.video,
    video_table.transcription.text,
    video_table.transcription_from_api.text
).show()
They look pretty similar, which isn't surprising, since the OpenAI transcriptions endpoint runs on Whisper.
One difference is that the local library spits out a lot more information about the internal behavior of the model. Note that we've been selecting video_table.transcription.text in the preceding queries, which pulls out just the text field of the transcription results. The actual results are a sizable JSON structure that includes a lot of metadata. To see the full output, we can select video_table.transcription instead, to get the full JSON struct. Here's what it looks like (we'll select just one row, since it's a lot of output):
video_table.select(
    video_table.transcription,
    video_table.transcription_from_api
).show(1)
transcription | transcription_from_api |
---|---|
{"text": " of experiencing self versus remembering self. I was hoping you can give a simple answer of how we should live life. Based on the fact that our me ...... ion. Can I be your opening answer? Oh my God, no, I'm gonna open for you, dude. Otherwise it's like, you know, everybody leaves after you're done.", "language": "en", "segments": [{"id": 0, "end": 5., "seek": 0, "text": " of experiencing self versus remembering self.", "start": 0., "tokens": [50363, 286, 13456, 2116, 9051, 24865, 2116, 13, 50613], "avg_logprob": -0.282, "temperature": 0., "no_speech_prob": 0.213, "compression_ratio": 1.632}, {"id": 1, "end": 8.68, "seek": 0, "text": " I was hoping you can give a simple answer", "start": 6., "tokens": [50663, 314, 373, 7725, 345, 460, 1577, 257, 2829, 3280, 50797], "avg_logprob": -0.282, "temperature": 0., "no_speech_prob": 0.213, "compression_ratio": 1.632}, {"id": 2, "end": 10.2, "seek": 0, "text": " of how we should live life.", "start": 8.68, "tokens": [50797, 286, 703, 356, 815, 2107, 1204, 13, 50873], "avg_logprob": -0.282, "temperature": 0., "no_speech_prob": 0.213, "compression_ratio": 1.632}, {"id": 3, "end": 16.04, "seek": 0, "text": " Based on the fact that our memories", "start": 12.24, "tokens": [50975, 13403, 319, 262, 1109, 326, 674, 9846, 51165], "avg_logprob": -0.282, "temperature": 0., "no_speech_prob": 0.213, "compression_ratio": 1.632}, {"id": 4, "end": 17.84, "seek": 0, "text": " could be a source of happiness", "start": 16.04, "tokens": [51165, 714, 307, 257, 2723, 286, 12157, 51255], "avg_logprob": -0.282, "temperature": 0., "no_speech_prob": 0.213, "compression_ratio": 1.632}, {"id": 5, "end": 20.52, "seek": 0, "text": " or could be the primary source of happiness,", "start": 17.84, "tokens": [51255, 393, 714, 307, 262, 4165, 2723, 286, 12157, 11, 51389], "avg_logprob": -0.282, "temperature": 0., "no_speech_prob": 0.213, "compression_ratio": 1.632}, ..., {"id": 14, "end": 48.52, "seek": 2552, "text": " on the road with me.", "start": 47.52, "tokens": [51463, 319, 262, 2975, 351, 502, 13, 51513], "avg_logprob": -0.285, "temperature": 0., "no_speech_prob": 0.001, "compression_ratio": 1.611}, {"id": 15, "end": 50.52, "seek": 2552, "text": " That was such a great description.", "start": 48.52, "tokens": [51513, 1320, 373, 884, 257, 1049, 6764, 13, 51613], "avg_logprob": -0.285, "temperature": 0., "no_speech_prob": 0.001, "compression_ratio": 1.611}, {"id": 16, "end": 52.88, "seek": 2552, "text": " Can I be your opening answer?", "start": 51.52, "tokens": [51663, 1680, 314, 307, 534, 4756, 3280, 30, 51731], "avg_logprob": -0.285, "temperature": 0., "no_speech_prob": 0.001, "compression_ratio": 1.611}, {"id": 17, "end": 56.08, "seek": 5288, "text": " Oh my God, no, I'm gonna open for you, dude.", "start": 52.88, "tokens": [50363, 3966, 616, 1793, 11, 645, ..., 329, 345, 11, 18396, 13, 50523], "avg_logprob": -0.337, "temperature": 0., "no_speech_prob": 0.012, "compression_ratio": 1.121}, {"id": 18, "end": 57.28, "seek": 5288, "text": " Otherwise it's like, you know,", "start": 56.08, "tokens": [50523, 15323, 340, 338, 588, 11, 345, 760, 11, 50583], "avg_logprob": -0.337, "temperature": 0., "no_speech_prob": 0.012, "compression_ratio": 1.121}, {"id": 19, "end": 58.88, "seek": 5288, "text": " everybody leaves after you're done.", "start": 57.28, "tokens": [50583, 7288, 5667, 706, 345, 821, 1760, 13, 50663], "avg_logprob": -0.337, "temperature": 0., "no_speech_prob": 0.012, "compression_ratio": 1.121}]} | {"text": "of experiencing self versus remembering self, I was 
hoping you can give a simple answer of how we should live life. Based on the fact that our mem ...... ning excerpt? Oh my God, no, I'm gonna open for you, dude. Otherwise, it's like, you know, everybody leaves after you're done. Ha, ha, ha, ha, ha."} |
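Since the transcription is a JSON structure, we can also drill into it with path expressions rather than pulling the whole struct. For example, the following sketch pulls the text and timestamps of the first segment of each local-Whisper transcription (the segment fields are taken from the output above):
# Extract the first segment's text and timing from the local Whisper output.
video_table.select(
    video_table.transcription.segments[0].text,
    video_table.transcription.segments[0].start,
    video_table.transcription.segments[0].end
).show(1)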