This step-by-step guide shows you how to break down videos into segments, add timestamps, and generate concise summaries for each section. It simplifies learning, organization, and quick access to key video content.
This command installs all of the dependencies listed in the requirements.txt file:
pip install -r requirements.txt
Code Explanation:
pip is Python's package installer (Package Installer for Python).
install is the pip command that installs packages.
-r tells pip to install from a requirements file rather than a single package.
requirements.txt is a text file containing the list of packages to install, with their versions.
The requirements.txt file contains the following libraries and dependencies for this project; you can also install them individually. The complete file is reproduced after the list below.
pip install opencv-python==4.10.0.84
OpenCV-Python 4.10.0.84
is a computer vision library for video processing, image analysis, and object detection. It enables tasks like frame extraction, facial recognition, and real-time video analysis.
pip install pillow==11.1.0
Pillow 11.1.0
Python Imaging Library (Fork) for image processing (open, manipulate, save various formats).
pip install transformers==4.48.0
transformers 4.48.0
Hugging Face's NLP library providing state-of-the-art models (BERT, GPT, Whisper) for text, audio, and vision tasks.
pip install whisper-timestamped==1.15.8
whisper-timestamped 1.15.8
Python library for speech recognition with word-level timestamps using OpenAI's Whisper model. Enables precise audio transcription and alignment.
pip install openai==1.59.7
OpenAI 1.59.7
Python client library for accessing OpenAI's AI models (GPT, DALL·E, Whisper). Enables text generation, embeddings, and AI-powered content analysis.
pip install ffmpeg==1.4
FFmpeg 1.4
Multimedia framework for video/audio processing, conversion, and streaming; supports all major formats. Note that the FFmpeg command-line tool must also be installed on your system, since Whisper relies on it to extract audio from the video.
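Based on the versions listed above, the requirements.txt file for this project would look like this:
opencv-python==4.10.0.84
pillow==11.1.0
transformers==4.48.0
whisper-timestamped==1.15.8
openai==1.59.7
ffmpeg==1.4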
from src.utils import VideoFileClip, main_transcription
import os

def main(input_video):
    # Main function to process the video and its transcription
    transcription_json = "filtered_transcription.json"
    final_output_path = "chaptered_transcription.json"  # path for the final chaptered output

    # Generate the transcription if it is not already present
    if not os.path.exists(transcription_json):
        print(f"Transcription file '{transcription_json}' not found. Generating...")
        transcript = main_transcription(input_video, transcription_json)
        # The main_transcription function is located in src/utils.py and will be explained in Step 4
    else:
        print(f"Transcription file '{transcription_json}' already exists. Skipping transcription generation.")

if __name__ == "__main__":
    # Specify the input video path
    input_video = "input/2_Modular_Code_for_Training.mp4"
    # Run the main process
    main(input_video)
Output:
(.venv) PS E:\video analyzer> python engine_openai.py
Transcription file 'filtered_transcription.json' not found. Generating...
100%|████████████████████████████████████████████████████████████████| 90506/90506 [01:59<00:00, 755.27frames/s]
Filtered transcription completed and saved to filtered_transcription.json
(.venv) PS E:\video analyzer>
import json
import cv2
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
import whisper_timestamped as whisper
import ssl
from openai import OpenAI
import logging
import time
import os
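Before getting to the transcription function, note what the vision-related imports are for: cv2 grabs frames from the video, PIL converts them to images, and the BLIP classes from transformers caption them. The snippet below is only an illustrative sketch of how these pieces typically fit together, not the project's actual code; the model name and timestamp are assumptions made for the example.
import cv2
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Illustrative sketch: caption a single video frame with BLIP (not the project's actual code)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

cap = cv2.VideoCapture("input/2_Modular_Code_for_Training.mp4")  # the tutorial's input video
cap.set(cv2.CAP_PROP_POS_MSEC, 10_000)  # seek to the 10-second mark (arbitrary example timestamp)
ok, frame = cap.read()
cap.release()

if ok:
    # OpenCV returns frames in BGR order; convert to RGB before handing to PIL
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    inputs = processor(image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    print(processor.decode(out[0], skip_special_tokens=True))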
Code Explanation:
def main_transcription(input_video_path, output_json_path):
    # Generates a filtered transcription from a video
    audio = load_audio(input_video_path)                    # extract the audio track from the video
    model = whisper.load_model("tiny", "cpu")               # load the Whisper "tiny" model on the CPU
    transcription = transcribe_audio(model, audio)          # run speech recognition with timestamps
    filtered_result = filter_transcription(transcription)   # keep only the fields needed downstream
    save_transcription(output_json_path, filtered_result)   # write the result to JSON
    print(f"Filtered transcription completed and saved to {output_json_path}")
Output:
Filtered transcription completed and saved to filtered_transcription.json
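The helper functions called above (load_audio, transcribe_audio, filter_transcription, save_transcription) live in src/utils.py and are covered in Step 4. The sketch below is only a hypothetical illustration of their general shape, assuming they are thin wrappers around whisper_timestamped and Python's json module; the actual implementations may differ.
# Illustrative sketch only: the real helpers live in src/utils.py (see Step 4)
import json
import whisper_timestamped as whisper

def load_audio(input_video_path):
    # whisper_timestamped can decode the audio track directly from a video file
    # (it uses ffmpeg under the hood)
    return whisper.load_audio(input_video_path)

def transcribe_audio(model, audio):
    # Returns a dict with the full "text" and a list of timestamped "segments"
    return whisper.transcribe(model, audio)

def filter_transcription(transcription):
    # Keep only the fields needed for chaptering: start, end, and text of each segment
    return [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in transcription["segments"]
    ]

def save_transcription(output_json_path, filtered_result):
    # Write the filtered segments to disk as JSON
    with open(output_json_path, "w", encoding="utf-8") as f:
        json.dump(filtered_result, f, indent=2)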
Code Explanation: