Automatic Subtitle creation and translation (to English) with OpenAI’s Whisper

I’ve been having the issue over the last decades that the available subtitles from the well known sites are quite often not suitable for the videos that I have saved locally. Often downloaded videos, especially from YouTube, don’t have sub-titles, which is a requirement in my household, as we speak different languages. (A guide on how to download playlists or even whole channels from YouTube, can be found here) The cumbersome way of getting proper subtitles was to manually extract the audio, upload to YouTube and then download the .vtt.
In this age of emerging A.I., there is an easier way. Hence this Python Script, which will extract the audio and with the help of OpenAI’s Whisper, transcribe it (and translate it to English if so wished) to a subtitle file (SRT) to be placed directly with the video file.

I found that the model “small” is sufficient for most videos, however when you have a video with a lot of idioms or non-standard phrasing, “medium” is a better option. While a lot more accurate, the latter will take longer to subscribe the video and uses more resources.
Note: I’m using Ubuntu on WSL2.0 to run this, as the Nvidia GPU integration is a lot better in Linux than directly on Windows.

Install Required Software

For GPU processing with a Nvidia GPU, you will need to install the Nvidia CUDA Toolkit.

  • Python: Ensure you have Python downloaded, installed and available on your system.
  • FFmpeg: on WSL this can be installed with “sudo apt install ffmpeg”
  • FFmpeg-python: on WSL install with: “pip install ffmpeg-python”
  • tqdm: on WSL install with: “pip install tqdm”
  • OpenAI-Whisper: on WSL, install with: “pip install openai-whisper”

Once installed, copy the script below into your favourite editor and save it as a .py file. To run, use any of the options below.

The Script has the following options:

  • --video_dir <path>: Specifies a directory containing video files to process. The script will recursively search for video files within this directory.
  • --model <model_name>: Selects the Whisper model to use for transcription. Available models can be listed by running the script with the --help option. Common models include “tiny”, “base”, “small”, “medium”, and “large”.
  • --language <language_code>: Sets the language code for transcription or translation. Use ‘auto’ for automatic detection, or specify a language code (e.g., “en” for English).
  • --task <task>: Determines the task to perform. Options are “transcribe” for transcription of the original audio without translation, or “translate” for translating the audio content to English.
  • --generate_srt: A flag that, when set, instructs the script to generate SRT (SubRip subtitle) files for the processed video files. This is set to True by default.
  • <videos>: One or more paths to individual video files to process. This argument allows for the processing of specific video files without specifying a directory.

Usage Examples:

  1. Transcribing All Videos in a Directory:
    python script_name.py --video_dir "/path/to/video/files" --model base --language en --task transcribe
    This command transcribes all videos in the specified directory using the “base” Whisper model and English language.
  2. Transcribing a Single Video File:
    python script_name.py --model small --language es --task translate /path/to/video/file.mp4
    This command transcribes the Spanish audio in the specified video file to an English .srt file using the “small” Whisper model.
  3. Processing Multiple Specific Video Files without Translation:
    python script_name.py --model tiny --language auto --generate_srt video1.mp4 video2.mkv
    This command transcribes two specific video files using the “tiny” Whisper model, with language detection set to automatic. SRT files will be generated for each video in the detected language.

Download the .zip file here or copy the script below

Python Script:

import argparse
import ffmpeg
import glob
import os
import sys
import tempfile
import torch
import whisper
from tqdm import tqdm

def setup_environment():
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    print("CUDA GPU is available. Whisper will utilise the GPU for faster processing." if torch.cuda.is_available() else "CUDA GPU not available, falling back to CPU.")

def str2bool(v):
    if isinstance(v, bool):
        return v
    return v.lower() in ('yes', 'true', 't', 'y', '1')

def format_timestamp(seconds):
    hours, remainder = divmod(int(seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    milliseconds = int((seconds - int(seconds)) * 1000)
    return f"{hours:02}:{minutes:02}:{seconds:02},{milliseconds:03}"

def write_srt(segments, srt_path):
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, segment in enumerate(segments, start=1):
            f.write(f"{i}\n{format_timestamp(segment['start'])} --> {format_timestamp(segment['end'])}\n{segment['text']}\n\n")

def extract_audio(video_path):
    audio_path = os.path.join(tempfile.mkdtemp(), os.path.basename(video_path) + ".wav")
    ffmpeg.input(video_path).output(audio_path, acodec="pcm_s16le", ac=1, ar="16000").run(quiet=True)
    return audio_path

def process_video(video_path, model, language, task):
    srt_path = f"{os.path.splitext(video_path)[0]}.{language}.srt"
    if os.path.exists(srt_path):
        print(f"Skipping {video_path}, SRT file already exists.")
        return

    print(f"Processing: {video_path}")
    audio_path = extract_audio(video_path)
    result = model.transcribe(audio_path, language=language if language != "auto" else None, task=task)
    write_srt(result['segments'], srt_path)
    os.remove(audio_path)
    print(f"Generated SRT: {srt_path}")

def collect_videos(video_dir, videos_arg, language):
    videos_to_process = []
    supported_extensions = ["mp4", "mkv", "ts", "flv", "avi", "mov", "webm", "wmv", "mpg", "m4v"]
    if video_dir:
        for ext in supported_extensions:
            videos_to_process += glob.glob(os.path.join(video_dir, "**", f"*.{ext}"), recursive=True)
    videos_to_process += videos_arg
    # Filter out videos that already have an SRT file for the specified language.
    return [v for v in videos_to_process if not os.path.exists(f"{os.path.splitext(v)[0]}.{language}.srt")]

def main():
    setup_environment()

    parser = argparse.ArgumentParser(description="Transcribe and optionally translate videos using OpenAI Whisper.")
    parser.add_argument("--video_dir", type=str, help="Directory containing video files to process.")
    parser.add_argument("--model", default="base", choices=whisper.available_models(), help="Whisper model to use for transcription. Options are: tiny, base, small, medium, large, large-v2 & large-v3")
    parser.add_argument("--language", type=str, default="en", help="Language code for transcription or translation.")
    parser.add_argument("--task", type=str, default="transcribe", choices=["transcribe", "translate"], help="Task to perform.")
    parser.add_argument("videos", nargs="*", type=str, help="Individual video files to process.")

    args = parser.parse_args()
    model = whisper.load_model(args.model)

    videos_to_process = collect_videos(args.video_dir, args.videos, args.language)

    if not videos_to_process:
        print("No videos found to process.")
        sys.exit(1)

    for video_path in tqdm(videos_to_process, desc="Processing Videos"):
        process_video(video_path, model, args.language, args.task)

if __name__ == "__main__":
    main()

Leave a Reply

Your email address will not be published. Required fields are marked *