A Python CLI that generates structured transcripts from any YouTube video.

The tool follows a simple strategy: if a video already has captions, it downloads and parses them. If captions are unavailable, it falls back to local speech-to-text by downloading the video, extracting its audio, and transcribing it with faster-whisper. Both paths produce the same timestamped transcript format.

The primary design goals are:

- Offline-first transcription
- No paid APIs
- Minimal external dependencies
- Consistent output format for downstream NLP and retrieval tasks

Pipeline Overview

The pipeline begins by accepting a YouTube URL and retrieving the video's metadata. If subtitles are available, they are downloaded as an SRT file and parsed directly into transcript segments.
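
As a rough illustration of the parsing step, the sketch below converts SRT text into timestamped segments. The names `parse_srt`, `to_seconds`, and `TIMING` are hypothetical and not taken from the project; the real parser may differ.

```python
import re

# Matches an SRT timing line such as "00:00:04,330 --> 00:00:09,010".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_seconds(hours, minutes, seconds, millis):
    """Convert the four captured timestamp fields to float seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_srt(srt_text):
    """Convert SRT text into a list of {"start", "end", "text"} dicts."""
    segments = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        for index, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                parts = match.groups()
                # Everything after the timing line is the cue text.
                text = " ".join(lines[index + 1:])
                if text:
                    segments.append({
                        "start": to_seconds(*parts[:4]),
                        "end": to_seconds(*parts[4:]),
                        "text": text,
                    })
                break
    return segments
```

Multi-line cues are joined with a space, and cues without text are skipped.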

When subtitles are unavailable, the application downloads the video, extracts its audio using FFmpeg, and transcribes it locally using faster-whisper.

Both execution paths converge into a common normalization stage where every transcript is represented as an array of timestamped segments. Finally, the processed transcript is exported in two formats:

- transcript.txt: human-readable transcript
- transcript.json: structured JSON for programmatic use
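
The control flow described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `build_transcript` and the two callables it receives are hypothetical stand-ins for the caption path and the speech-to-text path.

```python
def build_transcript(url, fetch_caption_segments, transcribe_locally):
    """Return (source, segments), trying captions first and then the fallback.

    Both callables are hypothetical: each takes a URL and returns a list of
    {"start", "end", "text"} dicts (empty or None when nothing is available).
    """
    segments = fetch_caption_segments(url)
    source = "captions"
    if not segments:
        # No usable captions: fall back to local speech-to-text.
        segments = transcribe_locally(url)
        source = "speech-to-text"
    return source, segments
```

Because both paths return the same segment structure, the export step does not need to know which one ran.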

Local Speech-to-Text with faster-whisper

When subtitles are unavailable, transcription is performed locally using faster-whisper.

Unlike OpenAI's reference Whisper implementation, faster-whisper runs inference through CTranslate2, an optimized C++ inference engine. With compute_type="int8", as used here, the model weights are quantized to INT8, which reduces memory usage and speeds up CPU inference, typically with little accuracy loss on clean audio.

from faster_whisper import WhisperModel

model = WhisperModel(
    "base",
    device="cpu",
    compute_type="int8",
)

Transcription is straightforward:

def transcribe_audio(audio_path):
    whisper_segments, info = model.transcribe(
        audio_path,
        vad_filter=True
    )

    print(
        "Detected language '%s' with probability %f"
        % (info.language, info.language_probability)
    )

    parsed_segments = []

    for segment in whisper_segments:
        parsed_segments.append({
            "start": segment.start,
            "end": segment.end,
            "text": segment.text.strip()
        })

    return parsed_segments

model.transcribe returns a lazy generator of timestamped segments along with metadata such as the detected language and its probability. Transcription runs as the loop consumes the generator, not when transcribe is called. Each segment is converted into a normalized schema so that subtitle-based and speech-to-text transcripts share an identical structure.

Audio Extraction with FFmpeg

Whisper models operate on mono audio sampled at 16 kHz, so the pipeline converts the audio to that format up front. Rather than decoding the video stream, FFmpeg extracts only the audio track, which avoids unnecessary disk I/O and processing overhead.

import subprocess
from pathlib import Path


def extract_audio(video_path, output_dir):
    output_path = Path(output_dir) / "audio.wav"

    command = [
        "ffmpeg",
        "-y",
        "-i", video_path,
        "-vn",
        "-ac", "1",
        "-ar", "16000",
        str(output_path),
    ]

    try:
        subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            check=True,
        )

        print("Audio extracted successfully!")
        return output_path

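    except FileNotFoundError as error:
        raise RuntimeError(
            "FFmpeg is not installed or not available in PATH."
        ) from error

    except subprocess.CalledProcessError as error:
        # stderr was captured as bytes; surface FFmpeg's message and fail
        # loudly instead of silently returning None.
        details = error.stderr.decode(errors="replace").strip()
        raise RuntimeError(f"Error extracting audio: {details}") from error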

The important FFmpeg flags are:

| Flag | Purpose |
| --- | --- |
| -vn | Discards the video stream entirely |
| -ac 1 | Downmixes audio to mono |
| -ar 16000 | Resamples audio to 16 kHz |

Because only the audio stream is processed, extraction is significantly faster than decoding the entire video. The generated WAV file can then be passed directly to faster-whisper for transcription.
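
To confirm that the extracted file really is mono at 16 kHz, a quick check with Python's standard wave module is possible. This is an optional sketch; `describe_wav` and `is_whisper_ready` are hypothetical helpers, and the check assumes FFmpeg wrote a plain PCM WAV file, which is its default for .wav output.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    return channels, sample_rate, duration


def is_whisper_ready(path):
    """True when the file is mono and sampled at 16 kHz."""
    channels, sample_rate, _ = describe_wav(path)
    return channels == 1 and sample_rate == 16000
```

Running this right after extract_audio catches a misconfigured FFmpeg command before the slower transcription step starts.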

Unified Transcript Schema

Transcripts from YouTube captions and from local speech recognition are normalized into the same structure before being written to disk.

{
  "segments": [
    {
      "start": 0.0,
      "end": 4.32,
      "text": "Hello everyone."
    },
    {
      "start": 4.33,
      "end": 9.01,
      "text": "Today we're looking at..."
    }
  ]
}

A plain-text version is also generated:

[0.00 → 4.32]
Hello everyone.

[4.33 → 9.01]
Today we're looking at...
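
A minimal sketch of how both files could be produced from the normalized segments is shown below. The function names `format_text` and `write_outputs` are assumptions for illustration; only the file names and layouts come from the examples above.

```python
import json
from pathlib import Path


def format_text(segments):
    """Render segments as "[start → end]" headers followed by their text."""
    blocks = [
        f"[{seg['start']:.2f} → {seg['end']:.2f}]\n{seg['text']}"
        for seg in segments
    ]
    return "\n\n".join(blocks) + "\n"


def write_outputs(segments, output_dir):
    """Write transcript.txt and transcript.json into output_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    txt_path = output_dir / "transcript.txt"
    json_path = output_dir / "transcript.json"
    txt_path.write_text(format_text(segments), encoding="utf-8")
    json_path.write_text(
        # ensure_ascii=False keeps non-English transcripts readable.
        json.dumps({"segments": segments}, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return txt_path, json_path
```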

Representing timestamps as floating-point seconds makes the output immediately usable for downstream tasks such as:

- RAG document chunking
- Semantic search
- Subtitle generation
- Speaker diarization
- Video indexing
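
As one example of a downstream use, the sketch below groups segments into time-bounded chunks of the kind a RAG pipeline might index. It is not part of the project; `chunk_segments` and its 30-second default are assumptions chosen for illustration.

```python
def chunk_segments(segments, max_seconds=30.0):
    """Group consecutive segments into chunks spanning at most max_seconds.

    A single segment longer than max_seconds becomes its own chunk.
    """
    groups = []
    current = []
    for segment in segments:
        # Start a new chunk when adding this segment would exceed the window.
        if current and segment["end"] - current[0]["start"] > max_seconds:
            groups.append(current)
            current = []
        current.append(segment)
    if current:
        groups.append(current)
    return [
        {
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "text": " ".join(seg["text"] for seg in group),
        }
        for group in groups
    ]
```

Each chunk keeps its start and end time, so a search hit can link back to the exact position in the video.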

Tech Stack

yt-dlp

Downloads YouTube videos and extracts subtitle tracks (.srt/.vtt) when available. It also handles format selection, authentication cookies, and rate limiting.
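
One way to decide between the caption path and the speech-to-text path is to inspect the metadata yt-dlp returns. The sketch below is an assumption about how that check could look, not the project's code: `fetch_info` and `find_caption_source` are hypothetical names, and the distinction between manual and automatic captions is added here for illustration.

```python
def fetch_info(url):
    """Fetch video metadata without downloading the media (requires yt-dlp)."""
    import yt_dlp  # third-party; imported lazily so the helper below has no dependencies

    with yt_dlp.YoutubeDL({"quiet": True, "skip_download": True}) as ydl:
        return ydl.extract_info(url, download=False)


def find_caption_source(info, lang="en"):
    """Return "manual", "automatic", or None for the requested language.

    `info` is assumed to be the dict returned by fetch_info, where
    "subtitles" and "automatic_captions" map language codes to track lists.
    """
    if (info.get("subtitles") or {}).get(lang):
        return "manual"
    if (info.get("automatic_captions") or {}).get(lang):
        return "automatic"
    return None
```

A result of None would trigger the faster-whisper fallback.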

faster-whisper

A CTranslate2-powered implementation of OpenAI's Whisper model that performs fully local speech-to-text inference. It supports quantized models, language detection, and timestamped transcription without requiring an API key.

FFmpeg

A system-level multimedia toolkit used to extract and convert audio. In this project it is invoked via Python's subprocess module rather than through a Python wrapper.

Python 3.8+

Provides the orchestration layer, using:

- argparse for the CLI
- pathlib for cross-platform file handling
- json for transcript serialization
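
For illustration, a command-line interface with the --url option used in the run instructions could be defined as below. This is a sketch only; `build_parser` is a hypothetical name, and the real main.py may define additional options.

```python
import argparse


def build_parser():
    """Build a parser exposing the --url option used to run the tool."""
    parser = argparse.ArgumentParser(
        description="Generate a timestamped transcript from a YouTube video."
    )
    parser.add_argument("--url", required=True, help="YouTube video URL")
    return parser


def parse_cli(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and return the namespace."""
    return build_parser().parse_args(argv)
```

Accepting argv as a parameter makes the parser easy to exercise in tests without touching sys.argv.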

Running the Project

1. Create a virtual environment
```
python3 -m venv venv
```
2. Activate it

Linux / macOS

source venv/bin/activate

Windows (Command Prompt)

venv\Scripts\activate

Windows (PowerShell)

venv\Scripts\Activate.ps1

3. Install dependencies

pip install -r requirements.txt

4. Run the application

python3 main.py --url <youtube_video_url>

Example:

python3 main.py --url "https://youtu.be/jNQXAC9IVRw?si=-fcBEgeyhDUJw47M"