A Python CLI that generates structured transcripts from any YouTube video.
The tool follows a simple strategy: if a video already contains captions, it downloads and parses them. If captions are unavailable, it automatically falls back to local speech-to-text by downloading the video, extracting its audio, and transcribing it with faster-whisper. Regardless of the execution path, both outputs are normalized into the same timestamped transcript format.
The primary design goals are:
Offline-first transcription
No paid APIs
Minimal external dependencies
Consistent output format for downstream NLP and retrieval tasks
Pipeline Overview
The pipeline begins by accepting a YouTube URL and retrieving the video's metadata. If subtitles are available, they are downloaded as an SRT file and parsed directly into transcript segments.
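The caption path depends on turning SRT text into segments. The following is a minimal sketch of such a parser, not the project's actual code: `parse_srt` and its helpers are hypothetical names, and the output keys (`start`, `end`, `text`) follow the transcript schema described later in this README.

```python
import re

# Matches an SRT timing line such as "00:00:04,330 --> 00:00:09,010".
_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _to_seconds(hours, minutes, seconds, millis):
    """Convert the captured timestamp fields into floating-point seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_srt(srt_text):
    """Convert SRT text into a list of {"start", "end", "text"} dicts."""
    segments = []
    # Cues are separated by blank lines; normalise Windows line endings first.
    normalised = srt_text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalised):
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        for index, line in enumerate(lines):
            match = _TIMING.search(line)
            if match:
                break
        else:
            continue  # no timing line found: skip the malformed block
        # Everything after the timing line is caption text (possibly multi-line).
        text = " ".join(lines[index + 1:])
        if not text:
            continue
        groups = match.groups()
        segments.append({
            "start": _to_seconds(*groups[:4]),
            "end": _to_seconds(*groups[4:]),
            "text": text,
        })
    return segments
```

Multi-line cues are joined with a single space so that each segment carries one text string, matching what the speech-to-text path produces.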
When subtitles are unavailable, the application downloads the video, extracts its audio using FFmpeg, and transcribes it locally using faster-whisper.
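The two branches can be sketched as a single function that tries captions first and only then falls back to local transcription. This is an illustration under stated assumptions: `build_transcript` is a hypothetical name, and the two callables passed in are stand-ins for the project's real caption and speech-to-text steps.

```python
def build_transcript(url, fetch_caption_segments, transcribe_locally):
    """Return (segments, source), preferring captions over local speech-to-text.

    Both callables are hypothetical stand-ins:
      fetch_caption_segments(url) -> list of segments, or None/[] without captions
      transcribe_locally(url)     -> list of segments from the faster-whisper path
    """
    segments = fetch_caption_segments(url)
    if segments:
        return segments, "captions"
    # No usable captions: download the video, extract audio, transcribe locally.
    return transcribe_locally(url), "speech-to-text"
```

Passing the steps in as arguments keeps the expensive fallback out of the way when captions exist and makes the branching easy to test with stubs.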
Both execution paths converge into a common normalization stage where every transcript is represented as an array of timestamped segments. Finally, the processed transcript is exported in two formats:
transcript.txt: human-readable transcript
transcript.json: structured JSON for programmatic use
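The export step might look like the sketch below. The function names (`format_txt`, `export_transcript`) and the `output_dir` parameter are assumptions for illustration; the file names and the plain-text layout follow the samples in this README.

```python
import json
from pathlib import Path


def format_txt(segments):
    """Render segments as a bracketed timestamp line followed by the text."""
    lines = []
    for seg in segments:
        lines.append(f"[{seg['start']:.2f} → {seg['end']:.2f}]")
        lines.append(seg["text"])
    return "\n".join(lines) + "\n"


def export_transcript(segments, output_dir):
    """Write transcript.txt and transcript.json; return both paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    txt_path = out / "transcript.txt"
    json_path = out / "transcript.json"
    txt_path.write_text(format_txt(segments), encoding="utf-8")
    # ensure_ascii=False keeps non-English transcripts readable in the JSON file.
    json_path.write_text(
        json.dumps({"segments": segments}, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return txt_path, json_path
```

Both files are derived from the same segment list, so they cannot drift apart.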
Local Speech-to-Text with faster-whisper
When subtitles are unavailable, transcription is performed locally using faster-whisper.
Unlike OpenAI's reference Whisper implementation, faster-whisper runs inference through CTranslate2, a highly optimized C++ inference engine. It can quantize model weights to INT8 when compute_type="int8" is requested (as in the configuration below), which significantly reduces memory usage and typically improves CPU inference speed with minimal accuracy loss on clean audio.
    from faster_whisper import WhisperModel

    model = WhisperModel(
        "base",
        device="cpu",
        compute_type="int8",
    )

Transcription is straightforward:
    def transcribe_audio(audio_path):
        # str() lets callers pass either a string or a pathlib.Path.
        whisper_segments, info = model.transcribe(
            str(audio_path),
            vad_filter=True
        )
        print(
            "Detected language '%s' with probability %f"
            % (info.language, info.language_probability)
        )
        # whisper_segments is lazy: decoding happens while we iterate over it.
        parsed_segments = []
        for segment in whisper_segments:
            parsed_segments.append({
                "start": segment.start,
                "end": segment.end,
                "text": segment.text.strip()
            })
        return parsed_segments

The model returns an iterator of timestamped segments along with metadata such as the detected language and its probability. Each segment is converted into a normalized schema so that subtitle-based and speech-to-text transcripts share an identical structure.
Audio Extraction with FFmpeg
Whisper models operate on mono, 16 kHz audio, so the pipeline converts the soundtrack to a mono, 16 kHz WAV file up front. FFmpeg is told to skip the video stream and process only the audio, which avoids video decoding and keeps the intermediate file small.
    import subprocess
    from pathlib import Path

    def extract_audio(video_path, output_dir):
        output_path = Path(output_dir) / "audio.wav"
        command = [
            "ffmpeg",
            "-y",  # overwrite an existing output file without prompting
            "-i", str(video_path),
            "-vn",
            "-ac", "1",
            "-ar", "16000",
            str(output_path),
        ]
        try:
            subprocess.run(
                command,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.PIPE,
                check=True,
            )
        except FileNotFoundError:
            raise RuntimeError(
                "FFmpeg is not installed or available in PATH."
            ) from None
        except subprocess.CalledProcessError as error:
            # Surface FFmpeg's own message instead of silently returning None.
            details = error.stderr.decode(errors="replace").strip()
            raise RuntimeError(f"Error extracting audio: {details}") from error
        print("Audio extracted successfully!")
        return output_path

The important FFmpeg flags are:
| Flag | Purpose |
|------|---------|
| -vn | Discards the video stream entirely |
| -ac 1 | Downmixes audio to mono |
| -ar 16000 | Resamples audio to 16 kHz |
Because only the audio stream is processed, extraction is significantly faster than decoding the entire video. The generated WAV file can then be passed directly to faster-whisper for transcription.
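A quick sanity check on the extracted file can catch a misconfigured FFmpeg command before transcription starts. The sketch below uses only the standard-library wave module; `is_whisper_ready` is a hypothetical helper name and not part of the project.

```python
import wave


def is_whisper_ready(wav_path):
    """Return True if the WAV file is mono and sampled at 16 kHz."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getnchannels() == 1 and wav_file.getframerate() == 16000
```

The wave module reads uncompressed PCM WAV files, which is what FFmpeg writes for a .wav output by default.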
Unified Transcript Schema
Whether the transcript originates from YouTube captions or local speech recognition, both are normalized into the same structure before being written to disk.
    {
      "segments": [
        {
          "start": 0.0,
          "end": 4.32,
          "text": "Hello everyone."
        },
        {
          "start": 4.33,
          "end": 9.01,
          "text": "Today we're looking at..."
        }
      ]
    }

A plain-text version is also generated:
    [0.00 → 4.32]
    Hello everyone.
    [4.33 → 9.01]
    Today we're looking at...

Representing timestamps as floating-point seconds makes the output immediately usable for downstream tasks such as:
RAG document chunking
Semantic search
Subtitle generation
Speaker diarization
Video indexing
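As an illustration of the first item, segments can be grouped into time-bounded chunks with plain arithmetic on the start and end fields. This is a sketch, not part of the tool: `chunk_segments`, `_merge`, and the 30-second default are assumptions chosen for the example.

```python
def _merge(group):
    """Collapse consecutive segments into one chunk with combined text."""
    return {
        "start": group[0]["start"],
        "end": group[-1]["end"],
        "text": " ".join(seg["text"] for seg in group),
    }


def chunk_segments(segments, max_seconds=30.0):
    """Group consecutive segments into chunks spanning at most max_seconds.

    A single segment longer than max_seconds is kept whole as its own chunk.
    """
    chunks = []
    current = []
    for seg in segments:
        # Start a new chunk when adding this segment would exceed the window.
        if current and seg["end"] - current[0]["start"] > max_seconds:
            chunks.append(_merge(current))
            current = []
        current.append(seg)
    if current:
        chunks.append(_merge(current))
    return chunks
```

Each chunk keeps its own start and end, so a retrieval hit can be mapped back to a position in the video.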
Tech Stack
yt-dlp
Downloads YouTube videos and extracts subtitle tracks (.srt/.vtt) when available. It also handles format selection, authentication cookies, and rate limiting.
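Whether captions exist can be decided from the metadata alone, before anything is downloaded. The sketch below assumes the metadata dict returned by yt-dlp's `extract_info`, which lists uploaded tracks under `subtitles` and auto-generated ones under `automatic_captions`. Both function names are hypothetical, and whether this project accepts auto-generated captions is an assumption.

```python
def available_caption_langs(info, include_automatic=True):
    """Return the sorted language codes that have a caption track.

    `info` is the metadata dict returned by yt-dlp's extract_info().
    """
    langs = set(info.get("subtitles") or {})
    if include_automatic:
        langs |= set(info.get("automatic_captions") or {})
    return sorted(langs)


def fetch_info(url):
    """Fetch metadata only (no media download). Needs yt-dlp and network access."""
    import yt_dlp  # imported lazily so the helper above stays dependency-free

    options = {"quiet": True, "skip_download": True}
    with yt_dlp.YoutubeDL(options) as ydl:
        return ydl.extract_info(url, download=False)
```

An empty result from `available_caption_langs(fetch_info(url))` would be the signal to take the speech-to-text path.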
faster-whisper
A CTranslate2-powered implementation of OpenAI's Whisper model that performs fully local speech-to-text inference. It supports quantized models, language detection, and timestamped transcription without requiring an API key.
FFmpeg
A system-level multimedia toolkit used to extract and convert audio. In this project it is invoked via Python's subprocess module rather than through a Python wrapper.
Python 3.8+
Provides the orchestration layer, using:
argparse for the CLI
pathlib for cross-platform file handling
json for transcript serialization
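The CLI layer could be as small as the sketch below. Only the `--url` flag is taken from this README's usage example; the `build_parser` name, the description, and the help text are assumptions.

```python
import argparse


def build_parser():
    """Build the command-line parser with the single --url option."""
    parser = argparse.ArgumentParser(
        description="Generate a structured transcript from a YouTube video."
    )
    parser.add_argument(
        "--url",
        required=True,
        help="URL of the YouTube video to transcribe",
    )
    return parser
```

In main.py this would be used as `args = build_parser().parse_args()`, after which `args.url` holds the video URL.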
Running the Project
1. Create a virtual environment

    python3 -m venv venv

2. Activate it

Linux / macOS

    source venv/bin/activate

Windows (Command Prompt)

    venv\Scripts\activate

Windows (PowerShell)

    venv\Scripts\Activate.ps1

3. Install dependencies

    pip install -r requirements.txt

4. Run the application

    python3 main.py --url <youtube_video_url>

Example:

    python3 main.py --url "https://youtu.be/jNQXAC9IVRw?si=-fcBEgeyhDUJw47M"
