| docs | ||
| fonts | ||
| src | ||
| .gitignore | ||
| krunko_cli.py | ||
| LICENSE.md | ||
| README.md | ||
| requirements.txt | ||
| requirements_yt_upload.txt | ||
| ruff.toml | ||
| youtube_upload.py | ||
Krunkó - Automatic Karaoke Generator
Krunkó turns any YouTube video into a karaoke video, automatically! Give it a link to a song and it downloads the audio, separates the vocals from the backing track, transcribes the lyrics with timing information, corrects them using an AI model, and renders a video with synchronized on-screen lyrics. It even adds a little bouncing ball to help you follow along.
The whole pipeline can run locally.
Works well with Icelandic songs! (and many other languages too)
Example output
Example output video generated by Krunkó, using the song "Yfir til þín" by Spaugstofan:
Pipeline
- Download audio from YouTube
- Extract stems (bass, drums, other, vocals) from the audio
- Transcribe the vocals using OpenAI's Whisper model
- Automatically search for the correct lyrics online if a reference is not provided
- Correct the transcription with AI
- Combine the non-vocal audio (bass/drums/other) into a single instrumental track
- Generate a karaoke video with the corrected lyrics
Installation
First install yt-dlp, and make sure it is available in your system's PATH.
Then install the required dependencies:
pip install -r requirements.txt
Setup and configuration
AI model for lyric correction
For the best results, use an LLM to fix the transcription errors from the Whisper model.
Use --correction-model to specify the model to use for correction. The default is openai:gpt-5.5, for that to work supply a OpenAI API key as an environment variable before running the CLI tool:
export OPENAI_API_KEY="your_api_key_here"
But you can use other models, even local models, anything supported by pydantic-ai, see the pydantic-ai documentation for more details.
The AI lyric correction step significantly improves the quality of the transcribed lyrics. The tool still works without this step, but the transcription may contain more errors. Especially for non-English songs.
Usage
Minimal example
Pass a YouTube URL and Krunkó handles the rest:
python krunko_cli.py https://www.youtube.com/watch?v=tqYoJB5WYRY
It is strongly recommended to always specify --lang with the song's language code. Whisper's language auto-detection can fail or pick the wrong language, leading to garbled transcriptions. Even a correct guess is slower than an explicit hint:
python krunko_cli.py https://www.youtube.com/watch?v=tqYoJB5WYRY --lang is
Customization options
Here is an example using several common parameters:
python krunko_cli.py https://www.youtube.com/watch?v=tqYoJB5WYRY \
--lang is \
--title-text "Heyr mína bæn" \
--end-time 185 \
--font ./fonts/DejaVuSans-Bold.ttf \
--show-vocal-waveform \
--no-bouncing-ball \
--title-text— Override the song title shown on the opening screen and used as the default output filename. Useful when the YouTube title contains extra noise like channel names or upload dates.--start-time/--end-time— Crop the source audio to a specific time window (in seconds). Handy for skipping intros, outros, or focusing on a single section.--font— Path to a TTF font file to use for the rendered lyrics.--show-vocal-waveform— Display the vocal waveform below the progress bar for an extra visual aid.--no-bouncing-ball— Disable the bouncing ball if you prefer a simpler karaoke style.
There are many more parameters available. Run the following to see all of them:
python krunko_cli.py --help
Improving results
If the initial output has lyric errors, the following options can help:
-
--lang <code>— Always the first thing to try. Supplying the correct ISO 639-1 language code (e.g.isfor Icelandic,defor German) prevents Whisper from misidentifying the language and avoids a significant class of transcription errors. -
--lyrics-reference <file>— If the automatic lyric search did not find the correct lyrics, or if you have a better source, point Krunkó at a plain-text file containing the correct lyrics. The AI correction step uses this as a reference to align the transcription, which greatly improves word accuracy and spelling.python krunko_cli.py https://www.youtube.com/watch?v=yvpEcXBcrY8 \ --lang is \ --lyrics-reference lyrics/yfirtilthin.txt -
--auto-whisper-initial-prompt— Automatically constructs an initial prompt for the Whisper model based on the video title and passes it in before transcription begins. This can nudge Whisper toward the correct vocabulary and style for the song, which sometimes reduces hallucinations or fixes recurring word errors. Worth trying when transcription quality is inconsistent.python krunko_cli.py https://www.youtube.com/watch?v=yvpEcXBcrY8 \ --lang is \ --auto-whisper-initial-prompt
Manual lyric correction
If all else fails, you can manually edit the transcriptions.
Simple
Copy and edit the lyrics.txt file generated in the output directory. This is a plain text file containing the transcribed lyrics without timing information. Then use that as the reference for correction in the next run, using the --lyrics-reference parameter.
This might fix spelling errors and improve the overall quality of the lyrics, but it won't fix timing issues or misaligned words, or hallucinated words and sentences.
Advanced
JSON transcription files are stored in the cache/ directory after the first run. They contain text and timing information. Find the corresponding .json file for the song, edit it, and then run the CLI tool again with the same URL and it will use your corrected transcription.
Upload to YouTube
There is also a separate script for uploading the generated karaoke video to YouTube, youtube_upload.py. But I can't be bothered to write documentation for that.
Acknowledgements
This project builds on a set of free and open source tools:
- yt-dlp — downloading audio from YouTube.
- demucs — source separation for extracting vocals, bass, drums, and other stems from the downloaded audio.
- Whisper — automatic speech recognition used to transcribe the vocal stem into timing-aware text.
- faster-whisper — a faster Whisper implementation that can speed up transcription on local GPUs and CPUs.
- pydantic-ai — AI model integration for correcting and formatting the raw transcription into polished lyrics.
- pydub — audio processing and manipulation for combining stems and preparing the instrumental track.
- moviepy — video generation for rendering the karaoke visuals, text, and final output video.
- DuckDuckGo Search API — used for automatic lyric search when a reference is not provided.
- BeautifulSoup — web scraping for automatic lyric search when a reference is not provided.
