Self-Made Karaoke

I am definitely not the singer type. Never went to any karaoke and probably will never do. But for some reason I was intrigued to see, from a technical standpoint, how to create a karaoke song. How does one remove the vocals from a song and make the lyrics appear at the right time? I will show you the poor man’s approach of making your own karaoke songs and get them to play on a website.

As my main system runs Linux and I’m a terminal guy, I will only use open source command line tools. Most of the tools are probably also available for your platform of choice.

Getting A Song

The example song we will use is “Play Crack the Sky” by Brand New, mainly because of the awesome lyrics. Most songs nowadays are available on YouTube, and this one is no exception. So, let’s just download it from there. The best tool for this job is youtube-dl.

youtuble-dl supports a gazillion video platforms, not just YouTube. If you ever wanted to download a file from a streaming website, this tool should have you covered.

Let us first have a look at the available formats we can choose from with the option -F.

$ youtube-dl -F 'https://www.youtube.com/watch?v=--EeaSYoH04'
[youtube] --EeaSYoH04: Downloading webpage
[youtube] --EeaSYoH04: Downloading video info webpage
[youtube] --EeaSYoH04: Downloading js player vflO1GesB
[youtube] --EeaSYoH04: Downloading js player vflO1GesB
[info] Available formats for --EeaSYoH04:
format code  extension  resolution note
249          webm       audio only tiny   62k , opus @ 50k (48000Hz), 2.12MiB
250          webm       audio only tiny   80k , opus @ 70k (48000Hz), 2.79MiB
140          m4a        audio only tiny  129k , m4a_dash container, mp4a.40.2@128k (44100Hz), 5.03MiB
251          webm       audio only tiny  157k , opus @160k (48000Hz), 5.48MiB
160          mp4        256x144    144p   86k , avc1.4d400c, 15fps, video only, 2.11MiB
134          mp4        640x360    360p  139k , avc1.4d401e, 30fps, video only, 1.97MiB
133          mp4        426x240    240p  180k , avc1.4d4015, 30fps, video only, 4.30MiB
135          mp4        854x480    480p  300k , avc1.4d401f, 30fps, video only, 3.69MiB

Four audio and four video formats are available, in various codecs and qualities. We are just interested in audio. Let us pick the best audio quality and store it to the file song.webm by using the option -f followed by the format code.

$ youtube-dl -f 251 'https://www.youtube.com/watch?v=--EeaSYoH04' -o song.webm

Removing Vocals in the Center

A tiny tool we can use is SoX, “the Swiss Army knife of audio manipulation”, as they call it themselves. sox supports a lot of audio effects, and one of them, called oops, does exactly what we want. It remixes a stereo audio file to a file with two mono channels where each mono channel contains the difference between the two channels in the stereo file. If the vocals are in the center, they get extinguished as they are the same on the left and right channel, no difference.

There are a lot of limitations with this approach. Not every song has the vocals exactly in the center, and even more important: everything else in the center gets removed as well. This can lead to awful sounding results. Our example song works quite well. The only instrument, a guitar, stays unchanged and the vocals are nearly gone, only a faint echo remains.

sox does not support every audio codec and container format on the planet. We have to convert the file first with yet another Swiss Army knife, ffmpeg. It is my favorite tool when it comes to transcoding of media files.

We convert the song with ffmpeg to the wav format which sox understands.

$ ffmpeg -i song.webm song.wav

Next, we remove the vocals by applying the oops effect of sox and store the result to sound.wav.

$ sox song.wav sound.wav oops

Finally, we transcode the resulting wav file back to a webm file, using the opus codec with a bit rate of 48 kbps. You can choose a higher bit rate if you like.

$ ffmpeg -i sound.wav -c:a libopus -b:a 48k sound.webm

The last step is optional, but it reduces the file size considerably.

Have a listen to the file! Any decent media player should be able to play it. Otherwise, drag & drop it into an empty browser window to have a listen.

Splitting Vocals from Instruments

A different approach using artificial neural networks is used by Spleeter. It comes with pre-trained models for TensorFlow and is quite easy to use. We will use it to separate the vocals from everything else and therefore get the music without the vocals.

spleeter converts the input file internally with ffmpeg. We do not have to do it ourselves beforehand. The following command separates the vocals from the music and produces two wav files in the directory output. It downloads the model to a subfolder of the current directory when it runs the first time. Try to stay in this directory as it otherwise downloads the model every time.

$ spleeter separate -i song.webm -p spleeter:2stems -o output

You can optionally encode the wav file with ffmpeg as before.

This approach works for most of the songs I tested it with, but sometimes the resulting file contains quite some awful sounding artifacts. For our example song, I actually prefer the output of sox as it leaves the background vocals unchanged, and the faint echo of the main voice is also nice. In the end, it depends heavily on the song. spleeter works with a much larger variety of songs and does an amazing job for such a difficult task. Hats off!

There are similar tools out there like Open-Unmix which I have not tried yet.

Timed Lyrics

That was all quite easy, wasn’t it? Just invoking some commands. Well, now comes the tedious part.

The easiest way to get text displayed in a timely fashion is by using subtitles. There are multiple formats available. It all depends which media platform we want to target. The “cool kids” are on the web, aren’t they? So, let’s target HTML5.

The subtitle format for HTML5 is WebVTT. The specification is still just a draft and not done yet. Even more problematic, browser support is lacking a lot of the more interesting features like proper time-tag support. Styling with CSS is also hit and miss. It might work with some browsers but not with others. Therefore, I will focus only on the basic functionality which has support in all modern browsers.

Like all standard web formats, WebVTT is a text format and can be created with any text editor. Here is a basic example.

WEBVTT

00:01.000 --> 00:10.000
Hello, World!

00:10.000 --> 00:15.000
This is a WebVTT file.

Every WebVTT file has to begin with the string “WEBVTT” followed by a blank line. The main part of the file consists of a sequence of cues. Each cue is active for a certain timespan specified by a start time and an end time. During this timespan, it displays a text segment which can span multiple lines. Cues can overlap which means that multiple text segments are displayed at the same time. A blank line separates two cues from each other.

Well then, that’s all there is. Get the lyrics from the web, listen through the song and format the lyrics with timestamps accordingly. All you need is a text editor and some stamina to get through the tedious work.

I did it for the example song. You can download the file here.

Playback in HTML5

Unfortunately, there seems to be no way to get the audio-tag of HTML5 and WebVTT playing along nicely. It did not work in any of the browsers I tried. The only work around I found working consistently in all browsers was to add a video track to the audio file, resulting in a video file which works fine with the video-tag and WebVTT.

The following command creates a black image file which we will use as video image. You can use any other image.

$ convert -size 640x480 xc:black black.png

Next, we use ffmpeg to create a video from the image and our audio file without the vocals.

$ ffmpeg -loop 1 -i black.png -i sound.webm -c:v libvpx-vp9 -c:a copy -shortest karaoke.webm

We loop the image forever with -loop 1, the two input files follow. Then, we specify the video codec to be VP9 by using the encoder library libvpx-vp9. The audio should be just copied into the video file which we do with the option -c:a copy. Finally, we specify with -shortest that we want to stop encoding when one of the inputs ends. This option is required as the image loops forever.

Now, we can put it all together on a website with following HTML-code.

<video controls src="karaoke.webm">
	<track default src="lyrics.vtt">
</video>

You can try out the video file below, if you have JavaScript enabled in your browser. The lyrics file is added automatically.


Choose the video webm file:


Conclusion

It was interesting fiddling around with the different tools and formats. I learned a bunch of new things. I hope that tools like spleeter improve over time with more and better training data. The same goes for the WebVTT support in the browsers. Only the basic functionality is usable, but it would be nice to highlight the currently sung word on a line. The specification of WebVTT supports it, but no browser does.

Let us hope for a brighter future where we can sing along our favorite songs not caring what our neighbors might think about it.

References