Building a Voice Activated Digital Assistant - Part 1

            TS: "J.A.R.V.I.S., are you up?"
            J:  "For you sir, always."
            TS: "I'd like to open a new project file, index as: Mark 2."
            J:  "Shall I store this on the Stark Industries' central database?"

Who wouldn't love to have an intelligible conversation with their computer and make it do work just by asking? However, to have a conversation with a computer, the computer must first be able to hear us. A simple microphone will enable the system to hear us; that's the easy part. But how will it understand this speech? And once it understands, how will it do our bidding?

To understand speech, we first need to convert the digital audio into text, then parse that text and break the spoken sentence down into simple commands. The former is called speech recognition, and the latter intent parsing. In speech recognition, we transcribe an audio clip to text. There are several ways to do this; one of the most commonly used methods today uses machine learning.
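The two stages above can be sketched as a toy pipeline. Everything here is hypothetical and purely for illustration: `transcribe` is a stub standing in for a real ASR engine, and `to_command` is a trivial stand-in for an intent parser.

```python
# A toy sketch of the two stages described above: speech recognition
# (audio -> text) followed by intent parsing (text -> command).
# Both functions are hypothetical stubs, purely for illustration.

def transcribe(audio_clip: bytes) -> str:
    # A real ASR engine would decode the audio here; we pretend
    # it heard the following sentence.
    return "open a new project file"

def to_command(text: str) -> str:
    # A real intent parser would break the sentence into a command.
    if "open" in text and "project" in text:
        return "create project"
    return "unknown"

print(to_command(transcribe(b"...raw audio bytes...")))   # create project
```

The point of splitting the pipeline this way is that each stage can be swapped out independently: a better ASR engine or a smarter parser slots in without changing the other half.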

Spoken language is one of the more complex interactions. You can ask your friend for a pendrive in several ways, all of which mean the same thing to a human, but which "sound" remarkably different to a computer: "Pass me that pendrive", "Can I borrow that pendrive?", "May I have that pendrive?", "Is that pendrive available for a few minutes?", "Mind if I take that pendrive for a while?" This is where an intent parser comes into play: all of these sentences and questions should result in the same output - "obtain pendrive".
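A minimal keyword-based intent parser illustrates the idea. This is a deliberately naive sketch (a real parser would do far more), but it shows how all five phrasings collapse into one command:

```python
# A minimal keyword-based intent parser: every phrasing that mentions
# the pendrive maps to the same command, "obtain pendrive".

def parse_intent(sentence: str) -> str:
    words = sentence.lower().strip("?.!").split()
    if "pendrive" in words:
        return "obtain pendrive"
    return "unknown"

phrasings = [
    "Pass me that pendrive",
    "Can I borrow that pendrive?",
    "May I have that pendrive?",
    "Is that pendrive available for a few minutes?",
    "Mind if I take that pendrive for a while?",
]

for p in phrasings:
    print(parse_intent(p))   # "obtain pendrive" for all five
```

Real intent parsers match against trained phrase patterns rather than single keywords, but the contract is the same: many surface forms in, one intent out.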

Let us set intent parsing aside for the time being and look into the speech recognition part. There is plenty of speech recognition software available in the market these days - but what's the fun in that? We want something that's open source. ASR has been a particularly prickly thorn bush on Linux. While Windows speech recognition software has been doing the rounds since the early 2000s, it has never been easily available on Linux systems. Only with the easy availability of machine learning tools like TensorFlow has ASR started seeing fast growth on Linux.

There are several options available - CMU Sphinx, Julius, Natl, Jasper, Vedics and more; some active, others inactive. However, the two of particular interest to us are the Kaldi Speech Recognition Toolkit (through the Vosk API) and DeepSpeech. The former has its origins at Johns Hopkins University, and the latter is supported by the Mozilla Foundation.

Let's look into the details - the merits and demerits of each.

Vosk API


The Vosk API is available to install via pip (for more details, you can look at their website):

            $ pip3 install --user --upgrade vosk

And voilà, you're done, just like that! However, we are missing one thing - a model. The Vosk API works with deep learning models, so we need a good model for the ASR engine to work. So, let's get a very simple model (~54 MB when unpacked):

            $ wget
            $ unzip
            $ mv vosk-model-small-en-us-0.3 model

There are other models which can be downloaded from here.


The Vosk API is very simple and easy to use - at least if you have some grounding in Python. With as few as 15 lines of code, you can have your speech recognizer up and running.

            #!/usr/bin/env python3

            import wave

            from vosk import Model, KaldiRecognizer, SetLogLevel

            # Open the audio file (must be a mono PCM WAV)
            wf = wave.open( "trial.wav", "rb" )
            rec = KaldiRecognizer( Model( "model" ), wf.getframerate() )

            while True:
                data = wf.readframes(4000)
                if len( data ) == 0:
                    break

                if rec.AcceptWaveform( data ):
                    print( rec.Result() )
                else:
                    print( rec.PartialResult() )

            print( rec.FinalResult() )

Save the above lines in a Python file. Grab a WAV file (it should be mono PCM), save it as "trial.wav", and you can perform speech recognition at once:

              $ time python3

If everything is set up properly, you'll get some output. In my case it was this:

            hi this is a mic test
            open for apps                 <- Should have been: OPEN CORE APPS
            i love que te                 <- Forgiven, but technically it's QT
            ____                          <- Missed this word: OH
            not                           <- Forgiven, but technically it's NAUGHT

The audio clip was 15.47s long, and our ASR took 13.05s to decode it in my case.
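A common way to quote decoding speed is the real-time factor (RTF): decoding time divided by audio duration. Plugging in the figures above gives roughly 0.84, i.e. the recognizer runs slightly faster than real time:

```python
# Real-time factor: decoding time / audio duration.
# Values below 1.0 mean the recognizer keeps up with live audio.
audio_len = 15.47    # seconds of audio in the clip above
decode_time = 13.05  # seconds the recognizer took

rtf = decode_time / audio_len
print(f"RTF = {rtf:.2f}")   # RTF = 0.84
```

An RTF below 1.0 matters for a voice assistant: it means the engine can, in principle, transcribe a live microphone stream without falling behind.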