The goal of this project was to create a privacy-preserving voice assistant to control my IoT devices, namely my lights and TV. Ever since I found out that even my smart lights were phoning home to a server all the way in China just to function, I've tried to keep my data as local as possible. I set up a Home Assistant-based server to control all of these devices locally, and to offer an interface for the voice assistant. The voice assistant itself transcribes short voice snippets to text after being woken by a wakeword, and sends that transcription over MQTT to the server, where my rudimentary intent engine parses it and the server dispatches the intended command to my devices. The voice assistant itself is fully independent, and does not need a continuous connection to any other computer to function.
There are many steps in Dexter's workflow - I'll go through each one, explaining what I did to make it functional.
1. Listen for wakeword with an always-on program
2. Listen for command and transcribe
3. Send command to local server for processing
4. Parse command and perform desired function
My main program is a Python script. It runs an infinite loop: it calls the wakeword detection program via a script, moves on to collect audio when the wakeword is detected, and finally transcribes that audio and sends the command to my server. Then the loop begins again with another call to the wakeword detection program.
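One pass of that loop can be sketched roughly as below. The helper callables are injected purely for illustration (the real script calls the wakeword program, arecord, IBM Watson, and MQTT directly), and all script and file names here are placeholders:

```python
def run_dexter_once(run_cmd, transcribe, publish):
    """One pass of the main loop. run_cmd blocks until the external
    program exits, which is how the wakeword stage gates everything else.
    All script/file names are illustrative placeholders."""
    run_cmd(["./wakeword.sh"])                 # returns only once the wakeword is heard
    run_cmd(["arecord", "-r", "44100", "-d", "3", "command.wav"])  # 3 s recording
    text = transcribe("command.wav")           # IBM Watson speech-to-text
    publish(text)                              # MQTT message to the server
```

In practice this sits inside a `while True:` loop, with `subprocess.run` as the blocking command runner.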
To get the device functional, I first needed to connect it to WiFi, and have it connect on boot. I initially tried this with a cron job and wpa_supplicant, but the cron job was unreliable. Instead, network-manager turned out to be a hassle-free solution for immediate wireless connectivity. Assigning the PocketBeagle a static local IP meant that I could always access it over SSH to use the command line, or simply connect to the Cloud9 IDE using a web browser.
As an aside - please don't use any Realtek-chipset WiFi card if you try to replicate this. It was incredibly painful to install community-compiled drivers to get it functional. Thankfully, I eventually found a GitHub project that took care of it.

Listen for Wakeword
For on-device wakeword detection, I'm using Porcupine, an incredibly lightweight program that is also designed to function on BeagleBone devices. First, we need to create an account at picovoice.ai (don't worry, it's free for personal use) to obtain an access key. While Porcupine runs fully locally, it does require the access key at initialization. Next, we need to train a model for the wakeword of our choosing on the Picovoice console. Dexter is an excellent choice for a wakeword because it doesn't contain sounds commonly found in other words, reducing the chance of false positives. The 'x' in Dexter (and even Alexa) is incredibly useful for this reason.
Immediately, this project gets quite hacky - the Python implementation of Porcupine hit major issues on the PocketBeagle, throwing all kinds of errors before I gave up on it. I tried the C implementation instead, and was able to get their command-line demo working. However, I have zero experience working with C, and was unable to fully leverage their SDK. The given demo file was designed to be called from the command line and run an infinite loop of hotword detection, printing 'Hotword detected' whenever it heard one. I modified the demo to instead exit once the hotword was detected, and called it from a Python script (which actually calls a bash script that calls the wakeword program). The Python script only progresses once the wakeword program ends, so this was a fully functional workaround for my unfamiliarity with C. A side effect is that I need a full Porcupine directory within the cloud9 folder, but that's no major issue.
To get the demo functional, it needed to be compiled with CMake and told which microphone to use; further instructions for both can be found on their GitHub page.

Listen for Command and Transcribe
The sub-workflow here is fairly simple: call an arecord process for 3 seconds to create a .wav file, then upload that .wav file to be processed by IBM Watson. Like Porcupine, IBM requires an account and an access key; further instructions on setting that up can be found on their website. The microphone hat I have records audio at 44100 Hz, an important argument to pass in the call to arecord, which I handle with a simple script call.
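A sketch of those two steps under some assumptions: the capture device name, API key, and service URL are placeholders you'd swap for your own, and the SDK imports are deferred so the response-parsing helper stands on its own without the ibm-watson package installed.

```python
import subprocess

def arecord_cmd(path="command.wav", seconds=3, rate=44100, device="plughw:1,0"):
    """Assemble the arecord call; the device name is an assumption,
    check `arecord -l` for yours."""
    return ["arecord", "-D", device, "-f", "S16_LE",
            "-r", str(rate), "-d", str(seconds), path]

def transcribe(wav_path, api_key, service_url):
    """Upload a .wav file to IBM Watson Speech to Text, return the text."""
    # SDK imported lazily so the parsing helper below has no dependency on it
    from ibm_watson import SpeechToTextV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
    stt = SpeechToTextV1(authenticator=IAMAuthenticator(api_key))
    stt.set_service_url(service_url)
    with open(wav_path, "rb") as audio:
        result = stt.recognize(audio=audio,
                               content_type="audio/wav").get_result()
    return extract_transcript(result)

def extract_transcript(result):
    """Pull the top transcript out of Watson's JSON response; '' if none."""
    alts = [r["alternatives"][0]["transcript"]
            for r in result.get("results", [])]
    return " ".join(alts).strip()
```

Recording is then just `subprocess.run(arecord_cmd(), check=True)`, followed by a call to `transcribe` with your IBM Cloud credentials.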
After the audio file is recorded and stored on the PocketBeagle, another Python program uploads it to IBM using their ibm-watson API, and waits to receive a transcription.

Send Commands to Local Server
Once the command has been converted to text, the system sends a message to the local server over the MQTT protocol. I've assigned a static local IP to the server, and used Mosquitto to set up an MQTT broker service. I won't try to explain how to set that up, as there are plenty of resources for that; I used the Home Assistant integration of Mosquitto to make it as painless as possible.
To send commands from Python, I used the paho-mqtt library.
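A minimal publishing sketch: the topic name and broker address are assumptions (match them to whatever your broker and server expect), and the constructor is the paho-mqtt 1.x style.

```python
import json

def make_message(text, topic="dexter/command"):
    """Pair the transcription with a topic; both are illustrative choices."""
    return topic, json.dumps({"text": text})

def send_command(text, host="192.168.1.50"):
    """Publish the transcribed command to the Mosquitto broker at the
    server's static local IP (placeholder address)."""
    import paho.mqtt.client as mqtt          # needs paho-mqtt installed
    client = mqtt.Client()                   # paho-mqtt 1.x constructor
    client.connect(host, 1883)               # default MQTT port
    topic, payload = make_message(text)
    client.publish(topic, payload)
    client.disconnect()
```

On the server side, anything subscribed to the same topic on the broker will receive the payload.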
Parsing Command and Performing Function

To understand the commands being sent, I wrote a rudimentary intent engine. For each statement, the engine first attempts to identify an entity (in my room: one of lights 1, 2, or 3, the TV, or a custom group like the whole room). Then it attempts to identify an intent - something like turn on, turn red, or get brighter. It does this by checking whether the command string contains an entity keyword and whether it contains an intent keyword, so the commands 'turn the TV on' and 'TV on' are functionally identical as far as the engine is concerned. By default, if no entity or intent is found, the entity is set to the whole room and the intent is set to turn the lights white.
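That matching logic can be sketched like this; the keyword tables below are illustrative stand-ins for the real engine's vocabulary:

```python
# Illustrative keyword tables; the real engine's vocabulary differs.
ENTITIES = {"light 1": "light1", "light 2": "light2", "light 3": "light3",
            "tv": "tv", "room": "room"}
INTENTS = {"off": "turn_off", "on": "turn_on", "red": "color_red",
           "brighter": "brightness_up", "white": "color_white"}

def parse(command):
    """Return (entity, intent) via plain substring matching, falling back
    to the whole-room / white defaults when nothing matches."""
    text = command.lower()
    entity = next((e for kw, e in ENTITIES.items() if kw in text), "room")
    intent = next((i for kw, i in INTENTS.items() if kw in text), "color_white")
    return entity, intent
```

Substring matching is what makes 'turn the TV on' and 'TV on' equivalent: both contain the keywords 'tv' and 'on', and everything else is ignored.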
This is an optional use of the parsed command - MQTT can be used in any way you see fit.
Using the system is straightforward - once the system boots up, you can log into the IDE via a browser, or SSH into a terminal with PuTTY. Then, running the RunDexter.sh script launches Dexter into a perpetual state of activeness. Saying 'Dexter' will activate the recording, indicated by the green LED. You'll then have three seconds to give a command, after which the green LED will turn off to indicate that recording is complete. The recording gets uploaded to IBM in the background, and the system takes care of the rest, executing commands as you give them.
One quirk of using AI-based speech recognition like IBM's is its tendency to look for complete phrases that make sense. Something like 'light purple' might not make sense to the model, which may return a ridiculous interpretation and trigger the 'room white' result because of the defaults above. A phrase more like 'turn the lights the color purple' has a higher chance of being correctly recognized. In my testing, however, the system behaved perfectly well as long as you spoke clearly and phrased your statements sensibly.

System Wiring
This was largely a coding project, involving many hacky scripts. However, the following image shows the wiring setup for the PocketBeagle.
In the future, I intend to expand the system's capabilities. As it stands, Dexter does not talk back to you, and cannot execute multiple commands in one statement. Both of these are something I'll be working on soon.