Voice rec is terrifying

November 15, 2017

Voice recognition technology is entering a phase that goes far beyond consumer convenience.

In a world obsessed with Internet privacy, it’s surprising how little we talk about always-listening devices like the Amazon Echo. After all, a company that wants to learn intimate details about your life in order to sell you more stuff has a microphone permanently fired up in your kitchen.

If you own an Echo and weren’t aware of this feature, open up your Alexa app, select the “Settings” menu, and then select “History.” Take a listen. Were all of those recordings intended for the Echo?

I guess privacy is the price of convenience in modern consumerism. And things are about to get a whole lot more convenient.

Cacophonies, cocktail parties, convenience, and Christmas

XMOS is a fabless semiconductor company that spun out of the University of Bristol to focus on voice and music processing ICs. Among those ICs, devices based on the 32-bit xCORE MCU architecture have had notable success in the voice recognition market, delivering 16 programmable cores (partitioned into two tiles of eight cores, each tile with its own shared address space) with DSP functions integrated on the same chip.

XMOS recently parlayed the xCORE architecture into the VocalFusion 4-Mic Dev Kit for Amazon’s Alexa Voice Service (AVS). The kit is designed around the VocalFusion XVF3000 integrated far-field voice processor and four high signal-to-noise-ratio (SNR) MEMS microphones from Infineon (Figure 1). XMOS claims the kit is the first far-field linear microphone array solution available on the market.


Figure 1. The XMOS VocalFusion 4-Mic Dev Kit for Amazon’s Alexa Voice Service (AVS) is based on the XVF3000 integrated far-field voice processor and a linear MEMS microphone array from Infineon.

Range aside, far-field voice processing gets really interesting when it takes on the “cocktail party” problem: situations in which a platform needs to isolate the voice of a single speaker in a noisy environment. At distances of 5 m or more, the VocalFusion 4-Mic Dev Kit uses a combination of acoustic echo cancellation (AEC), adaptive beamforming, dynamic de-reverberation, and automatic gain control (AGC) to isolate and extract the voice signal of a primary speaker. Beyond this is where things start to get spooky.
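To get a feel for what beamforming buys you, here is a minimal sketch of the simplest version of the idea: a fixed delay-and-sum beamformer on a four-microphone linear array. It is not the XVF3000's adaptive algorithm, which XMOS does not spell out here. The 33 mm microphone spacing, the 2 kHz test tone, and the function names are all assumptions chosen for illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature


def steering_delays(num_mics, spacing_m, angle_deg):
    """Arrival delay in seconds at each microphone of a linear array for a
    plane wave coming from angle_deg (0 degrees = straight ahead)."""
    s = math.sin(math.radians(angle_deg))
    return [m * spacing_m * s / SPEED_OF_SOUND for m in range(num_mics)]


def array_gain(freq_hz, source_deg, steer_deg, num_mics=4, spacing_m=0.033):
    """Gain (0 to 1) of a delay-and-sum beamformer steered toward steer_deg
    for a pure tone arriving from source_deg.

    num_mics and spacing_m are hypothetical values, not taken from the kit.
    """
    arrival = steering_delays(num_mics, spacing_m, source_deg)
    compensation = steering_delays(num_mics, spacing_m, steer_deg)
    # Each microphone contributes a phasor; after the steering delays are
    # removed, the phasors line up only when source_deg matches steer_deg.
    total = sum(
        cmath.exp(-2j * math.pi * freq_hz * (a - c))
        for a, c in zip(arrival, compensation)
    )
    return abs(total) / num_mics


if __name__ == "__main__":
    # Steer at a talker 30 degrees off axis and sweep an interfering source.
    for angle in range(-90, 91, 30):
        print(angle, round(array_gain(2000.0, angle, 30.0), 3))
```

A source in the steered direction passes at full gain while sound from other angles partially cancels. The same math also shows a limit of linear arrays: a source at 150 degrees produces the same delays as one at 30 degrees, so the array cannot tell front from back.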

Earlier this year, XMOS acquired Setem Technologies, Inc. of Boston, MA, which develops massive Fourier transforms for blind-source signal separation. These blind-source separation algorithms mathematically pull the elements of individual source signals out of a set of mixed signals and then reconstruct those sources, either individually or as groups (Figure 2). In voice recognition this can be applied to an individual speaker, or even a conversation.

Figure 2. Setem Technologies, now a part of XMOS, develops blind-source separation algorithms that can be used to isolate a speaker or speakers in noisy environments.
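Blind-source separation sounds like magic until you see a toy version. The sketch below is not Setem's method, which is described as Fourier-transform based. It swaps in a much simpler textbook technique: a two-channel independent component analysis that whitens the recordings and then searches for the rotation whose outputs look least Gaussian. The two synthetic "voices" (a sine and a sawtooth), the mixing weights, and every function name are assumptions for illustration, and real rooms add echoes and delays that this instantaneous-mixing model ignores.

```python
import math


def center(x):
    """Subtract the mean so the signal averages to zero."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def whiten(x1, x2):
    """Decorrelate two zero-mean signals and scale each to unit variance."""
    n = len(x1)
    a = sum(v * v for v in x1) / n
    b = sum(u * v for u, v in zip(x1, x2)) / n
    d = sum(v * v for v in x2) / n
    # Closed-form eigen-decomposition of the covariance [[a, b], [b, d]].
    phi = 0.5 * math.atan2(2.0 * b, a - d)
    c, s = math.cos(phi), math.sin(phi)
    lam1 = a * c * c + 2.0 * b * c * s + d * s * s
    lam2 = a * s * s - 2.0 * b * c * s + d * c * c
    z1 = [(c * u + s * v) / math.sqrt(lam1) for u, v in zip(x1, x2)]
    z2 = [(-s * u + c * v) / math.sqrt(lam2) for u, v in zip(x1, x2)]
    return z1, z2


def excess_kurtosis(z):
    """Fourth-moment measure of non-Gaussianity (0 for a Gaussian)."""
    n = len(z)
    var = sum(v * v for v in z) / n
    return (sum(v ** 4 for v in z) / n) / (var * var) - 3.0


def separate(x1, x2, steps=90):
    """Return two estimated sources, up to order, sign, and scale."""
    z1, z2 = whiten(center(x1), center(x2))
    best = None
    for k in range(steps):
        t = (math.pi / 2.0) * k / steps
        c, s = math.cos(t), math.sin(t)
        y1 = [c * u + s * v for u, v in zip(z1, z2)]
        y2 = [-s * u + c * v for u, v in zip(z1, z2)]
        score = abs(excess_kurtosis(y1)) + abs(excess_kurtosis(y2))
        if best is None or score > best[0]:
            best = (score, y1, y2)
    return best[1], best[2]


def correlation(a, b):
    """Pearson correlation, used here only to check the result."""
    a, b = center(a), center(b)
    num = sum(u * v for u, v in zip(a, b))
    return num / math.sqrt(sum(u * u for u in a) * sum(v * v for v in b))


# Two hypothetical talkers and two microphones that each hear a blend of both.
n = 2000
voice_a = [math.sin(2.0 * math.pi * 0.013 * i) for i in range(n)]
voice_b = [2.0 * ((0.0473 * i) % 1.0) - 1.0 for i in range(n)]
mic_1 = [1.0 * a + 0.6 * b for a, b in zip(voice_a, voice_b)]
mic_2 = [0.5 * a + 1.0 * b for a, b in zip(voice_a, voice_b)]

out_1, out_2 = separate(mic_1, mic_2)
for name, out in (("out_1", out_1), ("out_2", out_2)):
    print(name, [round(abs(correlation(out, v)), 3) for v in (voice_a, voice_b)])
```

The algorithm never sees voice_a or voice_b, only the two blended microphone signals, which is what makes the separation "blind." The correlation check at the end is there purely to confirm that each output lines up with one of the original talkers.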

Now, in theory (and perhaps in practice), blind-source separation can be used to isolate the voice frequencies of multiple speakers in a room, and thereby establish a biometric identity for each. As you can imagine, the application of such technology could be widespread, and not just in the sense that Amazon wants to know what every member of your family wants for Christmas. Surveillance, for instance, immediately comes to mind.

This takes us back to the VocalFusion 4-Mic Dev Kit’s linear microphone array. While many platforms such as the Amazon Echo and Google Home use a circular array of omni-directional microphones to provide 360-degree room coverage, a linear array is designed for 180-degree arcs. This is of interest because leaders in the voice recognition space envision a future where the tower-based virtual assistants of today recede into everyday objects like TVs, refrigerators, sofas, walls – you name it.

This future is designed to be ultra-convenient, delivering service by the syllable. But be careful. You probably won’t know who, or what, is listening.