The cat may not have your tongue, but does a robot have your voice?
- Posted on June 10, 2019
- Estimated reading time 4 minutes
It’s getting easier to make voice fonts with new AI tech, which is now being provided as a service like many other cloud AI platforms. But what is a voice font, how is one made, and should we be worried?
A text font is the style in which text is presented to readers. The characters are the same, but their visual delivery can vary greatly. The same holds true of voice fonts, except this time the delivery is audible. Just as each of us has a written ‘font’ of our own, our handwriting, each of us also has a unique style of voice. With different accents, pitches, and personal quirks, our voices are a recognisable part of who we are, and form part of our identity.
Voice fonts also carry identity. It is irrefutable that the DECtalk synthetic voice used by Professor Stephen Hawking from the early 80s became part of his identity. He could have updated this voice over time, but he recognised it as part of his image, of his personal brand, which became more important with his rise to fame. The voice is actually based on recordings of Dr. Dennis Klatt, a senior research scientist in synthetic speech at MIT in the 70s and 80s. Today you would have to get permission from the estate of Stephen Hawking to reproduce the voice for a commercial purpose. You would not, however, require permission from Dr. Klatt’s estate. Hawking’s voice is both legally, and figuratively, part of his very recognisable identity.
The same now holds true for big brands. The voices of characters, spokespeople, famous cameos, and now AI assistants are part of brand campaigns and brand identity. A change to Alexa’s voice, for example, would be considered a rebranding, with all of the complexity that process involves.
To produce a really good voice for Cortana, Microsoft would have needed experienced data scientists, speech scientists, algorithms experts, and a voice actor with a huge amount of patience. The recordings would have to be split into syllables, matched to phonetics, and the system trained, normalised, tweaked, and tested through multiple iterations. Now Microsoft has unleashed all of this expertise. Still in preview, and available in a limited number of languages, their custom voice offering lets you simply upload recordings of a voice, together with transcripts of what was said, to create your own voice font. There are still some requirements for really good-quality synthesis, such as around eight hours of high-quality recordings in a low-noise environment, but the basic implementation is alarmingly easy and yields a somewhat recognisable result. The system uses a clever combination of existing cognitive services and some back-end Microsoft neural wizardry to create the voice font. Speech-to-text algorithms are run on the recordings and compared against the human-written transcript, so only the best and most recognisable parts are used, and funny little homophones can be accounted for. It isn’t yet possible to add emotion and feeling to these fonts, however. Voice fonts are still tackling basic communication rather than a full range of voice acting.
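The quality-filtering step described above, running speech-to-text over the training clips and keeping only those whose output closely matches the human transcript, can be illustrated with a toy sketch. This is not Microsoft’s actual pipeline; the word-error-rate threshold, the clip structure, and the function names are all assumptions made for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def select_training_clips(clips, max_wer=0.2):
    """Keep only clips where the STT output closely matches the transcript."""
    return [c for c in clips
            if word_error_rate(c["transcript"], c["stt_output"]) <= max_wer]

clips = [
    {"id": "clip1", "transcript": "the quick brown fox",
     "stt_output": "the quick brown fox"},       # clean recording
    {"id": "clip2", "transcript": "jumps over the lazy dog",
     "stt_output": "jumps over they hazy log"},  # noisy, mis-recognised
]
print([c["id"] for c in select_training_clips(clips)])  # → ['clip1']
```

A real system would work on audio alignments rather than plain strings, and could use the same comparison to spot homophones (words the recogniser hears differently from how they are written) before training.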
Any form of digital identity raises concerns. A quick Google will allow you to reproduce any phrase you wish in the style of Stephen Hawking. So what’s next? Could your voice identity be compromised with the increasingly realistic algorithms available to us all? This already renders some bank security systems questionable. ‘Your voice is your password’, but my voice is now also an emulator I can programme to say anything I please. And these aren’t just pre-calculated phrases: these cloud-based services can convert any text we type to speech in a matter of seconds, and can produce phrases that make contextual sense at the touch of a button. The other thing to consider is that, now that all we need to produce a voice font is samples, it will become quite easy to make voice fonts of other people without their knowledge, as long as enough usable recorded material exists. We’ve already seen it done with footage of famous faces, most notably politicians, appearing to perfectly mouth along to new audio; the video is very close to convincing, but the audio is still clearly identifiable as an impressionist or a low-quality synthesis. Avanade has been focused on digital ethics with our clients and has outlined specific actions organizations need to take today to address issues like voice fonts.
The ‘smell’ of natural gas is actually an odouriser added as a warning: the gas itself is colourless, odourless, and highly dangerous, and the smell communicates ‘this shouldn’t be here’. Nintendo coats its newest and smallest game cartridges with a foul-tasting additive, so that if one ends up in a child’s mouth it communicates ‘this shouldn’t be here’. So, do we need the same for AI? Some warning system that lets us know that the increasingly realistic conversations we will be engrossed in are, in fact, an artificial illusion.