How do you listen and speak to a user in a web-based app? There are several options for both. Let’s take a closer look at them and at the ones I used.

Listening

There are multiple solutions for converting human speech into text. I will list a few for a brief overview:

Some of them are paid, usually once you have used up a set amount of free credit.

The Mk2 prototype of Chef used the Web Speech API in Google Chrome. This means the app can run on smartphones with Chrome (Android only, unfortunately – Web Speech is not available on iOS devices), on Android tablets, and on notebooks and desktops with Chrome installed.
The recognition works surprisingly well. Chef logs the recognised words to the browser console, so during testing users can see what the app actually picked up.
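
To make the listening part more concrete, here is a minimal sketch of how a page might use the Web Speech API to pick up a command and log it to the console. It is not Chef's actual code: the command list, the function names and the `en-GB` language setting are assumptions for illustration, and the small interfaces only describe the parts of the API the sketch touches.

```typescript
// Minimal local shapes for the parts of the Web Speech API used below, so the
// sketch does not depend on a particular TypeScript DOM library version.
interface RecognitionAlternative { transcript: string; }
interface RecognitionResultLike { 0: RecognitionAlternative; }
interface RecognitionEventLike {
  resultIndex: number;
  results: ArrayLike<RecognitionResultLike>;
}
interface RecognitionLike {
  lang: string;
  continuous: boolean;
  interimResults: boolean;
  onresult: ((event: RecognitionEventLike) => void) | null;
  start(): void;
}

// Hypothetical command vocabulary; the real app's commands are not listed here.
const COMMANDS: string[] = ["next", "back", "repeat"];

// Pure helper: returns the first command from the list that appears as a
// whole word in the recognised phrase, or null when none does.
function matchCommand(transcript: string, commands: string[] = COMMANDS): string | null {
  const words = transcript.toLowerCase().split(/[^a-z]+/).filter(Boolean);
  for (const command of commands) {
    if (words.indexOf(command) !== -1) return command;
  }
  return null;
}

// Browser-only wiring. Returns false when speech recognition is unavailable.
function startListening(onCommand: (command: string) => void): boolean {
  const g = globalThis as any;
  // Chrome exposes the constructor under the webkit prefix.
  const Ctor = g.SpeechRecognition || g.webkitSpeechRecognition;
  if (!Ctor) return false;

  const recognition: RecognitionLike = new Ctor();
  recognition.lang = "en-GB";         // assumed language
  recognition.continuous = true;      // keep listening after each phrase
  recognition.interimResults = false; // only report final results

  recognition.onresult = (event) => {
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      if (!result) continue;
      const heard = result[0].transcript;
      console.log("heard:", heard); // visible in Chrome's console
      const command = matchCommand(heard);
      if (command) onCommand(command);
    }
  };
  recognition.start();
  return true;
}
```

Keeping `matchCommand` separate from the browser wiring means the matching logic can be tried out without a microphone, for example `matchCommand("okay, next step please")` gives `"next"`.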

Chrome’s console of Chef recognising a voice command “next”

I discovered that Chrome asks for permission to use the user’s microphone every time the app starts listening if the app is not served over SSL. SSL is an encryption technology that secures the communication between a website and its visitor; banking websites are a typical example. For Chef, I installed an SSL certificate on my hosting, which is why the app URL starts with https://. See the illustration to find out how it works.
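
One way to act on this in the page itself is to check the URL scheme before the microphone is used and send visitors to the HTTPS address. The sketch below is only an illustration: the function names are made up, treating localhost as acceptable is an assumption based on how browsers usually handle local development, and the redirect only helps if the hosting already serves the app over HTTPS.

```typescript
// True when the page URL uses HTTPS. localhost is also accepted, on the
// assumption that the browser treats it as a secure origin during development.
function isSecurePageUrl(pageUrl: string): boolean {
  return /^https:\/\//i.test(pageUrl) ||
    /^http:\/\/(localhost|127\.0\.0\.1)(:\d+)?(\/|$)/i.test(pageUrl);
}

// Builds the HTTPS equivalent of an HTTP URL; other URLs are left untouched.
function toHttps(pageUrl: string): string {
  return pageUrl.replace(/^http:\/\//i, "https://");
}

// Browser-only wiring: redirect to HTTPS before any listening starts.
function ensureSecurePage(): void {
  const g = globalThis as any;
  if (!g.location) return; // not running in a browser
  const href: string = g.location.href;
  if (!isSecurePageUrl(href)) {
    g.location.replace(toHttps(href));
  }
}
```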

Chrome microphone usage popup

Speaking

For speaking, the situation is similar – there are dozens of TTS (Text To Speech) services online. Some are paid, some are free, and some provide free credits to get started.

I started off with a Scottish voice from a company called Cereproc. Matthew Aylett from Cereproc gave us a talk on speech synthesis back in 3rd year, which was quite inspiring. Our lecturer Graham connected me with him, so I even got some personal advice from him and more credits. Thank you, guys!

In keeping with the informal tone of voice used throughout my app, I wanted the Cereproc voice to sound more encouraging and relaxed. Cereproc offers SSML (Speech Synthesis Markup Language) support – the user can modify the way the voice speaks with tags that alter its emotion, rate, pitch and other properties.
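
To give an idea of what such markup looks like, here is a small sketch that wraps a sentence in standard SSML `prosody` tags to change rate and pitch. It only uses elements from the W3C SSML specification; vendor-specific emotion tags are left out because their exact names depend on the service, and the helper names and default language are assumptions for illustration.

```typescript
interface ProsodyOptions {
  rate?: string;  // e.g. "slow", "medium", "fast"
  pitch?: string; // e.g. "low", "medium", "high"
}

// Escapes characters that would otherwise break the XML markup.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wraps plain text in an SSML document, optionally inside a <prosody> element.
function toSsml(text: string, options: ProsodyOptions = {}, lang: string = "en-GB"): string {
  const attrs: string[] = [];
  if (options.rate) attrs.push(`rate="${escapeXml(options.rate)}"`);
  if (options.pitch) attrs.push(`pitch="${escapeXml(options.pitch)}"`);

  const body = escapeXml(text);
  const inner = attrs.length > 0
    ? `<prosody ${attrs.join(" ")}>${body}</prosody>`
    : body;

  const open = `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="${escapeXml(lang)}">`;
  return `${open}${inner}</speak>`;
}
```

For example, `toSsml("Well done!", { rate: "slow", pitch: "low" })` produces a `<speak>` document whose content is `<prosody rate="slow" pitch="low">Well done!</prosody>`. How much of this a given TTS service honours varies, so the result needs to be checked by ear.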

SSML

I tested SSML for that purpose, but it didn’t add much extra value. The same could be said about IBM Watson’s speech API. Watson doesn’t alter emotions as such, but it offers great control over many mathematical aspects of the voice. In the end, Google Speech Synthesis was used because of the pleasant tone of its British male voice, with no SSML applied. The documentation describes the use of SSML, so that is possible with Google’s solution as well. If there is enough time before our final deadline, I would be happy to try it out 🙂
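
For readers who want to try something similar, here is a minimal sketch of speaking a sentence with a British male voice through the browser's speech synthesis interface. It assumes the voice is reached via the Web Speech API in Chrome and that it is listed under the name "Google UK English Male"; both are assumptions for illustration rather than a description of how Chef is wired up, and the function names are made up.

```typescript
// Minimal local shape for the voice properties the sketch looks at.
interface VoiceLike {
  name: string;
  lang: string;
}

// Pure helper: prefer a voice by exact name, otherwise fall back to the first
// voice for the given language, otherwise return null.
function pickVoice<V extends VoiceLike>(voices: V[], preferredName: string, fallbackLang: string): V | null {
  for (const voice of voices) {
    if (voice.name === preferredName) return voice;
  }
  for (const voice of voices) {
    if (voice.lang === fallbackLang) return voice;
  }
  return null;
}

// Browser-only wiring. Returns false when speech synthesis is unavailable.
function say(text: string): boolean {
  const g = globalThis as any;
  if (!g.speechSynthesis || !g.SpeechSynthesisUtterance) return false;

  const utterance = new g.SpeechSynthesisUtterance(text);
  const voices = g.speechSynthesis.getVoices() as VoiceLike[];
  // Assumed voice name; the browser default is used if nothing matches.
  const voice = pickVoice(voices, "Google UK English Male", "en-GB");
  if (voice) utterance.voice = voice;

  g.speechSynthesis.speak(utterance);
  return true;
}
```

One thing to watch for: `getVoices()` can return an empty list until the browser has loaded its voices, so in practice the first call to `say` may need to wait for the `voiceschanged` event.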