The following is a redacted interview between Greg (CEO @ HumanFirst) and Cobus Greyling; in this interview, we dig deep into the intersection of NLU and voice / ASR technologies.

[Greg] Great to chat with you again Cobus. As I mentioned during our last chat, we are huge fans of your blog posts at HumanFirst :) And our last conversation ended with me asking about ASR quality and how to solve for it… OK if we start from there?

Cobus: Great to speak with you again Greg. Indeed, let’s kick off with the ASR problem…

[Greg] Sounds good…so what options are available to developers and makers when it comes to ASR improvement?

Cobus: Thank you. Well, so much comes to mind with this question Greg. I just want to add a caveat right at the outset: I will mostly speak about voice or speech interfaces accessible via a telephone call. Dedicated devices like Google Home or Alexa are a whole different kettle of fish. But if it’s OK with you, let’s stick to more custom, non-device-dependent speech interfaces…

[Greg] That definitely makes sense, and there seems to be a growing market for more ubiquitous speech interfaces.

Cobus: Great, let’s start with languages, then move to acoustic models and then lastly quality measurements.

First, Automatic Speech Recognition, also referred to as Speech To Text, is basically the process of converting user speech into text. This text is subsequently sent to an NLP or NLU layer…and the problem of supported languages surfaces at the ASR level, of course.

For NLU there are many environments where provision is made for minority languages. Rasa comes to mind, where an NLU API can be created quite easily with an open and highly configurable API architecture: in this context a featurizer called the BytePairFeaturizer can be implemented for minority languages. So there is a real advantage to NLU, in that a very small set of training data is adequate in most cases. Even without language-specific embeddings or language models, good results can be achieved with limited training data on a universal data model. Rasa, Cognigy and IBM come to mind in this instance.

However, with ASR a large amount of training data is required and it is highly specialized, so in most cases you need to settle for the languages your ASR provider supports. This is why you will see newer environments not having their own in-house ASR and STT, and leveraging the larger providers like IBM, Microsoft, NVIDIA and others. Listening to the latest Voicebot.AI podcasts, the synthetic voice tools are growing rapidly, not only in number but also in capability, which is astounding.

I think the challenge is to get high transcription accuracy on things like names of products and services, i.e. named entities and the like. Most commercial ASR solutions, like Microsoft and IBM, allow for creating a machine learning model as an extension of their standard language model. This ML model can consist of text in the form of words and sentences.

The ideal though is to create an acoustic model with voice recordings containing real-world customer speech. These voice files in turn need to be human transcribed to text.

[Greg] This ML model you are referring to, how specialized is it?

Cobus: It really depends on the framework you use for the ASR. I guess there are frameworks which are highly technical and tricky. But with an environment like IBM or Microsoft it is very much a case of preparing your training audio files with the corresponding human transcribed text, uploading the data, and then running the training process via a web interface.

For example, a set of 20,000 customer recordings of roughly 10 seconds each can be used. These recordings need to be representative of customer speech in terms of accent, gender, age, ethnicity etc. These recordings then need to be accurately transcribed by humans. After processing, the audio and text are uploaded and the acoustic model is created.

A portion of data can be excluded from the training data for purposes of testing and vetting the accuracy of the acoustic model.
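The hold-out idea can be illustrated with a minimal Python sketch. It assumes the training data is a list of (audio file, human transcript) pairs; the file names, the 10% fraction and the function name `split_holdout` are hypothetical and not something described in the interview.

```python
import random


def split_holdout(pairs, test_fraction=0.1, seed=42):
    """Shuffle (audio_path, transcript) pairs and split off a test set.

    Returns (train_pairs, test_pairs). The test pairs are never
    uploaded for training; they are kept for checking accuracy.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(pairs)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names and transcripts, for illustration only.
recordings = [(f"call_{i:05d}.wav", f"transcript {i}") for i in range(20)]
train, test = split_holdout(recordings, test_fraction=0.1)
print(len(train), "training pairs,", len(test), "held-out pairs")
```

Shuffling before splitting is a deliberate choice here: if recordings are stored grouped by region or date, taking the first or last slice as the test set could leave some accents out of the test data entirely.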

[Greg] Can we talk about the current ecosystem?

Cobus: Sure…if I think of the current ecosystem, one of the most exciting developments currently is happening with NVIDIA Riva. NVIDIA has a solution with speech recognition and speech synthesis. They are also quite astute in the NLU and NLP area.

I guess their challenge currently lies with dialogue state management, for which they currently do not have a solution. NVIDIA does have a very good demo application, making use of Rasa machine learning stories and Google Dialogflow. These demos are really good at illustrating how flexible NVIDIA Riva actually is, in that these two very different dialogue state management and development systems can be used either separately or combined.

Moving on from NVIDIA, Microsoft and IBM are definitely up there right now.

Let me just track back and mention one more thing on NVIDIA…NVIDIA is really geared for edge installations, so in this case we are thinking of autonomous vehicles and the like. NVIDIA is really focusing on very short or minimal latency of only a few milliseconds. But I think the area that Riva also addresses very well is the question of face-speed.

I hope you don’t mind me spending time on this idea of face-speed. The first time I heard the term face-speed was from Fjord Design & Innovation. And it is really the idea that when we speak and communicate as humans face-to-face, we do not only converse using words. There is a much faster and much broader way of communicating taking place in parallel with speaking. When speaking we look for facial cues; we notice if someone is distracted, or if they are paying attention to the conversation or looking around. We read facial expressions to determine if the listener is interested, or if there is absence of mind, disbelief or surprise. So we pass these nuances along at high speed during really any conversation.

So these elements of Gesture Recognition, Lip Activity Detection, Object Detection, Gaze Detection and Sentiment Detection form an exciting part of their collection of functionality. Hence NVIDIA is poised to become a true Conversational Agent.

Back to IBM and Microsoft: both are also right up there with their speech interfaces in terms of technology; both are deployed as SaaS cloud solutions (I assume that for the right price an on-premise deployment is also possible). And both of these environments have got an augmented cloud environment to support voice solutions.

Microsoft has a comprehensive speech studio solution which is really a very astute speech recognition engine where you can train acoustic models to fit your specific accents, ethnicity, gender, ages, pronunciation traits and more.

Microsoft also has a very good neural voice: with a few hours of voice audio from a voice artist, you can create a custom owned branded synthesized voice; IBM has got solutions with similar capabilities.

I would say, there is movement in the market and a definite focus from solution providers on voice. This is evident with Google Dialogflow CX which is really focused on voice enablement for Call Centre solutions and automating traditional phone calls. And then there are also products like the Cognigy AI-based voice gateway for virtual voice agents to automate phone conversations.

[Greg] Let’s get into testing the model, what metrics or procedures are commonly used to test the improvement of the ASR?

Cobus: There is a common metric, called the word error rate, and there is another method I have used in the past.

Let’s look at word error rate first. The best way to test an acoustic model is to create a benchmark test data set: a smaller sample of recordings with human transcribed text. After the benchmark audio is transcribed by the acoustic model, the ASR machine transcribed text can be compared to the human transcribed text.

When the two sets of transcripts are compared, the additions, deletions and substitutions performed by the acoustic model are counted. This total is in turn divided by the total number of words in the human transcribed text, which gives the word error rate as a percentage.

It makes sense to calculate the word error rate against the base ASR model, and then again after the acoustic model is added. This helps to see the improvement introduced by the acoustic model.
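As a rough illustration of the calculation described above, here is a minimal Python sketch of word error rate based on word-level edit distance. It assumes the common convention of dividing by the number of words in the human transcript, and it lowercases and splits on whitespace only; the sample sentences and the function name `word_error_rate` are made up for the example.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate of an ASR hypothesis against a human reference.

    Counts the minimum number of substitutions, deletions and
    insertions (additions) needed to turn the reference into the
    hypothesis, divided by the number of reference words.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up transcripts: one human reference, two machine outputs.
human = "i want to check my account balance"
base_asr = "i want to check my count balance"      # 1 substitution
custom_asr = "i want to check my account balance"  # exact match

print(f"base model WER:   {word_error_rate(human, base_asr):.1%}")
print(f"custom model WER: {word_error_rate(human, custom_asr):.1%}")
```

In practice the per-utterance errors would be summed over the whole benchmark set before dividing, so that long and short recordings are weighted by their word counts.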

Another method I have used is to take the training data of the acoustic model, which consists of audio and human transcribed verbatims or text, and run the text sentences against the NLU model. Then use ASR to transcribe the training audio to text and run these machine transcriptions against the NLU model. From here the deviation can be calculated. I have seen a deviation of 4 to 5%, which is really encouraging.
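One possible reading of this deviation measure is the share of utterances where the NLU model predicts a different intent for the machine transcription than for the human transcription. The Python sketch below works under that assumption; `toy_classify` is a hypothetical keyword stand-in for a real NLU model, and the sentences are invented.

```python
def intent_deviation(classify, human_texts, asr_texts):
    """Fraction of utterances whose predicted intent changes when the
    human transcript is replaced by the ASR transcript."""
    if len(human_texts) != len(asr_texts):
        raise ValueError("transcript lists must be the same length")
    if not human_texts:
        raise ValueError("no transcripts supplied")
    mismatches = sum(
        classify(human) != classify(asr)
        for human, asr in zip(human_texts, asr_texts)
    )
    return mismatches / len(human_texts)


def toy_classify(text):
    """Hypothetical stand-in for an NLU model: keyword matching only."""
    text = text.lower()
    if "balance" in text:
        return "check_balance"
    if "card" in text:
        return "block_card"
    return "fallback"


# Invented examples; the second ASR line contains a misrecognition.
human_texts = ["what is my balance", "please block my card",
               "block my card now", "check balance please"]
asr_texts = ["what is my balance", "please block my car",
             "block my card now", "check balance please"]

deviation = intent_deviation(toy_classify, human_texts, asr_texts)
print(f"intent deviation: {deviation:.0%}")
```

Measuring at the intent level rather than the word level shows how much the transcription errors actually matter downstream: a misrecognized filler word costs nothing here, while a misrecognized keyword does.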

[Greg] There are really disparate word error rate percentages out there, and some claims seem too good to be true…

Cobus: Indeed, some word error rate percentages are seemingly too good to be true. There are really a few factors to consider. Firstly, the ethnicity, age, gender, regional accents and more, are prevalent in user speech. If this is niche, like in the case of South Africa, achieving higher accuracy is a challenge. Background noise is also a challenge. I would say the benchmark data can also be artificially cleaned for enhanced word error rate scores. But anything in the vicinity of 15% is exceptional for a locale not defined specifically for a language.

[Greg] How should background noise and silence be dealt with?

Cobus: …well…silence is easy to detect. Background noise is harder, and here we used a crude yet effective approach. [Laughs]. People use speech interfaces in noisy environments, with music and other media playing in the background.

So we found that the acoustic model transcribes noise in a very specific manner. The acoustic model defaulted to specific text when noise was detected instead of intelligible speech. In the end, our solution was easy: should we find a certain selection of text, it was highly likely the user was in a noisy environment, and we could advise the user to move to a quiet spot or call back later.
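A crude check of this kind could look like the Python sketch below. The interview does not say which text the acoustic model defaulted to, so the marker words, the 60% threshold and the function name `looks_like_noise` are all hypothetical placeholders; in practice they would come from inspecting real transcriptions of noisy calls.

```python
# Hypothetical placeholder words; the real set would be collected by
# looking at what the acoustic model outputs for noisy audio.
NOISE_MARKERS = frozenset({"uh", "hmm", "mm", "huh"})


def looks_like_noise(transcript, markers=NOISE_MARKERS, threshold=0.6):
    """Return True when most words in the transcript are known
    noise markers, suggesting a noisy environment."""
    words = [w.strip(".,!?") for w in transcript.lower().split()]
    words = [w for w in words if w]
    if not words:
        # An empty transcript is silence, which is handled separately.
        return False
    hits = sum(w in markers for w in words)
    return hits / len(words) >= threshold


for utterance in ["uh mm uh hmm", "i want to check my balance"]:
    if looks_like_noise(utterance):
        print("Noisy line: suggest moving to a quiet spot or calling back.")
    else:
        print("Pass to NLU:", utterance)
```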

[Greg] You mentioned NLU, are you of the opinion that a single or multiple NLU models should be used?

Cobus: Voice is a vastly different beast to text or chat. With chat you have spelling correction, abbreviations and the like. With voice you can have background noise, other people speaking in the background et cetera. Hence the transcribed text looks very different from human entered text.

My experience is that when a voicebot launches, learnings can be gleaned from any existing chat implementation, but the voicebot or voice interface really needs a separate NLU model.

As the acoustic model settles, and there is consistency in the transcription, it also brings stability and improved intent and entity recognition to the NLU model. I’m sure you won’t mind me saying so, but in this instance a tool like HumanFirst really comes into its own. The transcribed sentences can easily be analysed using HumanFirst. Since the framework does not need predefined intents, new clusters of intent and meaning can be detected.

HumanFirst as a product can actually play a far more crucial role in a voicebot than in a chatbot, mostly due to the open nature of the user interface.

[Greg] Thank you for that insight. Now the age old question, can a chatbot be repurposed as a voicebot?

Cobus: There are a few considerations when having an established chatbot and wanting to expand into a voicebot. The first being that speech is ephemeral, it evaporates and people do not pay attention. So with a voicebot the dialogues should be really short. With output it does make sense to have a mixed modality. So with speech it makes sense to leverage SMS or text, WhatsApp and other mediums as a way to send longer informational pieces to the user. Think of an Alexa Show device, where the input is speech, but the response is text, images and speech.

The second challenge I would say is that of invisible design affordances. With chat the medium really determines which design affordances are available. With Facebook Messenger there are quick replies, buttons, marquees, web views and more. The same can be said about web. Going to Telegram and WhatsApp the design affordances diminish considerably. With text or SMS they are virtually non-existent.

With voice there are really limited design affordances, and most are really invisible to the user.

Something else to keep in mind is that often with a new interface there are loose patterns of user behaviour. As users become familiar with an interface, these loose patterns of behaviour become established conventions. Designers and users alike adhere to these conventions. For instance, think of how we interact with a surface. We know how to tap, swipe, swipe up, swipe down et cetera.

Turn taking and interruption are seemingly not such a challenge in chat, but they are still a real problem to solve in voice.

And lastly silence. API integration touch points must have sub-second response times for voice applications. Silence during a conversation is a potential break point in the conversation. Users get the impression that the call has disconnected, or they start speaking which adds additional processing load on the system.

[Greg] You mentioned previously the emergence of seven vertical vectors in Conversational AI, is that a topic we could save for a follow-up conversation?

Cobus: Perhaps in short, the Seven Vertical Vectors cover training data, dialog management, NLU, personalisation, testing & QA, voice enablement and lastly conversation analysis. These verticals are often niche and some platforms target one or more of these vectors. As you said, this is a discussion for another day…[laughs]

[Greg] Indeed, let’s do this again soon, thanks again for your time. It is much appreciated.

Cobus: You are most welcome :)
