Microsoft VocalZoom
Siri, Cortana, Alexa – we’re all more and more reliant on voice recognition for everyday interaction with our devices. The intent is great. The execution? Not so much.
There are two big problems that voice recognition software needs to overcome: recognising words accurately enough to make search functions effective, and picking out an individual voice against a degree of ambient noise. Enter VocalZoom.
Microsoft set a team at its Artificial Intelligence and Research group the challenge of matching the low error rate of a human transcriber. They used a Long Short-Term Memory (LSTM) model, trained on 2,000 hours of speech data with a 30,000-word vocabulary, then tested it against human transcribers using National Institute of Standards and Technology (NIST) tests involving telephone conversations.
The results were outstanding. On the first test, in which strangers discuss a pre-arranged topic, human transcribers returned an error rate of 5.9% – as did the voice recognition system. On the second test, in which family members hold open-ended conversations, the Microsoft model did even better, scoring an 11.1% error rate as opposed to the human transcribers’ 11.3%. For the first time, voice recognition software achieved parity with humans.
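For readers wondering what an "error rate" means here: the article does not name the metric, but transcription tests of this kind are conventionally scored by word error rate (WER), the number of substituted, deleted and inserted words divided by the number of words in the reference transcript. The sketch below assumes that metric. The function name and the sample sentences are made up for illustration, and it is not Microsoft's or NIST's scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis (deletion)
                curr[j - 1] + 1,     # extra word in hypothesis (insertion)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: the transcriber drops "the" and mishears "weather".
reference = "we agreed to talk about the weather today"
hypothesis = "we agreed to talk about whether today"
wer = word_error_rate(reference, hypothesis)
print(f"{wer:.1%}")  # 2 errors over 8 reference words
```

On this reading, a 5.9% error rate means roughly six words in every hundred of the reference transcript were wrong, missing or spurious.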