cROVER: Context-augmented Speech Recognizer based on Multi-Decoders' Output
Abida, Mohamed Kacem
The growing need for reliable voice-based human-machine interfaces has inspired intensive research on voice-enabled systems, with greater robustness and reliability being sought for them. Speech recognition has become ubiquitous: automated call centers, smartphones, and dictation and transcription software are among the many systems currently being designed around it. The need for highly accurate and optimized recognizers has never been more crucial. The research community is actively developing techniques that combine existing feature extraction methods to capture information from the analog signal more reliably, and that enhance language and acoustic modeling to adapt better to unseen or distorted speech signal patterns. Most researchers agree that one of the most promising approaches to reducing the Word Error Rate (WER) in large-vocabulary speech transcription is to combine two or more speech recognizers and generate a new output, in the expectation that it has a lower error rate. The research work proposed here aims to further improve the performance of the well-known Recognizer Output Voting Error Reduction (ROVER) combination technique by integrating it with an error filtering approach. The proposed system is referred to as cROVER, for context-augmented ROVER. The principal idea is to flag erroneous words after the word transition networks have been combined, by scanning each slot of the resulting network. This step aims to eliminate some transcription errors and thus facilitate the voting process within ROVER. The error detection technique consists of spotting semantic outliers in a given decoder's transcription output.
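The filter-then-vote idea can be illustrated with a minimal Python sketch. It is not the thesis implementation: it assumes the word transition networks have already been aligned into slots (one candidate word per decoder), and that some error detector, not shown here, has produced a set of flagged (slot index, decoder index) pairs. The names `filter_slot`, `majority_vote`, and `combine` are hypothetical, and plain frequency voting stands in for ROVER's voting module.

```python
from collections import Counter

def filter_slot(slot, slot_index, flagged):
    """Drop candidates flagged as likely errors (assumed detector output).

    Falls back to the unfiltered slot if every candidate was flagged,
    so that the vote always has something to choose from.
    """
    kept = [w for d, w in enumerate(slot) if (slot_index, d) not in flagged]
    return kept if kept else list(slot)

def majority_vote(candidates):
    """Frequency-only vote; ties go to the earliest candidate."""
    counts = Counter(candidates)
    best = max(counts.values())
    for word in candidates:
        if counts[word] == best:
            return word

def combine(network, flagged):
    """Scan each slot, filter flagged words, then vote on what remains."""
    return [majority_vote(filter_slot(slot, i, flagged))
            for i, slot in enumerate(network)]

# Toy aligned network: three decoders, three slots (illustrative data only).
network = [
    ["the", "the", "the"],
    ["cat", "cat", "cap"],
    ["mat", "sat", "hat"],
]
# Hypothetical detector output: decoders 0 and 2 are semantic outliers in slot 2.
flagged = {(2, 0), (2, 2)}

print(combine(network, set()))    # voting without filtering
print(combine(network, flagged))  # voting after filtering
```

In the toy network, the last slot is a three-way tie that frequency voting alone cannot resolve meaningfully; removing the flagged candidates first leaves a single candidate for the vote.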
Because most error detection techniques suffer from a high false positive rate, we propose combining the error filtering techniques to compensate for the poor performance of each individual error classifier. Experimental results have shown that the proposed cROVER approach achieves a relative WER reduction of almost 10% through adequate combination of speech decoders. The approaches proposed here are generic enough to be used with any number of speech decoders and with any type of error filtering technique. A novel voting mechanism has also been proposed. This confidence-based voting scheme was inspired by the cROVER approach. The main idea is to use the confidence scores collected from the contextual analysis when scoring each word in the transition network. The new voting scheme outperformed ROVER's original voting by up to 16% in terms of relative WER reduction.
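A confidence-based vote of this kind, and the relative WER reduction used to report the results, can be sketched as follows. This is an illustration under stated assumptions, not the scoring function of the thesis: each decoder's candidate is assumed to carry a context-derived confidence in [0, 1], and the score interpolates word frequency with average confidence using a hypothetical weight `alpha`, in the spirit of ROVER's frequency-and-confidence scoring. All names and numbers are illustrative.

```python
from collections import defaultdict

def confidence_vote(slot, alpha=0.5):
    """Pick the word with the best blend of frequency and confidence.

    slot  : list of (word, confidence) pairs, one per decoder
    alpha : assumed trade-off; 1.0 means frequency only, 0.0 confidence only
    """
    n = len(slot)
    counts = defaultdict(int)
    conf_sum = defaultdict(float)
    for word, conf in slot:
        counts[word] += 1
        conf_sum[word] += conf

    def score(word):
        freq = counts[word] / n                  # share of decoders proposing the word
        avg_conf = conf_sum[word] / counts[word]  # mean contextual confidence
        return alpha * freq + (1 - alpha) * avg_conf

    return max(counts, key=score)

def relative_wer_reduction(baseline_wer, new_wer):
    """Relative reduction: (baseline - new) / baseline."""
    return (baseline_wer - new_wer) / baseline_wer

# Two decoders agree on a low-confidence word; one proposes a high-confidence word.
slot = [("mat", 0.2), ("mat", 0.3), ("sat", 0.9)]
print(confidence_vote(slot, alpha=1.0))  # frequency only
print(confidence_vote(slot, alpha=0.5))  # frequency blended with confidence

# Made-up WER values, only to show the arithmetic of a relative reduction.
print(relative_wer_reduction(0.20, 0.18))
```

With `alpha=1.0` the sketch reduces to frequency voting, so the majority word wins; lowering `alpha` lets a confident minority candidate overturn the majority. A relative reduction is measured against the baseline WER, so a drop from 20% to 18% WER is a 10% relative reduction, not a 2% one.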