Speech recognition using HMM
Introduction
- Speech recognition is a challenging problem on which much work has been done over the last decades. Some of the most successful results have been obtained using hidden Markov models, as described by Rabiner (1989).
- A well-functioning generic speech recognizer would enable more efficient communication for everybody, but especially for children, illiterate people, and people with disabilities. A speech recognizer could also serve as a subsystem in a speech-to-speech translator.
- The speech recognition system implemented during this project trains one hidden Markov model for each word that it should be able to recognize. The models are trained with labeled training data, and the classification is performed by passing the features to each model and then selecting the best match.
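As a rough sketch of this decision rule (the word_models mapping and the log_likelihood method are illustrative names, not the project's actual API):

```python
def classify(features, word_models):
    """Pick the vocabulary word whose HMM best explains the features.

    word_models: dict mapping each word to its trained HMM; each model
    is assumed (hypothetically) to expose a log_likelihood method.
    """
    return max(word_models, key=lambda w: word_models[w].log_likelihood(features))
```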
Markov Chain Model
- A stochastic model that describes the probabilities of transitions among the states of a system.
- It is a random process that undergoes transitions from one state to another on a state space.
- The change of state depends probabilistically only on the current state of the system (the Markov property).
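As a toy illustration of this property, the sketch below simulates a three-state chain; the states and the transition matrix A are invented for the example:

```python
import numpy as np

# Hypothetical 3-state chain; A[i, j] = P(next state = j | current state = i).
states = ["sunny", "cloudy", "rainy"]
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.5, 0.3]])

rng = np.random.default_rng(0)
state = 0  # start in "sunny"
chain = [state]
for _ in range(10):
    # The next state depends only on the current one (Markov property).
    state = rng.choice(3, p=A[state])
    chain.append(state)

print([states[s] for s in chain])
```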
System Flow Chart
Feature Extraction
The source speech is quantized with 16 bits and sampled at 8000 Hz. The signal is divided into 80-sample frames, each equivalent to 10 milliseconds of speech, and the frames overlap their neighbours by 20 samples on each side. Because the vocal tract changes shape only slowly, the speech signal is, in theory, close to stationary over such a short period. The features are extracted in the frequency domain; before applying the fast Fourier transform, each frame is multiplied by a Hamming window to limit the spectral leakage caused by framing the signal.
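A minimal sketch of this front end in Python/NumPy, assuming one reading of the framing scheme (80-sample frames sharing 20 samples with each neighbour); the project's real feature set may include further processing beyond the raw spectra:

```python
import numpy as np

def frame_spectra(signal, frame_len=80, overlap=20):
    """Return the magnitude spectrum of each Hamming-windowed frame.

    Assumes frames of frame_len samples that share `overlap` samples
    with each neighbouring frame, i.e. a hop of frame_len - overlap.
    """
    step = frame_len - overlap                 # 60-sample hop between frames
    n_frames = 1 + (len(signal) - frame_len) // step
    window = np.hamming(frame_len)             # limits spectral leakage
    spectra = []
    for i in range(n_frames):
        frame = signal[i * step : i * step + frame_len]
        spectra.append(np.abs(np.fft.rfft(frame * window)))
    return np.array(spectra)
```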
Forward Algorithm
- Used to calculate a belief state: the probability of a state at a certain time, given the history of evidence
- Used to select the model (i.e., the word) that most likely generated the speech signal
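A standard scaled version of the forward algorithm for a discrete-emission HMM might look as follows; the discrete observations and variable names are illustrative, not the project's exact formulation:

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """Scaled forward algorithm for a discrete HMM.

    pi  : (N,)   initial state probabilities
    A   : (N, N) transition probabilities, A[i, j] = P(j | i)
    B   : (N, M) emission probabilities, B[i, k] = P(symbol k | state i)
    obs : sequence of observed symbol indices

    Returns log P(obs | model), the quantity used to rank the word
    models. Rescaling at each step avoids numerical underflow.
    """
    alpha = pi * B[:, obs[0]]
    log_prob = 0.0
    for t in range(1, len(obs)):
        c = alpha.sum()                # scale factor for this step
        log_prob += np.log(c)
        alpha = (alpha / c) @ A * B[:, obs[t]]
    log_prob += np.log(alpha.sum())
    return log_prob
```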
HMMs can perform three primary tasks:
- State Estimation P(S|O): useful if you have prior information about what the states mean and create the state probabilities yourself.
- Path Estimation: given the observations, what is the most likely state path? Not needed in our case, and not implemented here.
- Maximum Likelihood Estimation P(O|λ): learn the HMM parameters λ that maximize the probability of the observations. This is the primary method we will use.
- We will use the Baum-Welch algorithm, described next.
Baum-Welch algorithm
- The Baum-Welch algorithm is an iterative expectation-maximization (EM) algorithm that converges to a locally optimal solution from the initialization values.
- It is an EM algorithm for training the emission and transition probabilities of a hidden Markov model in a fully automated way.
- It can be employed whenever a set of training sequences is available, and provides a rigorous way to derive parameter values that are guaranteed to be at least locally optimal.
- The goal is to find the parameters λ that maximize the likelihood of the observations.
- The E-step consists of calculating the expected state-occupancy and transition counts for a fixed λ; the M-step then re-estimates λ from these expectations, as sketched below.
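For concreteness, here is a compact Baum-Welch sketch for a discrete-emission HMM trained on a single observation sequence; it illustrates the E- and M-steps described above and is not the project's actual implementation:

```python
import numpy as np

def baum_welch(obs, N, M, n_iter=50, seed=0):
    """Baum-Welch for a discrete HMM (single training sequence).

    obs  : sequence of observed symbol indices in [0, M)
    N, M : number of hidden states / observation symbols
    Returns pi, A, B that locally maximize P(obs | pi, A, B).
    """
    obs = np.asarray(obs)
    rng = np.random.default_rng(seed)
    # Random row-stochastic initialization; Baum-Welch only reaches a
    # local optimum, so the starting point matters.
    pi = rng.dirichlet(np.ones(N))
    A = rng.dirichlet(np.ones(N), size=N)
    B = rng.dirichlet(np.ones(M), size=N)
    T = len(obs)
    for _ in range(n_iter):
        # E-step: scaled forward and backward passes.
        alpha = np.zeros((T, N)); beta = np.zeros((T, N)); c = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
        gamma = alpha * beta                  # expected state occupancy
        gamma /= gamma.sum(axis=1, keepdims=True)
        xi = np.zeros((N, N))                 # expected transition counts
        for t in range(T - 1):
            x = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
            xi += x / x.sum()
        # M-step: re-estimate parameters from the expected counts.
        pi = gamma[0]
        A = xi / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B
```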
Results
- Time-series representation of the "banana" audio (amplitude vs. time)
- Accuracy of the model
- Confusion matrix for the model
Conclusion & Future Scope
- During this project a system for isolated-word speech recognition was implemented and tested. The cross-validation results are good for a single speaker.
- Two obvious extensions are better support for multiple speakers and support for continuous speech. The first step towards the former would be more, and more robust, features. For the latter, the simplest approach is probably to detect word boundaries and then proceed with an isolated-word recognizer, as sketched below.
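As a rough illustration of such word-boundary detection, a naive short-time-energy detector could look like the following; the threshold and frame size are guesses, and a practical system would need smoothing and hangover logic:

```python
import numpy as np

def word_boundaries(signal, frame_len=80, threshold_db=-30):
    """Naive energy-threshold endpoint detector (illustrative only).

    Returns (start, end) sample indices spanning the frames whose
    short-time energy is within threshold_db of the loudest frame.
    """
    n = len(signal) // frame_len
    frames = np.reshape(signal[: n * frame_len], (n, frame_len)).astype(float)
    energy_db = 10 * np.log10(np.maximum((frames ** 2).mean(axis=1), 1e-12))
    active = np.flatnonzero(energy_db > energy_db.max() + threshold_db)
    if active.size == 0:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len
```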