JR
I'm trying to do a survey of what's out there and what's popular right now in the field of speech recognition. Most of the open source toolkits I found use a multi-stage recognition system, usually with three different models.
HHR
Are you looking at speech synthesis or recognition?
JR
Recognition. You do get a bit of overlap in terms of the models, but I'm just looking at recognition. Most of these open toolkits use a system based around hidden Markov models, with roughly three stages. First there's the acoustic model, which usually looks at spectral features and things like this, tries to match them against fragments of sound that are smaller than phonemes, and then builds sequences of these sub-phonetic units that make sense as a phoneme. The next layer after that is phoneme-to-word translation. And then, finally, you have a language model which determines whether those words make sense next to each other in English. All three of these work together somehow, in order to limit the possibility space of the phoneme detector.
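To make that three-part decomposition concrete, here is a minimal Python sketch of how an acoustic score, a pronunciation lexicon and a language model combine into one hypothesis score. Everything in it (the phoneme symbols, the probabilities, the function names) is invented for illustration and does not correspond to any particular toolkit's API.

```python
import math

# 1. Acoustic model: log-probability of each phoneme given the audio segment
#    (in a real system these scores come from GMMs or a neural network).
acoustic_logprob = {
    "hh": -2.1, "eh": -1.8, "l": -1.5, "ow": -1.9,
    "w": -2.0, "er": -2.2, "d": -2.4,
}

# 2. Lexicon: phoneme sequence -> word (a tiny pronunciation dictionary).
lexicon = {
    ("hh", "eh", "l", "ow"): "hello",
    ("w", "er", "l", "d"): "world",
}

# 3. Language model: bigram log-probabilities over words.
lm_logprob = {
    ("<s>", "hello"): -0.7,
    ("hello", "world"): -0.5,
}

def score_hypothesis(phoneme_words, prev="<s>"):
    """Combine acoustic, lexicon and language-model scores for one hypothesis.

    phoneme_words is a list of phoneme tuples, each of which must appear in
    the lexicon. Returns the total log-probability (acoustic + language model).
    """
    total = 0.0
    for phones in phoneme_words:
        word = lexicon[phones]                              # lexicon lookup
        total += sum(acoustic_logprob[p] for p in phones)   # acoustic score
        total += lm_logprob.get((prev, word), -10.0)        # LM score, with a floor
        prev = word
    return total

print(score_hypothesis([("hh", "eh", "l", "ow"), ("w", "er", "l", "d")]))
```

In a real decoder these three scores are not applied one after another; the search explores them jointly, which is the interplay JR describes below.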
DP
Ok. So you have three parallel hidden Markov models?
JR
Well, two hidden Markov models: one is the language model and one is the acoustic model. And then there is an in-between layer which is a sort of translation dictionary. But it's not so easily separated into "this happens, then this happens, then this happens", because they interplay with each other in order to reduce the search complexity of the matching. I think they are using Gaussians, and matching those Gaussians against the features to get the probability of it being this phoneme or that phoneme or whatever.
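As a rough sketch of the Gaussian matching JR mentions: in a classic GMM-HMM acoustic model, each phoneme state has a Gaussian mixture, and every feature frame is scored against each state's mixture to see which phoneme it most plausibly belongs to. The mixture parameters below are random placeholders standing in for trained values.

```python
import numpy as np

def gmm_log_likelihood(frame, weights, means, variances):
    """Log-likelihood of one feature frame under a diagonal-covariance GMM.

    frame:     (D,) feature vector (e.g. 13 MFCCs)
    weights:   (K,) mixture weights summing to 1
    means:     (K, D) component means
    variances: (K, D) component variances (diagonal covariance)
    """
    diff = frame - means                                   # (K, D)
    log_comp = (
        -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
        - 0.5 * np.sum(diff ** 2 / variances, axis=1)
    )                                                      # per-component log-density
    # log-sum-exp over components, weighted by the mixture weights
    return float(np.logaddexp.reduce(np.log(weights) + log_comp))

rng = np.random.default_rng(0)
D, K = 13, 4                         # 13 MFCCs, 4 mixture components per state
frame = rng.normal(size=D)           # one (fake) feature frame

# One random, untrained mixture per phoneme state; real parameters would
# come from training on labelled speech.
states = {}
for phone in ["ah", "s", "t"]:
    w = rng.random(K)
    w /= w.sum()
    states[phone] = (w, rng.normal(size=(K, D)), rng.random((K, D)) + 0.5)

scores = {p: gmm_log_likelihood(frame, *params) for p, params in states.items()}
print(scores, "best:", max(scores, key=scores.get))
```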
HHR
I know that some systems need to be trained to recognize specific speakers, while other systems are supposed to be more speaker-independent. Is that correct? Because when I went through the code you wrote, I saw you are using the Mel-Frequency Cepstrum and so on. But if I'm not totally mistaken, with MFCCs the phonemes would look different from speaker to speaker. The formants are obviously in similar regions, that's why we can understand different speakers, but I don't think MFCCs just extract the formants. They capture everything from the spectrum, if I'm not mistaken.
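The transcript doesn't say which library the code in question uses, so purely as a sketch, this is what a standard 13-coefficient MFCC extraction looks like with librosa; the file name and frame parameters are placeholders.

```python
import librosa

# Hypothetical input file; the actual code discussed here may use a
# different library or different parameters.
audio, sr = librosa.load("speech_sample.wav", sr=16000)

# 13 MFCCs per frame, ~25 ms windows with a 10 ms hop, the conventional
# front-end setup for speech recognition.
mfcc = librosa.feature.mfcc(
    y=audio,
    sr=sr,
    n_mfcc=13,
    n_fft=int(0.025 * sr),      # 400 samples at 16 kHz
    hop_length=int(0.010 * sr), # 160 samples at 16 kHz
)
print(mfcc.shape)  # (13, number_of_frames)
```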
JR
Well, the way I understand it, it's a sort of representation that's almost like a second derivative of the time domain, so it's more about the rate of change of the spectrum and the different changes in the spectrum. It's very good for detecting these moments of change between steady states while speaking. A phoneme is usually broken into three pieces: a changing state, a steady state and another changing state. And those changing states vary depending on what sound comes before and what sound comes after. I think this is what they are really focusing on. I know with Sphinx they are taking something like 38 features, so their feature vector is much larger than the standard 12 or 13 that you would take from an MFCC. So they are taking some other stuff, but as far as I can tell it's all spectrum analysis. Probably also loudness contour and maybe pitch deviation or something like that.
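The "rate of change" JR describes is commonly captured by appending delta and delta-delta (acceleration) coefficients to the static MFCCs: 13 static + 13 delta + 13 delta-delta gives the widely used 39-dimensional vector, close to the figure quoted above. A minimal sketch, using librosa and the same placeholder file as before:

```python
import numpy as np
import librosa

# Same placeholder input as the earlier sketch.
audio, sr = librosa.load("speech_sample.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)

delta = librosa.feature.delta(mfcc)             # first-order (velocity) coefficients
delta2 = librosa.feature.delta(mfcc, order=2)   # second-order (acceleration) coefficients

# Stacking static, delta and delta-delta coefficients gives one
# 39-dimensional feature vector per frame.
features = np.vstack([mfcc, delta, delta2])
print(features.shape)  # (39, number_of_frames)
```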