SPHINX-II ACOUSTIC TRAINING STEPS

The binaries mentioned in some of the scripts can be found at /net/alf19/usr2/eht/s3/ in the CMU Speech domain.
The files correspond to the training of the Haitian-Creole acoustic models used in DIPLOMAT.

Step1:  Make MFC files from 'raw' files.
             Make 'mfc' files from 'raw' files using wave2mfcc. Most of the default parameters are standard; an example script is sketched below.
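
            A minimal sketch of such a script, assuming wave2mfcc takes SphinxTrain-style control-file flags (the flag names and the 16 kHz sampling rate are assumptions; check the usage message of the binary under /net/alf19/usr2/eht/s3/):

        #!/bin/sh
        # Step 1 sketch: convert the 'raw' audio listed in the control
        # file into 'mfc' cepstra. Flag names are assumptions modeled on
        # SphinxTrain conventions; verify against the real binary.
        bin=/net/alf19/usr2/eht/s3

        $bin/wave2mfcc \
            -c  train.ctl \
            -di raw -ei raw \
            -do mfc -eo mfc \
            -srate 16000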

Step2:  Create code book:
             Gather:  Generates vector streams like 'dcep', 'ddcep', and 'power' from the 'cep' files, then selects every kth vector in each feature stream (subsampling by a stride of 'k'). The subsampled streams are dumped into a single 'dmp' file.
            Cluster:  Cluster the vectors in the 'dmp' file with k-means into the desired number of clusters, generally 256. Each stream is clustered separately; all streams get the same number of clusters. The output is the means ('mean') and variances ('var') of the clusters. A sketch covering both sub-steps follows.
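
            A combined sketch of the two sub-steps. 'gather' and 'cluster' stand in for the actual binary names, and every flag shown is hypothetical; only the shape of the pipeline (derive streams, subsample, dump, k-means per stream) comes from the description above:

        #!/bin/sh
        # Step 2 sketch: dump subsampled feature streams, then k-means
        # them. Binary and flag names here are hypothetical.
        bin=/net/alf19/usr2/eht/s3

        # Gather: derive the dcep/ddcep/power streams from the 'cep'
        # files, keep every kth vector, and dump everything in one file.
        $bin/gather \
            -ctlfn  train.ctl \
            -cepdir mfc \
            -stride 10 \
            -dmpfn  train.dmp

        # Cluster: k-means each stream separately into 256 clusters,
        # writing the per-cluster means and variances (the code book).
        $bin/cluster \
            -dmpfn    train.dmp \
            -ndensity 256 \
            -meanfn   cb/means \
            -varfn    cb/variances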

Step3: Baum-Welch iteration (making ci-schmm models)
           Initialization:  Make a context-independent model definition file, ci-mdef, consisting of just the base phones (including filler phones). You also need a template that gives the number of states and an initial transition matrix. Initialization, using ci-mdef, sets up all transition matrices from the template and assigns initial mixture weights to the states in ci-mdef. The output is 'tmat' and 'mixw'.
          Baum-Welch:  Now it's time to run bw. You need the dictionary, the filler dictionary (filler-dict), and the lsn file, i.e. the transcript (remember, the lsn file should be in exactly the same order as the 'ctl' file). Each iteration of Baum-Welch consists of two steps: first, run bw-ci-schmm to get the reestimated means, vars, mixw, and tmat; second, update the original parameters with the reestimated ones. The second step is called 'normalization'. Continue iterating till the models converge; a sketch of the loop follows.
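
            A sketch of the loop, assuming bw and norm take SphinxTrain-style flags (all file and flag names below are assumptions; substitute your local scripts):

        #!/bin/sh
        # Step 3 sketch: one Baum-Welch pass plus normalization, run a
        # fixed number of times; in practice stop when the likelihood
        # reported by bw settles.
        bin=/net/alf19/usr2/eht/s3

        i=1
        while [ $i -le 10 ]; do
            # Pass 1: accumulate reestimation counts over the corpus.
            $bin/bw \
                -moddeffn ci-mdef \
                -dictfn   train.dic -fdictfn filler.dic \
                -ctlfn    train.ctl -lsnfn   train.lsn \
                -cepdir   mfc \
                -meanfn   hmm/means -varfn  hmm/variances \
                -mixwfn   hmm/mixw  -tmatfn hmm/tmat \
                -accumdir bwaccum

            # Pass 2 ('normalization'): fold the reestimated counts
            # back into the parameters for the next iteration.
            $bin/norm \
                -accumdir bwaccum \
                -meanfn hmm/means -varfn  hmm/variances \
                -mixwfn hmm/mixw  -tmatfn hmm/tmat
            i=`expr $i + 1`
        done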

Step4:  Force align (generating a better transcript)
            The original transcript may not be perfect, in the sense that it lacks fillers. Once some sort of models exist, force-align the reference transcript to get a transcript with fillers inserted. You may want to use this transcript to train further. This step can be performed at any appropriate time, generally after the models converge on the initial transcript. After performing this step, bw is run again over the new transcript.
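
            For instance (the filler tokens are illustrative; your filler dictionary defines the real set), a reference line in the lsn file such as

                mwen grangou (utt001)

            might come back from the aligner as

                <sil> mwen <sil> grangou <sil> (utt001)

            and this filled-in transcript is what the later bw runs train on.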

            To make cd-schmm, we need a cd-mdef file. To get an efficient cd-mdef file, we take the following path:
            - Make ci-dhmm.
            - Make cd-seen-mdef with all the triphones seen in the lsn file.
            - Train cd-dhmm using cd-seen-mdef.
            - Make decision trees using cd-dhmm, prune the trees, tie the states, and thus generate cd-tied-mdef. Use cd-tied-mdef as the cd-mdef for making cd-schmm.


Step5:  Making ci-dhmm from ci-schmm.
            A ci-dhmm can be simulated just by putting some constraints on the ci-schmm: if we run bw-ci-dhmm iterations with topn=1 and no mean or variance reestimation, we are simulating bw training of a ci-dhmm. Make a copy of 'tmat' and 'mixw' first, because these are overwritten by ci-dhmm training, and the original 'mixw' and 'tmat' are required as the starting point for cd-schmm training (Step 7). So we set topn=1 and disable mean and variance reestimation in bw, and likewise disable mean and variance normalization. Train till the models converge; a sketch follows.
            NOTE:  For discrete models, mean and variance reestimation stay disabled in bw throughout.
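
            A sketch of one constrained iteration, assuming SphinxTrain-style -topn/-meanreest/-varreest flags (the names are assumptions; the substance is topn=1 with the code book frozen):

        #!/bin/sh
        # Step 5 sketch: simulate discrete (ci-dhmm) training by
        # constraining the semi-continuous bw.
        bin=/net/alf19/usr2/eht/s3

        # Keep the schmm parameters safe: ci-dhmm training overwrites
        # 'mixw' and 'tmat', and Step 7 needs the originals.
        cp hmm/mixw hmm/mixw.schmm
        cp hmm/tmat hmm/tmat.schmm

        $bin/bw \
            -moddeffn ci-mdef \
            -ctlfn train.ctl -lsnfn train.lsn -cepdir mfc \
            -meanfn hmm/means -varfn  hmm/variances \
            -mixwfn hmm/mixw  -tmatfn hmm/tmat \
            -topn 1 \
            -meanreest no -varreest no \
            -accumdir bwaccum

        # norm likewise skips the mean/variance update: only mixw and
        # tmat are written back, so the code book stays untouched.
        $bin/norm -accumdir bwaccum -mixwfn hmm/mixw -tmatfn hmm/tmat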

Step6:  Making the context-dependent model definition file.
            First make a cd-seen-mdef file with all the triphones that appear in the lsn file; the script is sketched below. lsn2ptx generates the triphone sequence of the lsn file. Sort the triphones and add the base phones to the list. Then, from this complete phone list, mk_model_def generates the cd-seen-mdef file.
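
            A sketch of that pipeline, assuming lsn2ptx writes one triphone per line and mk_model_def takes a phone-list flag (the flag names and the 5-state count are assumptions):

        #!/bin/sh
        # Step 6 sketch: build cd-seen-mdef from the triphones actually
        # seen in the transcript.
        bin=/net/alf19/usr2/eht/s3

        # Triphone sequence of every transcript line, reduced to a
        # sorted, unique list.
        $bin/lsn2ptx -lsnfn train.lsn -dictfn train.dic > train.ptx
        sort -u train.ptx > triphones.list

        # Prepend the base phones (including fillers), then generate
        # the model definition file.
        cat base.phonelist triphones.list > all.phonelist
        $bin/mk_model_def \
            -phonelstfn all.phonelist \
            -moddeffn   cd-seen-mdef \
            -n_state_pm 5
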
            NOTE: If you look inside cd-seen-mdef, you will notice that all contexts of a given phone share the same transition matrix.
            cd-dhmm is initialized from ci-dhmm. Now, with the new 'mixw' and the cd-seen-mdef file, we run bw iterations till the cd models converge. Remember to keep topn=1 and no mean or variance reestimation, exactly as in Step 5.
            Now we tie similar states with the same index across all triphones that share a base phone. Suppose phone 'p' has 'n' possible triphones, namely (tp1, ..., tpn); we tie {tp1(state-i), tp2(state-i), ..., tpn(state-i)}. This is done by first making the decision trees, then pruning them, and finally tying the states using the pruned trees. A decision tree is made for each state of each base phone; e.g., if all the triphones of AA have 5 states, then AA gives 5 decision trees: AA-1, AA-2, AA-3, AA-4, AA-5. Pruning cuts the trees back based on the minimum occurrence count of a triphone. Tying generates the cd-tied-mdef file from the pruned decision trees. cd-tied-mdef is used as the cd-mdef for making cd-schmm. A sketch of the three passes follows.
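
            A sketch of the three passes, borrowing the SphinxTrain bldtree/prunetree/tiestate names (the binary names, flags, and the questions file are all assumptions for this older code base):

        #!/bin/sh
        # Step 6 sketch (continued): one decision tree per state of
        # each base phone, pruned on minimum triphone occurrence, then
        # state tying.
        bin=/net/alf19/usr2/eht/s3

        # 5-state models give AA-1 ... AA-5 for base phone AA.
        for phone in `cat base.phonelist`; do
            for state in 1 2 3 4 5; do
                $bin/bldtree \
                    -moddeffn cd-seen-mdef \
                    -mixwfn   cdhmm/mixw \
                    -psetfn   linguistic.questions \
                    -phone    $phone -state $state \
                    -treefn   trees/$phone-$state.dtree
            done
        done

        # Prune leaves whose triphones occur too rarely, then tie the
        # states of each base phone's triphones with the pruned trees.
        $bin/prunetree -itreedir trees -otreedir trees.pruned -minocc 10
        $bin/tiestate \
            -imoddeffn cd-seen-mdef \
            -omoddeffn cd-tied-mdef \
            -treedir   trees.pruned
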
            haf  haf  haf  .... !!


Now we pick up where we left off (Step 4).
Step7:  Making cd-schmm.
            We make the initial cd-schmm from the ci-schmm using cd-mdef: the parameters are copied from ci-schmm to cd-schmm by initialization, using the new cd-tied-mdef file. Now we run bw iterations with mean and variance reestimation enabled (unlike the cd-dhmm case, where topn=1 and no mean or variance reestimation). Run till the models converge; a sketch follows. Serve hot, garnish with smiles.
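
            A final sketch, with a hypothetical 'init' binary standing in for the initialization step (every name and flag below is an assumption):

        #!/bin/sh
        # Step 7 sketch: initialize cd-schmm from the ci-schmm
        # parameters saved in Step 5, then run full bw with means and
        # variances reestimated again.
        bin=/net/alf19/usr2/eht/s3

        # Copy the saved ci-schmm parameters into the tied cd model.
        $bin/init \
            -src_moddeffn ci-mdef        -dest_moddeffn cd-tied-mdef \
            -src_mixwfn   hmm/mixw.schmm -dest_mixwfn   cdhmm/mixw \
            -src_tmatfn   hmm/tmat.schmm -dest_tmatfn   cdhmm/tmat

        # Full reestimation this time, over the force-aligned transcript.
        $bin/bw \
            -moddeffn cd-tied-mdef \
            -ctlfn train.ctl -lsnfn train.lsn.falign -cepdir mfc \
            -meanfn hmm/means  -varfn  hmm/variances \
            -mixwfn cdhmm/mixw -tmatfn cdhmm/tmat \
            -meanreest yes -varreest yes \
            -accumdir bwaccum
        $bin/norm -accumdir bwaccum \
            -meanfn hmm/means  -varfn  hmm/variances \
            -mixwfn cdhmm/mixw -tmatfn cdhmm/tmat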


Send comments/suggestions to      dbansal@cs.cmu.edu