SPHINX-II ACOUSTIC TRAINING STEPS

The binaries mentioned in some of the scripts can be found at /net/alf19/usr2/eht/s3/ in the CMU Speech domain.
The files correspond to the training of the Haitian-Creole acoustic models used in DIPLOMAT.

Step1:  Make MFC files from 'raw' files.
             Make 'mfc' files from 'raw' files using wave2mfcc. Most of the default parameters are standard; an example script is sketched below.
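
            A minimal sketch of such a script, assuming wave2mfcc takes SphinxTrain-style control-file flags (the flag names and the 16 kHz sampling rate are assumptions; check the usage message of the binary under /net/alf19/usr2/eht/s3/):

        #!/bin/sh
        # Step 1 sketch: convert the 'raw' audio listed in the control
        # file into 'mfc' cepstra. Flag names are assumptions modeled on
        # SphinxTrain conventions; verify against the real binary.
        bin=/net/alf19/usr2/eht/s3

        $bin/wave2mfcc \
            -c  train.ctl \
            -di raw -ei raw \
            -do mfc -eo mfc \
            -srate 16000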

Step2:  Create code book:
             Gather:  Generates vector streams like 'dcep', 'ddcep', and 'power' from the 'cep' files, then selects every kth vector in each feature stream (subsampling by a stride of 'k'). The subsampled streams are dumped into a single 'dmp' file.
            Cluster:  Cluster the vectors in the 'dmp' file with k-means into the desired number of clusters, generally 256. Each stream is clustered separately; all streams get the same number of clusters. The output is the means ('mean') and variances ('var') of the clusters. A sketch covering both sub-steps follows.
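
            A combined sketch of the two sub-steps. 'gather' and 'cluster' stand in for the actual binary names, and every flag shown is hypothetical; only the shape of the pipeline (derive streams, subsample, dump, k-means per stream) comes from the description above:

        #!/bin/sh
        # Step 2 sketch: dump subsampled feature streams, then k-means
        # them. Binary and flag names here are hypothetical.
        bin=/net/alf19/usr2/eht/s3

        # Gather: derive the dcep/ddcep/power streams from the 'cep'
        # files, keep every kth vector, and dump everything in one file.
        $bin/gather \
            -ctlfn  train.ctl \
            -cepdir mfc \
            -stride 10 \
            -dmpfn  train.dmp

        # Cluster: k-means each stream separately into 256 clusters,
        # writing the per-cluster means and variances (the code book).
        $bin/cluster \
            -dmpfn    train.dmp \
            -ndensity 256 \
            -meanfn   cb/means \
            -varfn    cb/variances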

Step3: Baum-Welch iteration (making ci-schmm models)
           Initialization:  Make a context-independent model definition file, ci-mdef, consisting of just the base phones (including filler phones). You also need a template that gives the number of states and an initial transition matrix. Initialization, using ci-mdef, sets up all transition matrices from the template and assigns initial mixture weights to the states in ci-mdef. The output is 'tmat' and 'mixw'.
          Baum-Welch:  Now it's time to run bw. You need the dictionary, the filler dictionary (filler-dict), and the lsn file, i.e. the transcript (remember, the lsn file should be in exactly the same order as the 'ctl' file). Each iteration of Baum-Welch consists of two steps: first, run bw-ci-schmm to get the reestimated means, vars, mixw, and tmat; second, update the original parameters with the reestimated ones. The second step is called 'normalization'. Continue iterating till the models converge; a sketch of the loop follows.
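
            A sketch of the loop, assuming bw and norm take SphinxTrain-style flags (all file and flag names below are assumptions; substitute your local scripts):

        #!/bin/sh
        # Step 3 sketch: one Baum-Welch pass plus normalization, run a
        # fixed number of times; in practice stop when the likelihood
        # reported by bw settles.
        bin=/net/alf19/usr2/eht/s3

        i=1
        while [ $i -le 10 ]; do
            # Pass 1: accumulate reestimation counts over the corpus.
            $bin/bw \
                -moddeffn ci-mdef \
                -dictfn   train.dic -fdictfn filler.dic \
                -ctlfn    train.ctl -lsnfn   train.lsn \
                -cepdir   mfc \
                -meanfn   hmm/means -varfn  hmm/variances \
                -mixwfn   hmm/mixw  -tmatfn hmm/tmat \
                -accumdir bwaccum

            # Pass 2 ('normalization'): fold the reestimated counts
            # back into the parameters for the next iteration.
            $bin/norm \
                -accumdir bwaccum \
                -meanfn hmm/means -varfn  hmm/variances \
                -mixwfn hmm/mixw  -tmatfn hmm/tmat
            i=`expr $i + 1`
        done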

Step4:  Force align (generating a better transcript)
            The original transcript may not be perfect, in the sense that it lacks fillers. Once some sort of models exist, force-align the reference transcript to get a transcript with fillers inserted. You may want to use this transcript to train further. This step can be performed at any appropriate time, generally after the models converge on the initial transcript. After performing this step, bw is run again over the new transcript.
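
            For instance (the filler tokens are illustrative; your filler dictionary defines the real set), a reference line in the lsn file such as

                mwen grangou (utt001)

            might come back from the aligner as

                <sil> mwen <sil> grangou <sil> (utt001)

            and this filled-in transcript is what the later bw runs train on.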

            To make cd-schmm, we need a cd-mdef file. To get an efficient cd-mdef file, we take the following path:
            - Make ci-dhmm.
            - Make cd-seen-mdef with all the triphones seen in the lsn file.
            - Train cd-dhmm using cd-seen-mdef.
            - Make decision trees using cd-dhmm, prune the trees, tie the states, and thus generate cd-tied-mdef. Use cd-tied-mdef as the cd-mdef for making cd-schmm.


Step5:  Making ci-dhmm from ci-schmm.
            A ci-dhmm can be simulated just by putting some constraints on the ci-schmm: if we run bw-ci-dhmm iterations with topn=1 and no mean or variance reestimation, we are simulating bw training of a ci-dhmm. Make a copy of 'tmat' and 'mixw' first, because these are overwritten by ci-dhmm training, and the original 'mixw' and 'tmat' are required as the starting point for cd-schmm training (Step 7). So we set topn=1 and disable mean and variance reestimation in bw, and likewise disable mean and variance normalization. Train till the models converge; a sketch follows.
            NOTE:  For discrete models, mean and variance reestimation stay disabled in bw throughout.
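
            A sketch of one constrained iteration, assuming SphinxTrain-style -topn/-meanreest/-varreest flags (the names are assumptions; the substance is topn=1 with the code book frozen):

        #!/bin/sh
        # Step 5 sketch: simulate discrete (ci-dhmm) training by
        # constraining the semi-continuous bw.
        bin=/net/alf19/usr2/eht/s3

        # Keep the schmm parameters safe: ci-dhmm training overwrites
        # 'mixw' and 'tmat', and Step 7 needs the originals.
        cp hmm/mixw hmm/mixw.schmm
        cp hmm/tmat hmm/tmat.schmm

        $bin/bw \
            -moddeffn ci-mdef \
            -ctlfn train.ctl -lsnfn train.lsn -cepdir mfc \
            -meanfn hmm/means -varfn  hmm/variances \
            -mixwfn hmm/mixw  -tmatfn hmm/tmat \
            -topn 1 \
            -meanreest no -varreest no \
            -accumdir bwaccum

        # norm likewise skips the mean/variance update: only mixw and
        # tmat are written back, so the code book stays untouched.
        $bin/norm -accumdir bwaccum -mixwfn hmm/mixw -tmatfn hmm/tmat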

Step6:  Making the context-dependent model definition file.
            First make a cd-seen-mdef file with all the triphones that appear in the lsn file; the script is sketched below. lsn2ptx generates the triphone sequence of the lsn file. Sort the triphones and add the base phones to the list. Then, from this complete phone list, mk_model_def generates the cd-seen-mdef file.
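
            A sketch of that pipeline, assuming lsn2ptx writes one triphone per line and mk_model_def takes a phone-list flag (the flag names and the 5-state count are assumptions):

        #!/bin/sh
        # Step 6 sketch: build cd-seen-mdef from the triphones actually
        # seen in the transcript.
        bin=/net/alf19/usr2/eht/s3

        # Triphone sequence of every transcript line, reduced to a
        # sorted, unique list.
        $bin/lsn2ptx -lsnfn train.lsn -dictfn train.dic > train.ptx
        sort -u train.ptx > triphones.list

        # Prepend the base phones (including fillers), then generate
        # the model definition file.
        cat base.phonelist triphones.list > all.phonelist
        $bin/mk_model_def \
            -phonelstfn all.phonelist \
            -moddeffn   cd-seen-mdef \
            -n_state_pm 5
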
            NOTE: If you look inside cd-seen-mdef, you will notice that all contexts of a given phone share the same transition matrix.
            cd-dhmm is initialized from ci-dhmm. Now, with the new 'mixw' and the cd-seen-mdef file, we run bw iterations till the cd models converge. Remember to keep topn=1 and no mean or variance reestimation, exactly as in Step 5.
            Now we tie similar states with the same index across all triphones that share a base phone. Suppose phone 'p' has 'n' possible triphones, namely (tp1, ..., tpn); we tie {tp1(state-i), tp2(state-i), ..., tpn(state-i)}. This is done by first making the decision trees, then pruning them, and finally tying the states using the pruned trees. A decision tree is made for each state of each base phone; e.g., if all the triphones of AA have 5 states, then AA gives 5 decision trees: AA-1, AA-2, AA-3, AA-4, AA-5. Pruning cuts the trees back based on the minimum occurrence count of a triphone. Tying generates the cd-tied-mdef file from the pruned decision trees. cd-tied-mdef is used as the cd-mdef for making cd-schmm. A sketch of the three passes follows.
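
            A sketch of the three passes, borrowing the SphinxTrain bldtree/prunetree/tiestate names (the binary names, flags, and the questions file are all assumptions for this older code base):

        #!/bin/sh
        # Step 6 sketch (continued): one decision tree per state of
        # each base phone, pruned on minimum triphone occurrence, then
        # state tying.
        bin=/net/alf19/usr2/eht/s3

        # 5-state models give AA-1 ... AA-5 for base phone AA.
        for phone in `cat base.phonelist`; do
            for state in 1 2 3 4 5; do
                $bin/bldtree \
                    -moddeffn cd-seen-mdef \
                    -mixwfn   cdhmm/mixw \
                    -psetfn   linguistic.questions \
                    -phone    $phone -state $state \
                    -treefn   trees/$phone-$state.dtree
            done
        done

        # Prune leaves whose triphones occur too rarely, then tie the
        # states of each base phone's triphones with the pruned trees.
        $bin/prunetree -itreedir trees -otreedir trees.pruned -minocc 10
        $bin/tiestate \
            -imoddeffn cd-seen-mdef \
            -omoddeffn cd-tied-mdef \
            -treedir   trees.pruned
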
            haf  haf  haf  .... !!


Now we pick up where we left off (Step 4).
Step7:  Making cd-schmm.
            We make the initial cd-schmm from the ci-schmm using cd-mdef: the parameters are copied from ci-schmm to cd-schmm by initialization, using the new cd-tied-mdef file. Now we run bw iterations with mean and variance reestimation enabled (unlike the cd-dhmm case, where topn=1 and no mean or variance reestimation). Run till the models converge; a sketch follows. Serve hot, garnish with smiles.
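
            A final sketch, with a hypothetical 'init' binary standing in for the initialization step (every name and flag below is an assumption):

        #!/bin/sh
        # Step 7 sketch: initialize cd-schmm from the ci-schmm
        # parameters saved in Step 5, then run full bw with means and
        # variances reestimated again.
        bin=/net/alf19/usr2/eht/s3

        # Copy the saved ci-schmm parameters into the tied cd model.
        $bin/init \
            -src_moddeffn ci-mdef        -dest_moddeffn cd-tied-mdef \
            -src_mixwfn   hmm/mixw.schmm -dest_mixwfn   cdhmm/mixw \
            -src_tmatfn   hmm/tmat.schmm -dest_tmatfn   cdhmm/tmat

        # Full reestimation this time, over the force-aligned transcript.
        $bin/bw \
            -moddeffn cd-tied-mdef \
            -ctlfn train.ctl -lsnfn train.lsn.falign -cepdir mfc \
            -meanfn hmm/means  -varfn  hmm/variances \
            -mixwfn cdhmm/mixw -tmatfn cdhmm/tmat \
            -meanreest yes -varreest yes \
            -accumdir bwaccum
        $bin/norm -accumdir bwaccum \
            -meanfn hmm/means  -varfn  hmm/variances \
            -mixwfn cdhmm/mixw -tmatfn cdhmm/tmat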


Send comments/suggestions to      dbansal@cs.cmu.edu