The binaries mentioned in some of the scripts can be found at /net/alf19/usr2/eht/s3/
in the CMU speech domain.
The files correspond to the training of the Haitian-Creole acoustic
models used in DIPLOMAT.
Step1:
Make 'mfc' files from 'raw' files.
Convert the 'raw' audio files into 'mfc' files using wave2mfcc.
Most of the default parameters are standard. Here
is an example script.
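As a rough illustration of what this step computes (wave2mfcc itself is the tool to use), here is a minimal Python sketch that turns a headerless 16-bit 'raw' file into cepstral vectors; the sample rate, endianness, and frame sizes are assumptions, and librosa stands in for the SPHINX front end.

    # Sketch only: assumes 16 kHz little-endian 16-bit PCM 'raw' audio and
    # 13-dimensional cepstra computed over 25 ms windows every 10 ms.
    import numpy as np
    import librosa

    def raw_to_mfc(raw_path, sample_rate=16000, n_cep=13):
        # 'raw' files are headerless PCM, so read the samples directly.
        pcm = np.fromfile(raw_path, dtype="<i2").astype(np.float32) / 32768.0
        cep = librosa.feature.mfcc(y=pcm, sr=sample_rate, n_mfcc=n_cep,
                                   n_fft=400, hop_length=160, win_length=400)
        return cep.T  # one n_cep-dimensional 'cep' vector per frame

    # cep = raw_to_mfc("utt001.raw")   # hypothetical file name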
Step2:
Create the code book:
Gather:
Generate the vector streams 'dcep', 'ddcep', and 'power' from the
'cep' files. Select every kth vector in each feature stream (i.e., downsample
each stream by the stride 'k'). The downsampled streams are dumped into a single 'dmp'
file. Here is a script.
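To make the derived streams and the downsampling concrete, here is a small numpy sketch; the exact dcep/ddcep/power definitions used by the SPHINX-III tools differ somewhat, so treat the stream formulas below as placeholders.

    # Simplified sketch of the 'gather' idea: derive extra streams from the
    # cepstra, keep every k-th frame of each stream, and dump the result.
    import numpy as np

    def gather(cep, k=10):
        # cep: (n_frames, 13) array whose column 0 is c0 (log energy).
        dcep = np.gradient(cep[:, 1:], axis=0)          # delta cepstra
        ddcep = np.gradient(dcep, axis=0)               # delta-delta cepstra
        power = np.stack([cep[:, 0],
                          np.gradient(cep[:, 0]),
                          np.gradient(np.gradient(cep[:, 0]))], axis=1)
        streams = {"cep": cep[:, 1:], "dcep": dcep, "ddcep": ddcep, "power": power}
        return {name: s[::k] for name, s in streams.items()}   # keep every k-th vector

    # dumped = gather(np.load("all_cep.npy"))      # hypothetical aggregated cepstra
    # np.savez("streams_dmp.npz", **dumped)        # stand-in for the 'dmp' file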
Cluster:
Cluster the vectors dumped in the 'dmp' file using k-means into the desired number of
clusters, generally 256. Different streams are clustered separately, and all
streams have the same number of clusters. The output is the means ('mean') and
variances ('var') of the clusters. Here is a script.
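A minimal sketch of this clustering stage, with scikit-learn's k-means standing in for the SPHINX clustering tool; 256 clusters per stream, and per-cluster means and diagonal variances as the outputs.

    # Sketch of codebook building: each stream is clustered separately into
    # 256 clusters; the returned arrays stand in for the 'mean'/'var' files.
    import numpy as np
    from sklearn.cluster import KMeans

    def make_codebook(streams, n_clusters=256):
        means, variances = {}, {}
        for name, vectors in streams.items():
            km = KMeans(n_clusters=n_clusters, n_init=1, random_state=0).fit(vectors)
            means[name] = km.cluster_centers_
            # Diagonal variance of the vectors assigned to each cluster.
            variances[name] = np.stack([vectors[km.labels_ == c].var(axis=0)
                                        for c in range(n_clusters)])
        return means, variances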
Step3:
Baum-Welch iteration:
(making ci-schmm models)
Initialization: Make a context-independent model definition
file, ci-mdef, consisting of just the base phones
(including filler phones). You also need a template
which contains the number of states and an initial transition matrix.
Initialization, using ci-mdef, sets up
all transition matrices according to the template; it also assigns initial
mixture weights to the states in ci-mdef. The output is 'tmat' and 'mixw'.
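As a rough picture of what this initialization produces, the sketch below copies the template transition matrix to every CI phone and gives every state uniform mixture weights over the codewords; the array shapes (5 states, 4 streams, 256 codewords) are assumptions, not the tool's actual file layout.

    # Sketch of flat CI initialization: every phone gets the template
    # transition matrix, every state gets uniform mixture weights.
    import numpy as np

    def init_ci_params(n_phones, template_tmat, n_states=5,
                       n_streams=4, n_codewords=256):
        # 'tmat': one copy of the template per CI phone.
        tmat = np.tile(template_tmat, (n_phones, 1, 1))
        # 'mixw': uniform weights for every (state, stream) pair.
        mixw = np.full((n_phones * n_states, n_streams, n_codewords),
                       1.0 / n_codewords)
        return tmat, mixw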
Baum-Welch: Now it's time to run bw. You need the
dictionary, the filler dictionary,
and the lsn file (transcript). Remember,
the lsn file must be in exactly the same order as the 'ctl'
file. Each iteration of Baum-Welch consists of two steps: first, run bw-ci-schmm
and get the reestimated means, vars, mixw, and tmat; second,
update the original parameters with the reestimated parameters.
This second step is called 'normalization'. Continue
iterating until the models converge.
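The 'normalization' step is essentially turning the counts accumulated by bw (possibly over several parts of the corpus) back into properly normalized parameters for the next iteration. A hedged numpy sketch of that idea, not the actual norm tool:

    # Sketch of one 'normalization' pass: sum the accumulated counts and
    # renormalize each distribution so it sums to one along its last axis.
    import numpy as np

    def normalize(accumulated_parts):
        # accumulated_parts: list of dicts holding 'mixw' and 'tmat' count arrays.
        mixw_counts = sum(p["mixw"] for p in accumulated_parts)
        tmat_counts = sum(p["tmat"] for p in accumulated_parts)
        mixw = mixw_counts / np.maximum(mixw_counts.sum(axis=-1, keepdims=True), 1e-12)
        tmat = tmat_counts / np.maximum(tmat_counts.sum(axis=-1, keepdims=True), 1e-12)
        return {"mixw": mixw, "tmat": tmat}   # parameters for the next bw iteration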
Step4:
Force align:
(generating a better transcript)
The original transcript may not be perfect in the sense that
it may lack fillers. Once reasonable models have been trained, force-align
the reference transcript to get a transcript with fillers. You may want
to use this transcript for further training. This step can be performed
at any appropriate time, generally after the models converge on the initial
transcript. After performing this step, bw is run again over this new transcript.
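Before rerunning bw on the force-aligned transcript, it is worth rechecking the earlier ordering requirement: the lsn entries must line up with the ctl file. A small sketch of that check, assuming the usual conventions that each lsn line ends with the utterance id in parentheses and each ctl line starts with the utterance path:

    # Sketch: verify that the (force-aligned) lsn file is in the same order
    # as the ctl file before running bw on it.
    import os

    def check_order(lsn_path, ctl_path):
        with open(lsn_path) as f:
            lsn_ids = [line.rsplit("(", 1)[1].rstrip(")\n ")
                       for line in f if line.strip()]
        with open(ctl_path) as f:
            ctl_ids = [os.path.basename(line.split()[0])
                       for line in f if line.strip()]
        if len(lsn_ids) != len(ctl_ids):
            raise ValueError("lsn and ctl have different numbers of entries")
        for i, (a, b) in enumerate(zip(lsn_ids, ctl_ids)):
            if a != b:
                raise ValueError(f"order mismatch at entry {i}: lsn={a} ctl={b}")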
To make the cd-schmm, we need a cd-mdef file. To get an efficient
cd-mdef file, we take the following path:
-Make ci-dhmm.
-Make cd-seen-mdef with all the triphones seen in the lsn file.
-Train cd-dhmm using cd-seen-mdef.
-Make decision trees using cd-dhmm, prune the trees, tie the states, and thus
generate cd-tied-mdef. Use cd-tied-mdef as the cd-mdef for making cd-schmm.
Step6:
Making the context-dependent model definition file.
First make a cd-seen-mdef file with all the triphones that appear
in the lsn file. The script to do this is here.
lsn2ptx generates the triphone-sequence of
the lsn file. Sort the triphones and add the base phones to the list.
Now, from this complete triphone list,
mk_model_def generates the cd-seen-mdef
file.
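A sketch of the triphone-listing idea behind lsn2ptx and mk_model_def (not those tools themselves): expand each transcript line into phones via the dictionary, record every phone together with its left and right context, then sort the list and prepend the base phones. The real tools also track word-position information, which this sketch ignores; the tiny dictionary entry in the usage comment is hypothetical.

    # Sketch of building the cd-seen-mdef triphone list from an lsn file.
    def seen_triphones(lsn_lines, pron_dict, sil="SIL"):
        seen = set()
        for line in lsn_lines:
            words = line.rsplit("(", 1)[0].split()   # drop the trailing "(utt_id)"
            phones = [sil] + [p for w in words for p in pron_dict[w]] + [sil]
            for left, base, right in zip(phones, phones[1:], phones[2:]):
                seen.add((base, left, right))        # one seen triphone
        base_phones = sorted({p for pron in pron_dict.values() for p in pron})
        return base_phones, sorted(seen)

    # bases, tris = seen_triphones(["BONJOU (utt001)"],
    #                              {"BONJOU": ["B", "ON", "JH", "U"]})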
NOTE: If you look in the cd-seen-mdef, you will notice that all contexts of any
phone share the same transition matrix.
The cd-dhmm is initialized from the ci-dhmm.
Now, with the new 'mixw' and the cd-seen-mdef file, run bw iterations until
the cd-models converge. Remember to keep topn=1 and to disable mean and variance
reestimation.
Now we tie similar states: states with the same index across all triphones that share
a base phone. Suppose phone 'p' has 'n' possible triphones, namely (tp1,...,tpn);
we tie {tp1(state-i), tp2(state-i), ..., tpn(state-i)}. This is done by
first making the decision trees, then pruning
the decision trees, and finally tying the states
using the pruned decision trees. A decision tree is made for each state
of a base phone. For example, if all the triphones of AA have 5 states, then
AA will give 5 decision trees: AA-1, AA-2, AA-3, AA-4, AA-5. Pruning
trims all the decision trees based on the minimum occurrence count of a triphone.
Tying generates the cd-tied-mdef file from the pruned decision trees.
The cd-tied-mdef is used as the cd-mdef for making the cd-schmm.
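The real trees are grown from linguistic questions and likelihood gains; the sketch below only illustrates the pruning-and-tying bookkeeping on one already-built tree for one state of one base phone, with a made-up node structure and counts.

    # Simplified sketch: prune a per-state decision tree by a minimum
    # triphone occurrence count, then read tied-state (senone) ids off the
    # surviving leaves.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        count: float                    # occupation count of triphones below
        triphones: list                 # triphones that reach this node
        yes: Optional["Node"] = None    # child when the question is answered yes
        no: Optional["Node"] = None

    def prune(node, min_count):
        # Collapse any split whose children fall below the occurrence threshold.
        if node.yes is None:            # already a leaf
            return node
        node.yes, node.no = prune(node.yes, min_count), prune(node.no, min_count)
        if node.yes.count < min_count or node.no.count < min_count:
            node.yes = node.no = None   # merge the children back into this node
        return node

    def tie(node, next_id=0, mapping=None):
        # Assign one tied-state id per remaining leaf.
        mapping = {} if mapping is None else mapping
        if node.yes is None:
            for tri in node.triphones:
                mapping[tri] = next_id
            return next_id + 1, mapping
        next_id, mapping = tie(node.yes, next_id, mapping)
        return tie(node.no, next_id, mapping)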