Assignment 3

Due: Monday, Feb. 12

In this assignment, you will implement the "degenerate EM" algorithm for determining optimal weights for linearly interpolating a set of language models. "Optimality" is defined relative to a common held-out set.

Here are 3 "mystery language models", PA, PB, PC.  In fact, you are not given the models, but only their values (i.e. probability stream) on a common held-out set:  STREAMA, STREAMB, STREAMC. The files are in the following format: the i'th line contains one single floating point number which is the estimate the model assigns to the ith word.

(0)  Write a program that reads the probability streams and a set of initial weights, and runs the algorithm to convergence.  Convergence can be defined with respect to the weights themselves, or the likelihood of the held-out set according to the combined model.  Use the latter.  Specifically, keep track of the average (per word) log-likelihood of the held-out set according to the combined model, and stop when the change is less than 0.1%.    A detailed description of the algorithm appears here. This algorithm is efficient enough to be implemented in perl or matlab.

(1)  What if any of the models assigns zero to any of the words?  Under what conditions, if any, is this a problem?

(2)  Start from a (truly) randomly chosen set of weightsA,B,C (make sure they sum to one: choose 3 random points uniformly between 0 and 1, then normalize them).  Run your algorithm to convergence.  For each iteration, print the weights, the average log-likelihood, and the ratio of the last two avg-log-likelihoods.  Watch that the average log-likelihood grows with each iteration!  This is mathematically guaranteed, and can be used as a "sanity check".

(3) Start again from another random starting point, and compare the final outcome.

(4) These mystery models (as you may have guessed) happen to be ML-trained 1gram, 2gram and 3gram (not necessarily in this order!), trained on a common training set.  Note that they are not smoothed!
    - Try to guess, from the optimal weights alone, which model is which.
    - Now look at the probability streams themselves, and guess again.  Were you correct the first time? 

(5) What do you expect would happen is you added the uniform model as a 4th component?  Try it (V=4000).   What actually happened?

(6)  What do you think will happen if the same model is represented twice in the interpolation? Try this with STREAMA+STREAMB+STREAMA.  Try several different initial weights.

The following two questions are not about the optimal weights or the algorithm:

(7*)  Plot a 3D graph of average log-likelihood as a function of A, B. Since the lambdas are constrained to sum to one and to lie in the range [0,1], the domain of the graph (x and y axis) should be the simplex triangle: . A description how to do this in Matlab will be appear on the bboard.

(8*) Prove that, when interpolating only two models (in this case there is only one free variable), the average log-likelihood is a convex function of that variable.

 Submit your code, plus answers to (1--8)

------------------------------------------ 

*: [if needed, can turn these in by class time Wednesday, 2/14]