15-451 Algorithms 9/26/2000 D. Sleator * TRIEs * DAWGs * radix sort [Quiz 1 for the first 1/2 hour] ---------------------------------------------------------------------------- Tries ----- Consider the following task. We have a dictionary of words. We want to make a representation of the dictionary that lets us solve the following problem quickly: Given a set of letters, enumerate all the words that can be made with those letters. (For what tasks might this be useful?) We want the algorithm to run in time that is faster than simply enumerating all the words of the dictionary and seeing if they satisfy the constraint. (Splay trees can be used to solve this problem, but the solution we present here is much faster and more space-efficient.) Solution: Tries (TRIE comes from reTRIEval. also called digital search trees). We build a tree where each edge of the tree is labeled with a letter, and each node represents all the words that have as a prefix the path from the root to this node. Here's an example of a trie storing the following words: . car /|\ cars e/d| \c cat / | \ cats . . . do a/ o| a\ dog / | \ dogs . * . done t/ \r n/ \g t/ \r ear * * . * * * ears s| s| e| s| s| s| eat | | | | | | eats * * * * * * The nodes labeled with "*" are those such that the path from the root to that node is a word of the dictionary, and the node labeled with "." are nodes that don't represent a word of the dictionary. One way to implement this is to have an array at each node of 26 pointers (assuming we're only working with letters). This wastes a lot of space, because the typical branching factor is much less than 26. One way to improve this is to store a pair of arrays at each node, where the length of the array is the number of children of the node. One array stores the labels of the edges to children from that node. The other array stores points to those nodes. So to follow, for example the letter "t" from a node, you do binary search in the label array for "t", and if it's there, follow the pointer from the corresponding position in the child pointer array. Note that this data structure is just representing the set of dictionary words (and it doesn't, for example, store any information about the words themselves). In this case we can use a further trick to compress this representation even more. Dawg ---- The above diagram is really just a finite state machine. The arcs are labeled with letters from the input, and the accepting states are simply those labeled with a "*". By allowing ourselves to reuse portiions of the tree (making it into a DAG), it can be substantially further compressed. DAG = Directed Acyclic Graph Dawg = Directed Acyclic Word Graph This is called a Dawg. The Dawg that results form the above set of words has only 8 nodes. (You should be able to construct this yourself). There is an algorithm for constructing the equivalent finite state machine with the minimum number of states. The Dawg uses the same representation as the Trie, and the same algorithms can be run using it. Some numbers: A scrabble dictionary of 94,240 words can be represented as a 117,150 node trie (with 179,618 edges). When converted to a Dawg, the same dictionary has 19,853 nodes (and can be stored in 175K) These representations can be used to write a scrabble program that can find the optimal move in a scrabble position in a fraction of a second. (See "The World's Fastest Scrabble Program", by Appel and Jacobson, JACM, May 1988.) Radix Sort --------- (Read pages 128 and 129 of the Manber.) Describe the old IBM card sorting machines and how they work Here's an implementation of Radix Distribution Sort. it sorts an array a[] of integers, and it uses a temporary array b[] for buffering stuff. #define bits(x,k,j) (((x)>>(k)) & (~(~0<<(j)))) #define w 32 /* The length of the word in bits */ #define m 8 /* numbrer of bits processed per pass */ #define M 256 /* two to the power m */ void dist_radix(int a[], int b[], int N) { int i, j, pass, count[M]; for (pass=0; pass < w; pass +=m) { for (j=0; j=0; i--) b[--count[bits(a[i],pass,m)]] = a[i]; for (i=0; i