Solutions to Example Floating Point Problem

Part I A:
a) With k = 3 exp bits, Bias = 2^(k-1) - 1 = 2^2 - 1 = 3
   For denormalized numbers, E = 1 - Bias = -2

b) When all three frac bits are 1, we have
   1/2 + 1/4 + 1/8 = 7/8
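
The arithmetic in Part I A can be checked with a short Python sketch.
The names EXP_BITS and FRAC_BITS are assumptions introduced here for
illustration; the 3-bit widths are inferred from the bit patterns used
in these solutions.

```python
from fractions import Fraction

EXP_BITS = 3   # assumed width of the exp field
FRAC_BITS = 3  # assumed width of the frac field

# Bias = 2^(k-1) - 1 for a k-bit exp field
bias = 2 ** (EXP_BITS - 1) - 1

# Denormalized numbers always use E = 1 - Bias
denorm_E = 1 - bias

# Largest denormalized M: every frac bit set, so 1/2 + 1/4 + 1/8
max_denorm_M = sum(Fraction(1, 2 ** i) for i in range(1, FRAC_BITS + 1))

print(bias, denorm_E, max_denorm_M)
```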

Part I B:
a) We get the smallest value of E when exp = [0 0 1]
   (remember that we can't have exp = [0 0 0] for
   normalized numbers).

   To calculate E, we take the number that the binary
   sequence [0 0 1] represents and subtract the Bias.
   We get E = 1 - Bias = -2

b) We want to make exp as large as possible without
   hitting infinity. We get this when exp = [1 1 0]
   (exp = [1 1 1] would be infinity).

   E = <value of exp in binary> - Bias = 6 - 3 = 3

c) We get the largest value of M when all frac bits
   are 1. We get the fraction:

   1 + 1/2 + 1/4 + 1/8 = 15/8

   The extra 1 comes from the fact that we get a 1 free
   for normalized numbers.
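
The normalized ranges in Part I B can be reproduced the same way. This
is a minimal sketch under the same assumed 3-bit exp and 3-bit frac
fields; the variable names are illustrative only.

```python
from fractions import Fraction

EXP_BITS = 3   # assumed width of the exp field
FRAC_BITS = 3  # assumed width of the frac field
bias = 2 ** (EXP_BITS - 1) - 1

# Normalized exp fields run from 001 up to 110;
# 000 is denormalized and 111 is reserved for infinity/NaN.
min_exp_field = 1
max_exp_field = 2 ** EXP_BITS - 2

min_E = min_exp_field - bias
max_E = max_exp_field - bias

# Largest M: the implied leading 1 plus all frac bits set
max_M = 1 + Fraction(2 ** FRAC_BITS - 1, 2 ** FRAC_BITS)

print(min_E, max_E, max_M)
```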


Part II: Since all the numbers (except NaN) are non-negative,
         the sign bit is always 0. Each answer row below lists
         E, M, the value, and the bit pattern (sign exp frac).


Zero:
 Zero (positive zero) is encoded by all zeroes.
 Because we have a denormalized number, E = 1 - Bias = -2.

        -2      0       0       0 000 000

Smallest Positive (Nonzero):
 We want to make the number as small as possible,
 so we should leave exp bits set to [0 0 0]. We
 must change a frac bit to 1, though, or
 else we would get the floating point representation
 for 0. Make the least significant frac bit 1.
 
        -2     1/8     1/32     0 000 001

Largest denormalized:
 We want to make the largest number possible while
 still having all exp bits set to 0. Simply set all
 frac bits to 1.

        -2     7/8     7/32     0 000 111
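
As a check on the denormalized rows, the sketch below lists every
value with exp = [0 0 0]. It assumes the 3-bit frac field and
Bias = 3 used throughout; BIAS and denorm_values are names made up
for this example.

```python
from fractions import Fraction

BIAS = 3
E = 1 - BIAS  # denormalized exponent, -2

# M = frac / 8 for frac = 000 .. 111, and value = M * 2^E
denorm_values = [Fraction(f, 8) * Fraction(2) ** E for f in range(8)]

for f, v in enumerate(denorm_values):
    print(format(f, "03b"), v)
```

The values are evenly spaced multiples of 1/32: the second entry is the
smallest positive number and the last entry is the largest denormalized
number.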

Smallest positive normalized:
 We must increase exp to [0 0 1], or we would have a
 denormalized number. However, we can leave the frac
 bits at 0.

        -2       1     1/4      0 001 000

One:
 Since the smallest normalized number is 1/4 (see above),
 we know that 1 must be normalized. We know that the M
 for a normalized number has a free 1, so we can leave the
 frac bits at [0 0 0]. Now we must find what to use for
 exp. We know that the final floating point number will
 have a value of M * 2^E, and that M = 1. So we need
 2^E = 1 --> E = 0. We know that E = <value of exp> - Bias,
 so we must set <value of exp> to the Bias. The Bias is
 3 = [0 1 1], so exp = [0 1 1].

        0        1      1       0 011 000
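
The step from E back to the exp bits can be sketched as follows. The
variable names are illustrative, and the 3-bit exp width is assumed
from the patterns above.

```python
BIAS = 3

# We need M * 2^E = 1 with M = 1, so E = 0.
E = 0

# E = <value of exp> - Bias, so <value of exp> = E + Bias
exp_field = E + BIAS
exp_bits = format(exp_field, "03b")  # 3-bit binary string

# sign = 0, frac = 000 because M = 1 needs no frac bits
pattern = f"0 {exp_bits} 000"
print(pattern)
```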

Largest finite number:
 We want to make exp and frac as big as possible without having
 exp = [1 1 1] (since this would be infinity). Make all frac
 bits 1. Make exp = [1 1 0].

        3      15/8     15      0 110 111

NaN:
 There are many possible answers. Just set all exp bits to 1
 and set at least one frac bit to 1.

        -       -      NaN      0 111 111

Infinity:
 With the sign bit fixed at 0, there is only one way to do
 this. Set all exp bits to 1 and all frac bits to 0.

        -       -     infty     0 111 000
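
Every row in Part II can be checked with a small decoder. This is a
minimal sketch, assuming the 1-bit sign, 3-bit exp, 3-bit frac layout
used in these solutions; decode and BIAS are hypothetical names chosen
for this example, not part of the original problem.

```python
from fractions import Fraction

BIAS = 3  # 2^(3-1) - 1 for a 3-bit exp field


def decode(bits):
    """Return (E, M, value) for a pattern like '0 110 111' (sign exp frac)."""
    sign_s, exp_s, frac_s = bits.split()
    sign = -1 if sign_s == "1" else 1
    exp = int(exp_s, 2)
    frac = Fraction(int(frac_s, 2), 8)  # three frac bits -> eighths

    if exp == 0b111:  # special values have no E or M
        if frac != 0:
            return (None, None, "NaN")
        return (None, None, "infty" if sign > 0 else "-infty")

    if exp == 0:  # denormalized: no implied leading 1
        E, M = 1 - BIAS, frac
    else:         # normalized: implied leading 1
        E, M = exp - BIAS, 1 + frac
    return (E, M, sign * M * Fraction(2) ** E)


for pattern in ["0 000 000", "0 000 001", "0 000 111", "0 001 000",
                "0 011 000", "0 110 111", "0 111 111", "0 111 000"]:
    print(pattern, decode(pattern))
```

Exact fractions are used instead of floats so that each result can be
compared directly against the E, M, and value columns above.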