- #1

CGandC


(Here is a passage from *A Concise Introduction to Numerical Analysis* by A. C. Faul explaining what *floating point representation* is):

We live in a continuous world with infinitely many real numbers. However, a computer has only a finite number of bits. This requires an approximate representation. In the past, several different representations of real numbers have been suggested, but now the most widely used by far is the floating point representation. Each floating point representation has a base ##\beta## (which is always assumed to be even) which is typically 2 (binary), 8 (octal), 10 (decimal), or 16 (hexadecimal), and a precision ##p## which is the number of digits (of base ##\beta##) held in a floating point number. For example, if ##\beta=10## and ##p=5##, the number 0.1 is represented as ##1.0000 \times 10^{-1}##. On the other hand, if ##\beta=2## and ##p=20##, the decimal number 0.1 cannot be represented exactly but is approximately ##1.1001100110011001100 \times 2^{-4}##. We can write the representation as ##\pm d_0 . d_1 \cdots d_{p-1} \times \beta^e##, where ##d_0 . d_1 \cdots d_{p-1}## is called the significand (or mantissa) and has ##p## digits and ##e## is the exponent. If the leading digit ##d_0## is non-zero, the number is said to be normalized. More precisely, ##\pm d_0 . d_1 \cdots d_{p-1} \times \beta^e## is the number

## \pm\left(d_0+d_1 \beta^{-1}+d_2 \beta^{-2}+\cdots+d_{p-1} \beta^{-(p-1)}\right) \beta^e, \quad 0 \leq d_i<\beta ##
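To make the definition concrete, here is a minimal Python sketch that evaluates the sum above exactly for the two examples in the quoted passage. The function name `significand_value` and its parameters are my own illustrative choices, not something from the book.

```python
from fractions import Fraction

def significand_value(digits, beta, e, sign=1):
    """Evaluate sign * (d_0 + d_1/beta + ... + d_{p-1}/beta^(p-1)) * beta^e exactly.

    `digits` is the list [d_0, d_1, ..., d_{p-1}] (hypothetical helper, for illustration).
    """
    if not all(0 <= d < beta for d in digits):
        raise ValueError("each digit must satisfy 0 <= d_i < beta")
    total = sum(Fraction(d, beta ** i) for i, d in enumerate(digits))
    return sign * total * Fraction(beta) ** e

# beta = 10, p = 5: 1.0000 x 10^-1
decimal_tenth = significand_value([1, 0, 0, 0, 0], beta=10, e=-1)

# beta = 2, p = 20: 1.1001100110011001100 x 2^-4
binary_digits = [int(c) for c in "11001100110011001100"]
binary_tenth = significand_value(binary_digits, beta=2, e=-4)

print(decimal_tenth)        # the decimal example is exactly one tenth
print(float(binary_tenth))  # the binary example is only close to 0.1
```

Using `Fraction` keeps the arithmetic exact, so the gap between the 20-digit binary value and 0.1 comes from the representation itself rather than from rounding inside Python.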

I've been reading *A Concise Introduction to Numerical Analysis* by A. C. Faul, and I've been trying to work out the number of bits required to represent a number in floating point representation with base ##\beta##, precision ##p##, and maximum and minimum exponents ##e_{\max}, e_{\min}##.

Here's the author's calculation:

The largest and smallest allowable exponents are denoted ##e_{\max }## and ##e_{\min }##, respectively. Note that ##e_{\max }## is positive, while ##e_{\min }## is negative. Thus there are ##e_{\max }-e_{\min }+1## possible exponents, the +1 standing for the zero exponent. Since there are ##\beta^p## possible significands, a floating-point number can be encoded in ##\left[\log _2\left(e_{\max }-e_{\min }+1\right)\right]+\left[\log _2\left(\beta^p\right)\right]+1## bits where the final +1 is for the sign bit.

**My question:** how did the author arrive at ##\left[\log _2\left(e_{\max }-e_{\min }+1\right)\right]+\left[\log _2\left(\beta^p\right)\right]+1##?

I tried as follows but didn't succeed: the number is ##\pm d_0 . d_1 \cdots d_{p-1} \times \beta^e##, each ##d_i## is at most ##\beta##, and since the largest exponent is ##e_{\max}##, the largest possible number is ##\beta . \beta \cdots \beta \times \beta^{e_{\max}}##. Hence the number of bits is (adding one for the plus/minus sign) ##\lfloor \log_2(\beta^p \cdot \beta^{e_{\max}}) \rfloor + 1 = \lfloor \log_2(\beta^p) + \log_2(\beta^{e_{\max}}) \rfloor + 1##.
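For comparison, one possible reading of the author's formula is as a counting argument rather than a bound on the largest value: distinguishing ##N## alternatives takes ##\lceil \log_2 N \rceil## bits, and the exponent, the significand, and the sign are stored as three separate fields. The sketch below assumes that the square brackets in the quoted formula mean rounding up (ceiling). The names `ceil_log2` and `encoding_bits` are hypothetical, and the parameters ##\beta=2##, ##p=24##, ##e_{\min}=-126##, ##e_{\max}=127## are those of IEEE single precision.

```python
def ceil_log2(n):
    """Return ceil(log2(n)) for an integer n >= 1, using exact integer arithmetic."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return (n - 1).bit_length()

def encoding_bits(beta, p, e_min, e_max):
    """Bits for separate exponent, significand, and sign fields.

    Assumes the brackets in the quoted formula denote the ceiling function.
    """
    exponent_bits = ceil_log2(e_max - e_min + 1)  # e_max - e_min + 1 possible exponents
    significand_bits = ceil_log2(beta ** p)       # beta^p possible significands
    return exponent_bits + significand_bits + 1   # final +1 is the sign bit

# IEEE single precision parameters: 8 exponent bits + 24 significand bits + 1 sign bit
single = encoding_bits(beta=2, p=24, e_min=-126, e_max=127)
print(single)  # 33
```

The result is 33 rather than 32 because the formula counts every significand digit. In the actual IEEE single format, the leading bit of a normalized binary significand is always 1 and is not stored, which saves one bit.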