Number of bits required for floating point representation


Discussion Overview

The discussion centers on the number of bits required for floating point representation, specifically in the context of numerical analysis. Participants explore the mathematical formulation for encoding floating point numbers, including considerations of base, precision, and exponent ranges.

Discussion Character

  • Technical explanation
  • Mathematical reasoning
  • Debate/contested

Main Points Raised

  • One participant quotes a source explaining the structure of floating point representation, including the roles of base ##\beta##, precision ##p##, and exponents ##e_{\max}## and ##e_{\min}##.
  • Another participant points out that the original poster's attempt neglects how the number is actually encoded, and derives the bit count from the binary encodings of the significand and the offset exponent.
  • A suggestion is made to refer to the IEEE 754 standard as a specific example of floating point representation.
  • One participant corrects a misunderstanding regarding the maximum value of digits in the significand, clarifying that each digit must be less than the base ##\beta##.
  • A later reply expresses gratitude for the assistance received, indicating a better understanding of the topic.

Areas of Agreement / Disagreement

The main point of contention is the original poster's attempted derivation, which counts bits from the magnitude of the largest representable value rather than from the number of distinct encodings. Once the encoding of the significand and exponent is spelled out, the original poster accepts the explanation and the question is resolved.

Contextual Notes

There are unresolved aspects regarding the assumptions made in the calculations and the specific encoding methods discussed, which may affect the overall understanding of the bit requirements.

CGandC
Relevant equations (a quote from the book A Concise Introduction to Numerical Analysis by A. C. Faul explaining what floating point representation is):
We live in a continuous world with infinitely many real numbers. However, a computer has only a finite number of bits. This requires an approximate representation. In the past, several different representations of real numbers have been suggested, but now the most widely used by far is the floating point representation. Each floating point representation has a base ##\beta## (which is always assumed to be even) which is typically 2 (binary), 8 (octal), 10 (decimal), or 16 (hexadecimal), and a precision ##p## which is the number of digits (of base ##\beta## ) held in a floating point number. For example, if ##\beta=10## and ##p=5##, the number 0.1 is represented as ##1.0000 \times 10^{-1}##. On the other hand, if ##\beta=2## and ##p=20##, the decimal number 0.1 cannot be represented exactly but is approximately ##1.1001100110011001100 \times 2^{-4}##. We can write the representation as ##\pm d_0 . d_1 \cdots d_{p-1} \times \beta^e##, where ##d_0 . d_1 \cdots d_{p-1}## is called the significand (or mantissa) and has ##p## digits and ##e## is the exponent. If the leading digit ##d_0## is non-zero, the number is said to be normalized. More precisely, ##\pm d_0 . d_1 \cdots d_{p-1} \times \beta^e## is the number
##
\pm\left(d_0+d_1 \beta^{-1}+d_2 \beta^{-2}+\cdots+d_{p-1} \beta^{-(p-1)}\right) \beta^e, 0 \leq d_i<\beta
##
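
To see the quoted representation concretely, here is a small Python sketch (my own illustration, not from the book) that extracts the exponent and the first ##p## base-##\beta## digits of a positive number; with ##x = 0.1##, ##\beta = 2##, ##p = 20## it reproduces the approximation quoted above.

```python
import math

def digits_and_exponent(x, beta=2, p=20):
    """Return (digits, e) with x ~ d0.d1...d_{p-1} * beta**e and 1 <= d0 < beta.
    A minimal sketch for positive x only; x == 0, rounding of the last digit,
    and edge cases of the floating point log are not handled."""
    e = math.floor(math.log(x, beta))   # choose e so that 1 <= x / beta**e < beta
    m = x / beta**e                     # normalized significand in [1, beta)
    digits = []
    for _ in range(p):                  # peel off p digits of base beta
        d = int(m)
        digits.append(d)
        m = (m - d) * beta
    return digits, e

digits, e = digits_and_exponent(0.1, beta=2, p=20)
print(str(digits[0]) + "." + "".join(str(d) for d in digits[1:]), "x 2^", e)
# prints 1.1001100110011001100 x 2^ -4, matching the book's example
```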
__________________________________________________________________________________________________________

I've been reading A Concise Introduction to Numerical Analysis by A. C. Faul and I've been inquiring about the number of bits required to represent a number in floating point representation with base ## \beta ##, precision ## p ## and maximum and minimum exponents ## e_{\max}, e_{\min}##.

Here's the author's calculation:

The largest and smallest allowable exponents are denoted ##e_{\max }## and ##e_{\min }##, respectively. Note that ##e_{\max }## is positive, while ##e_{\min }## is negative. Thus there are ##e_{\max }-e_{\min }+1## possible exponents, the +1 standing for the zero exponent. Since there are ##\beta^p## possible significands, a floating-point number can be encoded in ##\left[\log _2\left(e_{\max }-e_{\min }+1\right)\right]+\left[\log _2\left(\beta^p\right)\right]+1## bits where the final +1 is for the sign bit.

My question: how did the author arrive at ##\left[\log _2\left(e_{\max }-e_{\min }+1\right)\right]+\left[\log _2\left(\beta^p\right)\right]+1## ?

I tried as follows but didn't succeed: the number is ##\pm d_0 \cdot d_1 \cdots d_{p-1} \times \beta^e##, each of ## d_i ## is at most ## \beta ##, and since the largest exponent is ## e_{\max} ##, the largest number possible is ## \beta . \beta \cdots \beta \times \beta^{e_{\max}} ##; hence the number of bits is (adding one for the plus/minus sign) ## \lfloor \log_2( \beta^p \cdot \beta^{e_{\max}} ) \rfloor + 1 = \lfloor \log_2( \beta^p ) + \log_2( \beta^{e_{\max}} ) \rfloor + 1 ##.
 
You have neglected to take account of how the number is actually encoded.

The number <br /> (-1)^S \times d_0.d_1d_2\dots d_{p-1} \times \beta^{e_{\mathrm{min}} + (e - e_{\mathrm{min}})} is encoded as the N_1 + N_2 + 1 digit binary integer <br /> S\;D_1D_2\dots D_{N_2}\;E_1E_2\dots E_{N_1} where S is the sign bit, D_1D_2 \dots D_{N_2} is the binary representation of 0 \leq d_0d_1\dots d_{p-1} \leq \beta^p - 1 and E_1 \dots E_{N_1} is the binary representation of 0 \leq e - e_{\mathrm{min}} \leq e_{\mathrm{max}} - e_{\mathrm{min}}. The maximum integer which can be represented with N binary digits is 1 + 2 + \dots + 2^{N-1} = 2^N - 1 so we require \begin{split}<br /> N_1 &amp;\geq \log_2(e_{\mathrm{max}} - e_{\mathrm{min}} + 1), \\<br /> N_2 &amp;\geq \log_2(\beta^p). \end{split} The minumum necessary numbers of binary digits are therefore \begin{split}<br /> N_1 &amp;= \lceil \log_2(e_{\mathrm{max}} - e_{\mathrm{min}} + 1) \rceil, \\<br /> N_2 &amp;= \lceil \log_2(\beta^p) \rceil, \end{split} where \lceil x \rceil is the smallest integer greater than or equal to x.
 
If a specific example would help, you might look up how IEEE 754 does things.
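
For instance, plugging the binary64 ("double") parameters ##\beta = 2##, ##p = 53##, ##e_{\min} = -1022##, ##e_{\max} = 1023## into the formula above gives ##11 + 53 + 1 = 65## bits, one more than the 64 bits a double actually occupies, because IEEE 754 does not store the leading bit of a normalized significand (the hidden bit). A quick check:

```python
import math

# IEEE 754 binary64 parameters
beta, p, e_min, e_max = 2, 53, -1022, 1023
n_exp = math.ceil(math.log2(e_max - e_min + 1))  # 11 exponent bits
n_sig = math.ceil(math.log2(beta ** p))          # 53 significand bits (one implicit in IEEE 754)
print(n_exp, n_sig, n_exp + n_sig + 1)           # 11 53 65
```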
 
CGandC said:
I tried as follows but didn't succeed: the number is ##\pm d_0 \cdot d_1 \cdots d_{p-1} \times \beta^e##, each of ## d_i ## is at most ## \beta ##
The "is at most ##\beta##" part is incorrect. Whatever the base ##\beta## is, each digit ##d_i## must be less than ##\beta##.
In base-2, the digits are 0 and 1.
In base-8, the digits are 0, 1, 2, 3, 4, 5, 6, and 7.
In decimal (base-10), the largest digit is 9.
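That constraint ##0 \le d_i \le \beta - 1## is also where the ##\beta^p## count of possible significands comes from: each of the ##p## digits independently takes one of ##\beta## values. A tiny Python check (illustrative only):

```python
from itertools import product

beta, p = 2, 3
# every digit string d0 d1 ... d_{p-1} with 0 <= d_i <= beta - 1
significands = list(product(range(beta), repeat=p))
print(len(significands), beta ** p)   # 8 8
```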
 
Thanks for the help! I understand now.
 
