IEEE Floating Point, 64 Bit: How are Exponents Translated into Ranges?

In summary: The 11-bit exponent field is stored as an unsigned integer with a bias of 1023. The all-zeros and all-ones bit patterns are reserved (for zero/denormalized numbers and for infinities/NaNs respectively), so the exponents of normal numbers range over ##[-1022, 1023]##.
  • #1
STEMucator
Homework Helper

Homework Statement



This is not a homework problem, but rather a concern I had while reading. Not sure where I should've put this thread.


I was reading about how integers/floating point numbers are stored in a computer. For an n-bit word, the range of values stored would be ##[-2^{n-1}, 2^{n-1} - 1]##.

The MATLAB environment follows the IEEE double-precision format specification where 8 bytes (or 64 bits) are used to represent floating point numbers. The first bit is used for the sign of the whole number. 11 bits are used for the exponent, 1 for the sign and 10 for the exponent itself. 52 bits are set aside for the mantissa.

Would this mean the 11 bits used for the exponent translate into a range from ##[-1024, 1023]##? The book seems to list it as ##[-1022, 1023]## , though I'm not sure why.

The largest positive real number, smallest positive real number and the machine epsilon would make much more sense if I could sort that bit out.
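The two's-complement range quoted above is easy to check numerically. A quick Python sketch (the function name is my own, not from the thread):

```python
def twos_complement_range(n):
    """Return (min, max) representable by an n-bit two's-complement word."""
    return -2 ** (n - 1), 2 ** (n - 1) - 1

print(twos_complement_range(8))   # (-128, 127)
print(twos_complement_range(16))  # (-32768, 32767)
```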
 
  • #2
The book is correct. The two exponent values of all-zero-bits and all-one-bits have special meanings (denormalized numbers and NaNs), so the range of "normal" exponents is 2 less than you might expect.
 
  • #3
AlephZero said:
The book is correct. The two exponent values of all-zero-bits and all-one-bits have special meanings (denormalized numbers and NaNs), so the range of "normal" exponents is 2 less than you might expect.

So I'm assuming ##-1024## and ##-1023## are reserved for those purposes?
 
  • #4
Zondrina said:
So I'm assuming ##-1024## and ##-1023## are reserved for those purposes?

It doesn't really make sense to convert the reserved bit patterns like that.

The 11-bit exponent is interpreted as an unsigned integer and converted to an exponent by subtracting 1023. That gives exponents from ##2^{1-1023}## to ##2^{2046-1023}##, i.e. ##2^{-1022}## to ##2^{+1023}##.

The all zeros bit pattern would convert to an exponent of ##2^{-1023}##, but it is used for denormalized numbers that are easiest to interpret using the exponent ##2^{-1022}## not ##2^{-1023}##.

The all ones bit pattern would convert to ##2^{+1024}## (not your ##-1024##), but the idea of "an exponent" is meaningless for NaNs. They represent the concept of "numbers" like ##\pm\infty## and ##0/0##, not actual numbers with a definite numerical value.
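The bias arithmetic described above can be checked directly, e.g. in Python (a sketch; the variable names are mine):

```python
BIAS = 1023  # IEEE 754 double-precision exponent bias (2**10 - 1)

# Raw exponent fields 0 and 2047 are reserved; normal fields run 1..2046.
smallest_normal_exp = 1 - BIAS      # -1022
largest_normal_exp = 2046 - BIAS    # +1023

print(smallest_normal_exp, largest_normal_exp)  # -1022 1023
```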
 
  • #5
The 11-bit exponent is interpreted as an unsigned integer and converted to an exponent by subtracting 1023.

So an 11 bit exponent has the form: ##\pm bbbbbbbbbb## where each ##b## is a bit (10 bits).

You're telling me the binary exponents ##0000000000## (sub-normal number) and ##1111111111## (NaN) are interpreted as unsigned integers.

##1111111111 = 1 \times 2^9 + 1 \times 2^8 + 1 \times 2^7 + 1 \times 2^6 + 1 \times 2^5 + 1 \times 2^4 + 1 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 = 1023##

##0000000000 = 0##

Subtracting 1023 from the all-ones value I get zero.

Subtracting 1023 from the all-zeros value I get -1023, as expected.

I'm still slightly confused.
 
  • #6
No, the 11 bits are interpreted as an unsigned number, so it ranges from 0 to 2047. An exponent field of 0 corresponds to either a signed floating-point 0 or a subnormal number, depending on the value of the significand. An exponent field of 2047 corresponds to either infinity or NaN, again depending on the value of the significand.

If it's neither of those two reserved values, then you subtract the exponent bias of 1023 to get the actual value of the exponent.
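This field layout can be verified by unpacking a double's bits. A Python sketch (the function name is my own invention):

```python
import struct

def decode_double(x):
    """Split a Python float (IEEE 754 binary64) into its raw bit fields."""
    bits, = struct.unpack('>Q', struct.pack('>d', x))
    sign = bits >> 63                # 1 sign bit
    raw_exp = (bits >> 52) & 0x7FF   # 11-bit unsigned exponent field
    frac = bits & ((1 << 52) - 1)    # 52-bit fraction (mantissa) field
    return sign, raw_exp, frac

# 1.0 is stored with raw exponent field 1023 (true exponent 1023 - 1023 = 0).
print(decode_double(1.0))          # (0, 1023, 0)
# Infinity uses the reserved all-ones exponent field, 2047.
print(decode_double(float('inf'))) # (0, 2047, 0)
```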
 
  • #7
Zondrina said:
I'm still slightly confused.
Your confusion appears to arise from your reading of how integers themselves are represented. Your book apparently presented the concept of two's complement representation, as if that is the only possible approach. That's not true. There are a number of ways to represent integers.
  • Unsigned integers. There is no such thing as a negative number in this representation. A 32 bit unsigned integer can represent integers between 0 and ##2^{32}-1##. This representation is still widely used.

  • Signed integers. Modern computers can interpret a value as either an unsigned integer or a signed integer. The underlying machine language has instructions that support both concepts. Unsigned integers are easy, but what about signed integers? There are multiple ways to do that. Your book only showed one of them.

    • Signed magnitude. This scheme is the easiest to understand. One bit represents the sign of the number, and the remaining bits represent the magnitude. This is the natural representation scheme to humans. It's a bit ungainly in computers. One problem: This scheme has a positive zero and a negative zero. This means special logic is needed to deal with the fact that these bitwise unequal values represent the same number.

    • Two's complement. This is now the most widely used scheme to represent signed integers. A negative integer -n where n>0 is represented as the unsigned integer ##2^N - n##, where N is the number of bits in the representation.

    • One's complement. This used to be a widely used scheme. It has fallen by the wayside to two's complement. A negative number -n where n≥0 is represented as an unsigned integer where each bit in the unsigned representation of n is complemented (a 0 becomes a 1, a 1 becomes a 0). One downside of one's complement is that this scheme, like signed magnitude, has a positive and a negative zero.

    • Offset N. This is a much less used scheme. One of the last vestiges of this technique is the representation of the exponent in the IEEE floating point format. The concept is simple: For a given number n, its representation on the computer is n+N, where N is some predetermined offset value. To go from the representation to the actual value, simply subtract the offset.

As noted above, the exponent in the IEEE floating point format uses the offset-N representation scheme. Your confusion arises from your assumption of two's complement.
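The four schemes above can be contrasted side by side. A Python sketch (function names are mine; each assumes the value fits in the given width):

```python
def signed_magnitude(n, bits):
    """Sign bit plus magnitude; assumes |n| < 2**(bits-1)."""
    sign = 1 if n < 0 else 0
    return (sign << (bits - 1)) | abs(n)

def ones_complement(n, bits):
    """Negative values flip every bit of |n|."""
    mask = (1 << bits) - 1
    return (~abs(n)) & mask if n < 0 else n

def twos_complement(n, bits):
    """Negative -n is stored as the unsigned value 2**bits - n."""
    return n % (1 << bits)

def offset_n(n, offset):
    """Store n + offset as an unsigned value (the IEEE exponent scheme)."""
    return n + offset

# -5 in 8 bits under each scheme:
print(format(signed_magnitude(-5, 8), '08b'))  # 10000101
print(format(ones_complement(-5, 8), '08b'))   # 11111010
print(format(twos_complement(-5, 8), '08b'))   # 11111011
print(offset_n(-1022, 1023))                   # 1
```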
 
  • #8
Thank you vela and D.H for the replies.

Yes D.H, the book is trying to show several different ways to do it, including the 2's complement. The book originally claims the floating point representation is:

$$\pm s \times b ^e$$

After an example, it talks about normalization and so we can re-write the non-zero floating point numbers as:

$$\pm (1 + f) \times 2^e$$
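The normalized form above can be checked numerically. A Python sketch (the helper name is mine; `math.frexp` returns ##m \times 2^e## with ##0.5 \le |m| < 1##, so it is shifted to the ##1.f \times 2^e## convention):

```python
import math

def normalized_form(x):
    """Express a nonzero float as (1 + f) * 2**e with 0 <= f < 1."""
    m, e = math.frexp(x)           # x = m * 2**e with 0.5 <= |m| < 1
    return 2 * abs(m) - 1, e - 1   # shift to the 1.f * 2**e convention

f, e = normalized_form(6.5)
print(f, e)   # 0.625 2, since 6.5 = (1 + 0.625) * 2**2
```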

Then it talked about the IEEE standard where 64 bits are used. With normalization, 53 bits are stored in the mantissa instead of 52. Then it goes on to talk about the limited range and precision of the numbers:

[Two screenshots from the book showing its tables of floating-point range and precision limits.]


This is where my confusion arose, namely with the way they derived the maximum range before overflow. The range of the 11 bits used for the exponent was what I didn't quite understand.

So for two's complement, say with ##N = 64## bits and ##n = -4123##, the representation would be ##2^{64} - 4123##?

Also, the offset N convention is responsible for the exponent range?
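The ##2^{64} - 4123## guess can be confirmed numerically. A Python sketch (names are mine; Python's `%` reduces a negative integer to its two's-complement residue):

```python
N = 64
n = -4123

# The two's-complement pattern of a negative n is 2**N + n, i.e. 2**N - |n|.
rep = n % (1 << N)
print(rep == 2 ** 64 - 4123)   # True
print(hex(rep))                # 0xffffffffffffefe5
```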
 
  • #9
Forget two's complement here. You are clinging to it, and that is erroneous. It plays no role in the IEEE floating format.

Your book is wrong, but not with regard to the largest representable floating point value. The problem is the smallest. It ignores the denormalized numbers.

The exponent of a 64 bit floating point comprises 11 bits. Interpreted as a raw unsigned integer, the exponent e takes on integer values between 0 and 2047 inclusive. Following is how those values are interpreted:
  • e=2047.
    This special value is reserved for representing infinities and non-numbers ("not a number", or NaN). Hopefully your text discusses these concepts.
  • 1≤e≤2046.
    This range of exponents is used for normalized numbers. The corresponding real value is ##1.b_0b_1...b_{51}\times 2^{e-1023}##. In other words, the 52 bits that comprise the mantissa are interpreted as the binary fractional part of the base two equivalent of scientific notation. An implied leading 1 precedes this fractional part. The resulting number is then multiplied by ##2^{e-1023}##.
  • e=0.
    This special value is reserved for representing denormalized numbers. These are conceptually similar to the normalized numbers, but with two changes.
    1. The leading implied one becomes a leading implied zero.
    2. The exponent is -1022 rather than -1023 as would be implied by the expression ##e-1023##.

This means the largest representable positive number has an exponent e=2046 and a mantissa of all ones. This corresponds to ##(2-2^{-52}) \times 2^{1023}##, or about ##1.7976931348623157 \times 10^{308}##. The smallest possible normalized positive number is ##1 \times 2^{-1022}##, or about ##2.2250738585072014 \times 10^{-308}##. This is not the smallest possible representable positive number. For that you need to look to the denormalized numbers. The smallest representable number has an exponent of zero and a mantissa that is all zeros except for the least significant bit. This has a value of ##1 \times 2^{-1074}##, or about ##4.9406564584124654 \times 10^{-324}##.
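These extreme values can be reproduced directly, e.g. in Python (a sketch; `sys.float_info` is the standard library's view of the same limits):

```python
import sys

# Largest finite double: exponent field 2046, mantissa all ones.
largest = (2 - 2 ** -52) * 2.0 ** 1023
print(largest == sys.float_info.max)        # True

# Smallest positive *normalized* double: 1.0 * 2**-1022.
print(2.0 ** -1022 == sys.float_info.min)   # True

# Smallest positive *denormalized* double: 2**-1074.
tiny = 2.0 ** -1074
print(tiny)             # 5e-324
print(tiny / 2 == 0.0)  # True: nothing smaller is representable
```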
 
  • #10
To see the difference between interpreting the exponent as an unsigned 11-bit integer with an offset of 1023 (i.e. ##2^{10}-1##), and your wrong idea of interpreting it as a signed integer, look at the difference with a 3-bit exponent and an offset of 3 (i.e. ##2^2 - 1##).

Code:
Bit pattern Unsigned integer  Exponent  Signed integer
000 Special interpretation (denormalized number)
001           1               1-3 = -2      1 
010           2               2-3 = -1      2
011           3               3-3 =  0      3
100           4               4-3 =  1     -4
101           5               5-3 =  2     -3
110           6               6-3 =  3     -2
111  Special interpretation (NaN)
The bit patterns of the special values of all-zeros and all-ones would represent 0 and -1 as signed numbers, not values at the ends of the range, i.e. -3 and +4 for a 3-bit exponent, or -1023 and +1024 for an 11-bit exponent.
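The table above can be generated mechanically. A Python sketch (variable names are mine):

```python
BITS = 3
OFFSET = 2 ** (BITS - 1) - 1   # = 3, mirroring IEEE's 2**10 - 1 = 1023

rows = []
for pattern in range(2 ** BITS):
    if pattern == 0:
        rows.append((format(pattern, f'0{BITS}b'), 'special: denormalized numbers'))
    elif pattern == 2 ** BITS - 1:
        rows.append((format(pattern, f'0{BITS}b'), 'special: NaN / infinity'))
    else:
        rows.append((format(pattern, f'0{BITS}b'), f'exponent {pattern - OFFSET:+d}'))

for bits, meaning in rows:
    print(bits, meaning)
```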
 
  • #11
D H said:
Forget two's complement here. You are clinging to it, and that is erroneous. It plays no role in the IEEE floating format.

Your book is wrong, but not with regard to the largest representable floating point value. The problem is the smallest. It ignores the denormalized numbers.

The exponent of a 64 bit floating point comprises 11 bits. Interpreted as a raw unsigned integer, the exponent e takes on integer values between 0 and 2047 inclusive. Following is how those values are interpreted:
  • e=2047.
    This special value is reserved for representing infinities and non-numbers ("not a number", or NaN). Hopefully your text discusses these concepts.

  • 1≤e≤2046.
    This range of exponents is used for normalized numbers. The corresponding real value is ##1.b_0b_1...b_{51}\times 2^{e-1023}##. In other words, the 52 bits that comprise the mantissa are interpreted as the binary fractional part of the base two equivalent of scientific notation. An implied leading 1 precedes this fractional part. The resulting number is then multiplied by ##2^{e-1023}##.

  • e=0.
    This special value is reserved for representing denormalized numbers. These are conceptually similar to the normalized numbers, but with two changes.
    1. The leading implied one becomes a leading implied zero.
    2. The exponent is -1022 rather than -1023 as would be implied by the expression ##e-1023##.

This means the largest representable positive number has an exponent e=2046 and a mantissa of all ones. This corresponds to ##(2-2^{-52}) \times 2^{1023}##, or about ##1.7976931348623157 \times 10^{308}##. The smallest possible normalized positive number is ##1 \times 2^{-1022}##, or about ##2.2250738585072014 \times 10^{-308}##. This is not the smallest possible representable positive number. For that you need to look to the denormalized numbers. The smallest representable number has an exponent of zero and a mantissa that is all zeros except for the least significant bit. This has a value of ##1 \times 2^{-1074}##, or about ##4.9406564584124654 \times 10^{-324}##.

Thank you very much for this post. Everything is much more clear now. I see how the exponents work and why ##e = 0## and ##e = 2047## are special.

AlephZero said:
To see the difference between interpreting the exponent as an unsigned 11-bit integer with an offset of 1023 (i.e. ##2^{10}-1##), and your wrong idea of interpreting it as a signed integer, look at the difference with a 3-bit exponent and an offset of 3 (i.e. ##2^2 - 1##).

Code:
Bit pattern Unsigned integer  Exponent  Signed integer
000 Special interpretation (denormalized number)
001           1               1-3 = -2      1 
010           2               2-3 = -1      2
011           3               3-3 =  0      3
100           4               4-3 =  1     -4
101           5               5-3 =  2     -3
110           6               6-3 =  3     -2
111  Special interpretation (NaN)

The bit patterns of the special values of all-zeros and all-ones would represent 0 and -1 as signed numbers, not values at the ends of the range, i.e. -3 and +4 for a 3-bit exponent, or -1023 and +1024 for an 11-bit exponent.

This example also made things even clearer. Thank you for writing out that table to demonstrate what is happening; it's always much easier to learn from an example.
 

1. What is IEEE Floating point, 64 Bit?

IEEE Floating point, 64 Bit is a standard for representing and performing calculations on floating point numbers in computer systems. It is a format that stores numbers with a sign bit, an exponent, and a significand, allowing for a wide range of values and precision.

2. How is IEEE Floating point, 64 Bit different from other floating point formats?

IEEE Floating point, 64 Bit is different from other floating point formats in terms of its precision and range. It uses 64 bits to store numbers, allowing for a larger range of values and greater precision compared to narrower formats such as 32-bit (single precision) or 16-bit (half precision).

3. What advantages does IEEE Floating point, 64 Bit have over other formats?

IEEE Floating point, 64 Bit has several advantages over other formats. It allows for a wider range of values and greater precision, making it suitable for scientific and engineering calculations. It also follows a standardized format, ensuring consistency across different computer systems.

4. What are some potential drawbacks of using IEEE Floating point, 64 Bit?

One potential drawback of using IEEE Floating point, 64 Bit is that it requires more memory compared to other formats. This can be a concern for systems with limited memory. Additionally, it may not be suitable for applications that require exact precision, as there can be rounding errors in calculations.

5. How is IEEE Floating point, 64 Bit used in real-world applications?

IEEE Floating point, 64 Bit is commonly used in scientific and engineering applications that require a wide range of values and precision, such as in simulations, data analysis, and mathematical modeling. It is also used in computer graphics and gaming to represent and manipulate 3D objects and environments.
