# Homework Help: IEEE Floating point, 64 Bit

1. Sep 7, 2014

### Zondrina

1. The problem statement, all variables and given/known data

This is not a homework problem, but rather a concern I had while reading. Not sure where I should've put this thread.

2. Relevant equations

3. The attempt at a solution

I was reading about how integers/floating point numbers were stored in a computer. For an n-bit word, the range of values stored would be from $[-2^{n-1}, 2^{n-1} - 1]$.

The MATLAB environment follows the IEEE double-precision format specification where 8 bytes (or 64 bits) are used to represent floating point numbers. The first bit is used for the sign of the whole number. 11 bits are used for the exponent, 1 for the sign and 10 for the exponent itself. 52 bits are set aside for the mantissa.

Would this mean the 11 bits used for the exponent translate into a range from $[-1024, 1023]$? The book seems to list it as $[-1022, 1023]$ , though I'm not sure why.

The largest positive real number, smallest positive real number and the machine epsilon would make much more sense if I could sort that bit out.

2. Sep 7, 2014

### AlephZero

The book is correct. The two exponent values of all-zero-bits and all-one-bits have special meanings (denormalized numbers and NaNs), so the range of "normal" exponents is 2 less than you might expect.

3. Sep 7, 2014

### Zondrina

So I'm assuming $-1024$ and $-1023$ are reserved for those purposes?

4. Sep 7, 2014

### AlephZero

It doesn't really make sense to convert the reserved bit patterns like that.

The 11-bit exponent is interpreted as an unsigned integer and converted to an exponent by subtracting 1023. That gives exponents from $2^{1-1023}$ to $2^{2046-1023}$, i.e. $2^{-1022}$ to $2^{+1023}$.

The all zeros bit pattern would convert to an exponent of $2^{-1023}$, but it is used for denormalized numbers that are easiest to interpret using the exponent $2^{-1022}$ not $2^{-1023}$.

The all ones bit pattern would convert to $2^{+1024}$ (not your $2^{-1024}$), but the idea of "an exponent" is meaningless for NaNs. They represent the concept of "numbers" like $\pm\infty$ and $0/0$, not actual numbers with a definite numerical value.

5. Sep 7, 2014

### Zondrina

So an 11 bit exponent has the form: $\pm bbbbbbbbbb$ where each $b$ is a bit (10 bits).

You're telling me the binary exponents $0000000000$ (sub-normal number) and $1111111111$ (NaN) are interpreted as unsigned integers.

$1111111111 = 1 \times 2^9 + 1 \times 2^8 + 1 \times 2^7 + 1 \times 2^6 + 1 \times 2^5 + 1 \times 2^4 + 1 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 = 1023$

$0000000000 = 0$

Subtracting from the all ones I get zero.

Subtracting from the all zero I get -1023 as expected.

I'm still slightly confused.

6. Sep 7, 2014

### vela

No, the 11 bits are interpreted as an unsigned number, so it ranges from 0 to 2047. An exponent field of 0 corresponds to either a signed floating-point 0 or a subnormal number, depending on the value of the significand. An exponent field of 2047 corresponds to either infinity or NaN, again depending on the value of the significand.

If it's neither of those two reserved values, then you subtract the exponent bias of 1023 to get the actual value of the exponent.
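As a minimal sketch of the rule above, the following Python snippet pulls the raw 11-bit exponent field out of a double and classifies it. The helper name `exponent_field` and the sample values are only illustrative choices, not anything from the standard.

```python
import struct

BIAS = 1023  # exponent bias for IEEE 754 double precision


def exponent_field(x: float) -> int:
    """Return the raw 11-bit exponent field of a double as an unsigned integer."""
    # Reinterpret the 8 bytes of the double as a 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    # Bit 63 is the sign, bits 62..52 the exponent field, bits 51..0 the fraction.
    return (bits >> 52) & 0x7FF


for value in (1.0, 2.0, 0.5, 0.0, float("inf"), float("nan")):
    field = exponent_field(value)
    if field == 0:
        kind = "zero or subnormal"
    elif field == 2047:
        kind = "infinity or NaN"
    else:
        kind = f"normal, exponent {field - BIAS}"
    print(f"{value!r}: field = {field} ({kind})")
```

Note that 1.0 has an exponent field of 1023, not 0, because the bias is only subtracted when the value is interpreted.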

7. Sep 8, 2014

### D H

Your confusion appears to arise from your reading of how integers themselves are represented. Your book apparently presented the concept of two's complement representation, as if that is the only possible approach. That's not true. There are a number of ways to represent integers.
• Unsigned integers. There is no such thing as a negative number in this representation. A 32 bit unsigned integer can represent integers between 0 and $2^{32}-1$. This representation is still widely used.

• Signed integers. Modern computers can interpret a value as either an unsigned integer or a signed integer. The underlying machine language has instructions that support both concepts. Unsigned integers are easy, but what about signed integers? There are multiple ways to do that. Your book only showed one of them.

• Signed magnitude. This scheme is the easiest to understand. One bit represents the sign of the number, and the remaining bits represent the magnitude. This is the natural representation scheme to humans. It's a bit ungainly in computers. One problem: This scheme has a positive zero and a negative zero. This means special logic is needed to deal with the fact that these bitwise unequal values represent the same number.

• Two's complement. This is now the most widely used scheme to represent signed integers. A negative integer -n where n>0 is represented as the unsigned integer $2^N-n$, where N is the number of bits in the representation.

• One's complement. This used to be a widely used scheme. It has fallen by the wayside to two's complement. A negative number -n where n≥0 is represented as an unsigned integer where each bit in the unsigned representation of n is complemented (a 0 becomes a 1, a 1 becomes a 0). One downside of one's complement is that this scheme, like signed magnitude, has a positive and a negative zero.

• Offset N. This is a much less used scheme. One of the last vestiges of this technique is the representation of the exponent in the IEEE floating point format. The concept is simple: For a given number n, its representation on the computer is n+N, where N is some predetermined offset value. To go from the representation to the actual value, simply subtract the offset.

As noted above, the exponent in the IEEE floating point format uses the offset-N representation scheme. Your confusion arises from your assumption of two's complement.
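To make the list above concrete, here is a rough Python sketch that encodes the same value, -3, under each signed scheme using an 8-bit word. The word size, the offset of 127, and the function names are assumptions picked purely for illustration.

```python
N_BITS = 8  # illustrative word size (an assumption, not from the thread)


def sign_magnitude(n: int, bits: int = N_BITS) -> int:
    """Top bit holds the sign, the remaining bits hold |n|."""
    sign = 1 << (bits - 1) if n < 0 else 0
    return sign | abs(n)


def twos_complement(n: int, bits: int = N_BITS) -> int:
    """A negative n is stored as the unsigned integer 2**bits - |n|."""
    return n % (1 << bits)


def ones_complement(n: int, bits: int = N_BITS) -> int:
    """A negative n is stored as the bitwise complement of |n|."""
    mask = (1 << bits) - 1
    return n if n >= 0 else ~(-n) & mask


def offset_n(n: int, offset: int) -> int:
    """The stored value is n + offset; subtract the offset to decode."""
    return n + offset


n = -3
print(f"sign-magnitude:   {sign_magnitude(n):08b}")
print(f"two's complement: {twos_complement(n):08b}")
print(f"one's complement: {ones_complement(n):08b}")
print(f"offset 127:       {offset_n(n, 127):08b}")
```

The four bit patterns all differ, which is why one and the same exponent field reads so differently depending on whether you assume two's complement or an offset.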

8. Sep 8, 2014

### Zondrina

Thank you vela and D.H for the replies.

Yes D.H, the book is trying to show several different ways to do it, including the 2's complement. The book originally claims the floating point representation is:

$$\pm s \times b ^e$$

After an example, it talks about normalization and so we can re-write the non-zero floating point numbers as:

$$\pm (1 + f) \times 2^e$$

Then it talked about the IEEE standard where 64 bits are used. With normalization, 53 bits are stored in the mantissa instead of 52. Then it goes on to talk about the limited range and precision of the numbers:

This is where my confusion arose, namely with the way they derived the maximum range before overflow. The 11 bits used for the exponent have a range I didn't quite understand.

So for the two's complement, say with $N = 64 \space bits$ and $n = -4123$, the representation would be $2^{64} - 4123$?

Also, the offset N convention is responsible for the exponent range?

9. Sep 8, 2014

### D H

Forget two's complement here. You are clinging to it, and that is erroneous. It plays no role in the IEEE floating format.

Your book is wrong, but not with regard to the largest representable floating point value. The problem is the smallest. It ignores the denormalized numbers.

The exponent of a 64 bit floating point number comprises 11 bits. Interpreted as a raw unsigned integer, the exponent e takes on integer values between 0 and 2047 inclusive. Following is how those values are interpreted:
• e=2047.
This special value is reserved for representing infinities and non-numbers ("not a number", or NaN). Hopefully your text discusses these concepts.

• 1≤e≤2046.
This range of exponents is used for normalized numbers. The corresponding real value is $1.b_0b_1...b_{51}\times 2^{e-1023}$. In other words, the 52 bits that comprise the mantissa are interpreted as the binary fractional part of the base two equivalent of scientific notation. An implied leading 1 precedes this fractional part. The resulting number is then multiplied by $2^{e-1023}$.

• e=0.
This special value is reserved for representing denormalized numbers. These are conceptually similar to the normalized numbers, but with two changes.
1. The implied leading bit is 0 rather than 1.
2. The exponent is -1022 rather than -1023 as would be implied by the expression $e-1023$.

This means the largest representable positive number has an exponent e=2046 and a mantissa of all ones. This corresponds to $(2-2^{-52}) \times 2^{1023}$, or about $1.7976931348623157 \times 10^{308}$. The smallest possible normalized positive number is $1 \times 2^{-1022}$, or about $2.2250738585072014 \times 10^{-308}$. This is not the smallest possible representable positive number. For that you need to look to the denormalized numbers. The smallest representable positive number has an exponent of zero and a mantissa that is all zeros except for the least significant bit. This has a value of $2^{-52} \times 2^{-1022} = 2^{-1074}$, or about $4.9406564584124654 \times 10^{-324}$.
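These limits can be checked by assembling the bit patterns directly. The sketch below does that in Python, assuming the platform's `float` is an IEEE 754 double (true for essentially all current Python builds). The helper `from_fields` is a hypothetical name used only here.

```python
import math
import struct
import sys


def from_fields(sign: int, exponent: int, fraction: int) -> float:
    """Build a double from its sign bit, 11-bit exponent field and 52-bit fraction."""
    bits = (sign << 63) | (exponent << 52) | fraction
    (value,) = struct.unpack(">d", struct.pack(">Q", bits))
    return value


FRACTION_ALL_ONES = (1 << 52) - 1

largest = from_fields(0, 2046, FRACTION_ALL_ONES)  # (2 - 2**-52) * 2**1023
smallest_normal = from_fields(0, 1, 0)             # 1 * 2**-1022
smallest_subnormal = from_fields(0, 0, 1)          # 2**-52 * 2**-1022
# Gap between 1.0 and the next double, i.e. the machine epsilon.
epsilon = from_fields(0, 1023, 1) - 1.0

print(largest == sys.float_info.max)
print(smallest_normal == sys.float_info.min)
print(smallest_subnormal == math.ldexp(1.0, -1074))
print(epsilon == sys.float_info.epsilon)
```

Each comparison should come out `True`. Note that `sys.float_info.min` is the smallest normalized value, not the smallest subnormal one.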

10. Sep 8, 2014

### AlephZero

To see the difference between interpreting the exponent as an unsigned 11-bit integer with an offset of 1023 (i.e. $2^{10}-1$), and your wrong idea of interpreting it as a signed integer, look at the difference with a 3-bit exponent and an offset of 3 (i.e. $2^2 - 1$).

| Bit pattern | Unsigned integer | Exponent | Signed integer |
|---|---|---|---|
| 000 | 0 | Special interpretation (denormalized number) | 0 |
| 001 | 1 | 1-3 = -2 | 1 |
| 010 | 2 | 2-3 = -1 | 2 |
| 011 | 3 | 3-3 = 0 | 3 |
| 100 | 4 | 4-3 = 1 | -4 |
| 101 | 5 | 5-3 = 2 | -3 |
| 110 | 6 | 6-3 = 3 | -2 |
| 111 | 7 | Special interpretation (NaN) | -1 |

The bit patterns of the special values of all-zeros and all-ones would represent 0 and -1 as signed numbers, not values at the ends of the range, i.e. -3 and +4 for a 3-bit exponent, or -1023 and +1024 for an 11-bit exponent.
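For anyone who wants to regenerate the table above, or try other widths, here is a small Python sketch. The function names are made up for this post, and the bias formula $2^{k-1}-1$ for a $k$-bit exponent is the IEEE convention applied to the toy 3-bit case.

```python
EXP_BITS = 3
BIAS = (1 << (EXP_BITS - 1)) - 1  # 2**2 - 1 = 3


def as_signed(pattern: int, bits: int) -> int:
    """Interpret a bit pattern as a two's complement signed integer."""
    if pattern & (1 << (bits - 1)):
        return pattern - (1 << bits)
    return pattern


def biased_exponent(pattern: int, bits: int, bias: int):
    """Return pattern - bias, or None for the reserved all-zeros/all-ones patterns."""
    all_ones = (1 << bits) - 1
    if pattern in (0, all_ones):
        return None
    return pattern - bias


print("pattern  unsigned  exponent  signed")
for pattern in range(1 << EXP_BITS):
    exponent = biased_exponent(pattern, EXP_BITS, BIAS)
    shown = "special" if exponent is None else f"{exponent:+d}"
    signed = as_signed(pattern, EXP_BITS)
    print(f"{pattern:0{EXP_BITS}b}      {pattern:>8d}  {shown:>8}  {signed:>6d}")
```

Setting `EXP_BITS = 11` gives a bias of 1023 and normal exponents running from -1022 to +1023, matching the range the book quotes.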

11. Sep 8, 2014

### Zondrina

Thank you very much for this post. Everything is much more clear now. I see how the exponents work and why $e = 0$ and $e = 2047$ are special.

This example also made it even more clear. Thank you for writing a program that demonstrates what is happening, it's always much easier to learn from an example.