Floating Point Explained

As an extension to the internal documentation of PDCLib, and because being able to explain a subject is a good test of whether I have understood it myself, I'll give an introduction to the floating point format here.

Base-10 Basics

We will take a short detour through numeric notations to lay out the terminology.

Decimal Notation

Decimal notation uses the digits 0 through 9, which carry value according to their position. Consider the decimal number <m>123.4</m>:

Digit	1	2	3	.	4
Value	<m>*10^2</m> (100)	<m>*10^1</m> (20)	<m>*10^0</m> (3)		<m>*10^-1</m> (0.4)

Scientific Notation

Decimal notation gets more and more unwieldy the further your significant digits are away from the decimal point – very large, or very small numbers. For these, you would use scientific notation, where a number is written as <m>m*10^n</m>.

The <m>m</m> is the significand, also called “mantissa” (which is the more common term when talking about floating point numbers in computing), while <m>n</m> is the exponent and <m>10</m> is the base.

Let's return to our number <m>123.4</m>, which could be written as <m>123.4*10^0</m>. Any number to the zeroth power equals one, so multiplying by <m>10^0</m> is a no-op (and usually omitted, returning to normal decimal notation).

Increment or decrement the exponent to shift the decimal point of the significand one digit to the left or the right, respectively. So we could also write <m>12.34*10^1</m>, <m>1234*10^-1</m>, or <m>1.234*10^2</m> – the decimal point is “floating” along the significand.

This allows us to write very large numbers like <m>1.234*10^78</m>, or very small numbers like <m>1.234*10^-34</m> efficiently, by scaling the exponent appropriately.

Normalized Number

The two numbers in the previous paragraph were written so that there was exactly one (non-zero) digit in front of the decimal point. This is called a normalized number. It allows for quick comparison of numbers as the exponent now tells us the order of magnitude at a glance.
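Normalizing by hand is just repeated shifting of the decimal point. A minimal sketch in C, assuming a positive, finite input; the function name `normalize10` is made up for this illustration, and since `double` is itself a binary format, the resulting significand is only approximately the decimal value you would write down:

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: split x into a significand m and a decimal
 * exponent *n so that x is (approximately) m * 10^n with 1 <= m < 10. */
double normalize10(double x, int *n)
{
    assert(x > 0.0 && x <= DBL_MAX);   /* positive, finite values only */
    *n = 0;
    while (x >= 10.0) {                /* shift the decimal point left */
        x /= 10.0;
        ++*n;
    }
    while (x < 1.0) {                  /* shift the decimal point right */
        x *= 10.0;
        --*n;
    }
    return x;
}
```

Calling `normalize10(123.4, &n)` leaves `n` at 2 and returns a value very close to 1.234.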

Base-2

Now that we have had a look at scientific notation and know what a normalized number is, let us look at how a computer actually encodes floating point values at the binary level.

Binary Encoding

First thing, the base used by a computer is (usually) 2 instead of 10, for obvious reasons. It is implied, and does not need to be stored.

One bit is used as sign bit for the mantissa (i.e. whether the whole number is positive or negative).

A number of bits is set apart for the mantissa, and a number of bits is set apart for the exponent.

For easier visualization, we will use a (theoretical) 6-bit notation: One sign bit, two mantissa bits, and three exponent bits. This allows us to have a look at every possible value.

Sign

Unset indicates positive, set negative. Yes, this format has a signed zero.

Mantissa

The mantissa is assumed to be normalized, i.e. the highest (non-fractional) bit is assumed to be set. It has a value of <m>2^0</m>, i.e. 1, with the following fractional bits having decreasing negative exponents (<m>2^-1</m>, <m>2^-2</m>, and so on). Under this assumption, the highest bit need not actually be stored, giving room for an additional fractional mantissa bit at the end.

Most people who have read this far can probably rattle off the values for positive base-2 exponents – one, two, four, eight, sixteen and so on. Negative exponents usually give people a moment's pause. But it's quite easy: While each increment doubles the value, each decrement halves it – <m>0.5</m>, <m>0.25</m>, <m>0.125</m>, <m>0.0625</m> and so on.

Exponent

The exponent is a funny bugger. IEEE 754 defines the exponent as an absolute (positive) offset to a (negative) bias. That bias is calculated as <m>1-2^(bits-1)</m>. “All bits clear” and “all bits set” are reserved (we will come to this later), so the smallest meaningful stored exponent value is 1.

For our 3-bit exponent, we get a bias of <m>1-2^2 = -3</m>, with a practical range of <m>-2..3</m>:

Exponent	Value
000	reserved
001	-2
010	-1
011	0
100	1
101	2
110	3
111	reserved
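The table can be reproduced with a few lines of C. This is a sketch using the negative-bias convention from the text above; the function names are invented for the example, and the reserved patterns are simply rejected by an assertion:

```c
#include <assert.h>

/* Bias as defined in the text: 1 - 2^(bits-1), a negative number. */
int exponent_bias(unsigned bits)
{
    assert(bits >= 2 && bits <= 15);
    return 1 - (1 << (bits - 1));
}

/* Effective exponent for a stored, non-reserved exponent field. */
int effective_exponent(unsigned bits, unsigned stored)
{
    unsigned all_set = (1u << bits) - 1u;
    assert(stored != 0u && stored != all_set);   /* reserved patterns */
    return (int)stored + exponent_bias(bits);
}
```

For the 3-bit exponent, `effective_exponent(3, 1)` gives -2 and `effective_exponent(3, 6)` gives 3, matching the first and last non-reserved rows.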

Exceptions

Infinity

If all exponent bits are set and all mantissa bits are cleared, that indicates an “infinity” value.

This usually happens as a result of overflow, or a division by zero. (Note: Division by zero does not give “infinity” as a result, mathematically. This is a quirk of the IEEE 754 floating point format!) Positive or negative infinities are indicated by the sign bit.

Not a Number

If all exponent bits are set and there are mantissa bits set, that indicates “not a number” (NaN).

This can happen for certain operations like taking the square root of a negative value.

A processor might support signalling NaNs, which raise a processor exception when used in an operation. The highest mantissa bit indicates whether a NaN is “signalling” or “quiet”, albeit in a processor-specific way: PA-RISC and old MIPS processors use that bit as is_signalling, while most other processors use it as is_quiet. The latter convention allows a NaN to be “silenced” (by setting the bit) without inadvertently turning it into infinity (clearing the bit instead could leave all mantissa bits clear, which is the infinity pattern).

The sign bit does not matter for NaN values.
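The two rules above translate directly into bit tests. The following sketch applies them to a raw 32-bit pattern, assuming the IEEE 754 single precision layout (1 sign bit, 8 exponent bits, 23 stored mantissa bits); the type and function names are made up for this example:

```c
#include <stdint.h>

enum value_kind { KIND_FINITE, KIND_INFINITY, KIND_NAN };

/* Classify a raw single precision bit pattern. */
enum value_kind classify_binary32(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;   /* 8 exponent bits         */
    uint32_t mantissa = bits & 0x7FFFFFu;       /* 23 stored mantissa bits */

    if (exponent != 0xFFu)
        return KIND_FINITE;                     /* zero, denormal, normal  */
    return mantissa == 0u ? KIND_INFINITY : KIND_NAN;
}

/* Highest stored mantissa bit; whether "set" means quiet or signalling
 * depends on the processor, as described above. */
unsigned highest_mantissa_bit(uint32_t bits)
{
    return (unsigned)((bits >> 22) & 1u);
}
```

The sign bit is deliberately ignored here, so `0x7F800000` and `0xFF800000` both classify as infinity.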

Denormalized Numbers

When all exponent bits are cleared, the mantissa is considered to be denormalized, with an exponent of <m>1+bias</m> for the negative bias defined above (equivalently <m>1-bias</m> if the bias is quoted as a positive number, as in the format tables below); that is <m>-2</m> in our 3-bit example.

In other words, the smallest representable exponent (with only the lowest bit set) is applied, with the usually implied pre-fractional bit unset. The purpose, here, is to allow a bit of additional precision in the very smallest numbers representable.

Using our theoretical 6-bit format:

Exponent	Mantissa (stored)	Mantissa (implied)	Value
001	11	1.11	<m>1.75 * 2^(-2) = 0.4375</m>
001	10	1.10	<m>1.50 * 2^(-2) = 0.375</m>
001	01	1.01	<m>1.25 * 2^(-2) = 0.3125</m>
001	00	1.00	<m>1.00 * 2^(-2) = 0.25</m>
000	11	0.11	<m>0.75 * 2^(-2) = 0.1875</m>
000	10	0.10	<m>0.50 * 2^(-2) = 0.125</m>
000	01	0.01	<m>0.25 * 2^(-2) = 0.0625</m>
000	00	0.00	0
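Putting sign, mantissa and exponent together, the finite values of the theoretical 6-bit format can be computed as in the following sketch. The function name and the choice to pass the three fields as separate arguments are assumptions made for this example; the reserved exponent 111 is rejected rather than decoded:

```c
#include <assert.h>

/* Value of a finite pattern in the 6-bit toy format.
 * exponent: 3-bit field, 0..6 (7 is reserved for infinity / NaN)
 * mantissa: 2-bit stored field, 0..3 */
double toy_value(int negative, unsigned exponent, unsigned mantissa)
{
    assert(exponent <= 6u && mantissa <= 3u);

    const int bias = -3;                          /* 1 - 2^(3-1) */
    /* Normalized numbers carry an implied leading 1, denormalized
     * numbers (exponent field 000) an implied leading 0. */
    double m = mantissa / 4.0 + (exponent == 0u ? 0.0 : 1.0);
    /* Denormalized numbers reuse the exponent of stored value 001. */
    int e = (exponent == 0u ? 1 : (int)exponent) + bias;

    double v = m;
    for (; e > 0; --e) v *= 2.0;                  /* scale by 2^e */
    for (; e < 0; ++e) v /= 2.0;
    return negative ? -v : v;
}
```

`toy_value(0, 1, 3)` yields 0.4375 and `toy_value(0, 0, 3)` yields 0.1875, the first normalized and the first denormalized row of the table.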

Formats

IEEE 754

The IEEE 754 standard defines various formats, of which basically only two are really of interest when looking at floating point support in a C standard library: Single and double precision.

Format	Single Precision	Double Precision	Quadruple Precision
Overall Width	32 bits	64 bits	128 bits
Significand	24 bits	53 bits	113 bits
Exponent	8 bits	11 bits	15 bits
Exponent Bias	127	1023	16383
E min	-126	-1022	-16382
E max	+127	+1023	+16383
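When comparing this table to `<float.h>`, note that C models the significand as a fraction below 1 rather than as 1.xxx, so its exponent limits sit one above the IEEE 754 values. A small sketch, assuming that `float` and `double` are IEEE 754 single and double precision on the platform at hand (the C standard does not require this); the function names are invented for the example:

```c
#include <float.h>

/* Significand widths, including the implied leading bit. */
int significand_bits_float(void)  { return FLT_MANT_DIG; }
int significand_bits_double(void) { return DBL_MANT_DIG; }

/* <float.h> exponent limits are offset by one against E min / E max. */
int ieee_emin_float(void)  { return FLT_MIN_EXP - 1; }
int ieee_emax_float(void)  { return FLT_MAX_EXP - 1; }
int ieee_emin_double(void) { return DBL_MIN_EXP - 1; }
int ieee_emax_double(void) { return DBL_MAX_EXP - 1; }
```

On such a platform these functions return the Significand, E min and E max rows of the single and double precision columns.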

x86 Extended Precision Format

The x86 Extended Precision Format is an exception insofar as it stores the non-fractional part of the mantissa explicitly. This has little practical impact; while FPUs up to the 80287 did use that integral mantissa bit for “unnormal” encodings, from the 80387 onward such “unnormal” values are no longer generated, and are treated as invalid operands if encountered.

The integral mantissa bit is zero for denormalized numbers and for zero (obviously), and one for normalized numbers, infinity, and NaNs.

A special case among the non-signalling NaNs is the pattern with all exponent bits set, the integral and the highest fractional mantissa bit set, and all other mantissa bits cleared (with the sign bit set as well). This is considered a “floating point indefinite”.

Format	x86 Extended Precision
Overall Width	80 bits
Significand	63 (+1) bits
Exponent	15 bits
Exponent Bias	16383
E min	-16382
E max	+16383

Dragon 4 and Grisu 3

The seminal works on the conversion of binary floating point to decimal string representation are:

Guy Steele, Jon White (1990): How to Print Floating-Point Numbers Accurately
David Gay (1990): Correctly Rounded Binary-Decimal and Decimal-Binary Conversions
Robert G. Burger, R. Kent Dybvig (1996): Printing Floating-Point Numbers Quickly and Accurately
Florian Loitsch (2010): Printing Floating-Point Numbers Quickly and Accurately with Integers

Steele & White reference another seminal work on the opposite conversion, from decimal string to binary floating point representation:

William D. Clinger (1990): How to read floating point numbers accurately

That first paper by Steele & White presented the “Dragon” algorithm in its fourth iteration, to which Gay and Burger / Dybvig submitted improvements. Loitsch presented a significant performance improvement he called “Grisu”, of which there are three iterations.

(Thanks to Ryan Juckett for writing a very nice introduction on the subject, which was at the root of what I assembled on this page.)

Dragon

Steele & White approach the presentation of their Dragon algorithm as a series of “lesser” algorithms building on each other.

Fixed Point Fraction Output

Given:

A positive value f less than 1, with n digits to (input) base b (usually 2).

Output:

A value F of N digits to (output) base B (usually 10).

Such that:

1. <m>delim{|}{F - f}{|} < { b^{-n} } / 2</m>
- The difference between representations is less than half the positional value of the nth digit of f.
2. <m>N</m> is the smallest number of digits so that 1. can be true.
3. <m>delim{|}{F - f}{|} <= { B^{-N} } / 2</m>
- The difference between representations is no more than half the positional value of the Nth digit of F.
4. Each digit of F is output before the next is generated; no “back up” for corrections.

Algorithm <m>(FP)^3</m> (Finite-Precision Fixed-Point Fraction Printout):

<m>k = 0, R = f, M = { b ^ { -n } / 2 }</m>
while ( 1 )
- k++
- U = floor( R * B )
- R = ( R * B ) % 1
- M *= B
- if ( <m>R >= M AND R <= 1 - M</m> )
  - append( F, U )
- else
  - break
if ( <m>R <= 1/2</m> )
- append( F, U )
else
- append( F, U + 1 )
(For <m>R = 1/2</m> exactly, U + 1 would be equally valid.)

At the end, <m>k</m> is <m>N</m>, i.e. the number of digits in <m>F</m>.
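The pseudocode assumes exact arithmetic. The following is a sketch of one way to get that in C for <m>b = 2</m> and <m>B = 10</m>, not the implementation from the paper: the fraction is passed as an integer numerator `f_num` (so that f = f_num / 2^n), and R and M are kept as numerators over the common denominator 2^(n+1). The function name, its parameters and the limit of n <= 50 are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes the digits of F (without the leading "0.") to out as a
 * NUL-terminated string and returns their count N. */
size_t fp3_digits(uint64_t f_num, unsigned n, char *out, size_t out_size)
{
    assert(n >= 1 && n <= 50);                 /* keeps R * 10 within 64 bits */
    assert(f_num > 0 && f_num < (UINT64_C(1) << n));

    const uint64_t D = UINT64_C(1) << (n + 1); /* common denominator        */
    uint64_t R = f_num * 2;                    /* f as a numerator over D   */
    uint64_t M = 1;                            /* b^-n / 2 over D           */
    uint64_t U;
    size_t k = 0;

    for (;;) {
        R *= 10;
        U = R / D;                             /* floor( R * B )            */
        R %= D;                                /* ( R * B ) % 1             */
        M *= 10;
        /* R >= M implies M < D, so D - M cannot wrap around. */
        if (R >= M && R <= D - M) {
            assert(k + 2 < out_size);          /* digit, last digit, NUL    */
            out[k++] = (char)('0' + U);
        } else {
            break;
        }
    }
    /* Last digit: compare R against 1/2; at exactly 1/2 this picks U. */
    assert(k + 1 < out_size);
    out[k++] = (char)('0' + (2 * R <= D ? U : U + 1));
    out[k] = '\0';
    return k;
}
```

For example, `fp3_digits(1, 10, buf, sizeof buf)` converts 2^-10 = 0.0009765625: it stores "001" and returns 3, since 0.001 is the shortest decimal fraction that still identifies that 10-bit value.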

Example:

Given the base <m>b = 2</m> number <m>f = .10110</m>, with <m>n = 5</m>, to be converted to base <m>B = 10</m>
The exact decimal representation would be <m>0.6875</m>
- The next higher number (<m>f_{+1} = .10111</m>) would be decimal <m>0.71875</m>
- The next lower number (<m>f_{-1} = .10101</m>) would be decimal <m>0.65625</m>
The Mask would be <m>M = { b ^ {-n} } / 2 = { 2 ^ { -5 } } / 2 = 0.015625</m>

First (<m>k = 1</m>) loop
- multiply <m>R = 0.6875</m> by <m>B = 10</m> for integral part <m>6</m>, fractional part <m>0.875</m>
- multiply <m>M = 0.015625</m> by <m>B = 10</m> for new <m>M = 0.15625</m>
- Fractional part <m>0.875</m> is larger than mask <m>0.15625</m>, but also larger than <m>1 - 0.15625 = 0.84375</m>, so the loop terminates without appending the digit
Post-loop
- Fractional part <m>0.875</m> is larger than <m>1/2</m>, so <m>6 + 1 = 7</m> is our first (and only) fractional digit
- We have <m>N = k = 1</m> fractional digit in our result of <m>0.7</m>, which is the smallest <m>N</m> that uniquely identifies our original <m>f = .10110</m>: the difference <m>0.7 - 0.6875 = 0.0125</m> is less than the original mask <m>0.015625</m>, so <m>0.7</m> is closer to <m>f</m> than to either neighbor

Floating-Point Printout