====== Floating Point Explained ======

As an extension to any internal documentation of PDCLib, and because being able to //explain// a subject is a good test of whether I myself have //understood// it, I'll give an introduction to the floating point format here.

===== Base-10 Basics =====

We will take a short detour through numeric notations to lay out the terminology.

==== Decimal Notation ====

[[https://en.wikipedia.org/wiki/Decimal_notation | Decimal notation]] uses the digits 0 through 9, which carry value according to their position. Consider the decimal number 123.4:

^ Digit ^ 1 ^ 2 ^ 3 ^ . ^ 4 ^
^ Value | *10^2 (100) | *10^1 (20) | *10^0 (3) | | *10^-1 (0.4) |

==== Scientific Notation ====

Decimal notation gets more and more unwieldy the further your significant digits are away from the decimal point -- very large, or very small, numbers. For these, you would use [[https://en.wikipedia.org/wiki/Scientific_notation | scientific notation]], where a number is written as m*10^n. The m is the [[https://en.wikipedia.org/wiki/Significand | significand]], also called the "mantissa" (which is the more common term when talking about floating point numbers in computing), while n is the [[https://en.wikipedia.org/wiki/Exponent | exponent]] and 10 is the [[https://en.wikipedia.org/wiki/Base_(exponentiation) | base]].

Let's return to our number 123.4, which could be written as 123.4*10^0. Any number to the zeroth power equals one, so multiplying by 10^0 is a no-op (and usually omitted, returning to normal //decimal notation//). Incrementing or decrementing the exponent shifts the decimal point of the significand one digit to the left or the right, respectively. So we could also write 12.34*10^1, 1234*10^-1, or 1.234*10^2 -- the decimal point is "floating" along the significand. This allows us to write very large numbers like 1.234*10^78, or very small numbers like 1.234*10^-34, efficiently, by scaling the exponent appropriately.

==== Normalized Number ====

The last two numbers in the previous paragraph were written so that there was only one significant digit in front of the decimal point. This is called a [[https://en.wikipedia.org/wiki/Normalized_number | normalized number]]. It allows for quick comparison of numbers, as the exponent now tells us the //order of magnitude// at a glance.

===== Base-2 =====

Now that we have had a look at //scientific notation// and know what a //normalized number// is, let us have a look at how a computer actually encodes floating point values on the binary level.

==== Binary Encoding ====

First of all, the //base// used by a computer is (usually) 2 instead of 10, for obvious reasons. It is implied, and does not need to be stored. One bit is used as the //sign bit// for the mantissa (i.e. whether the whole number is positive or negative). A number of bits is set apart for the mantissa, and a number of bits is set apart for the exponent.

For easier visualization, we will use a (theoretical) 6-bit notation: one sign bit, two mantissa bits, and three exponent bits. This allows us to have a look at every possible value.

=== Sign ===

Unset indicates positive, set indicates negative. Yes, this format has a [[https://en.wikipedia.org/wiki/Signed_zero | signed zero]].

=== Mantissa ===

The mantissa is assumed to be //normalized//, i.e. the highest (non-fractional) bit is assumed to be set. It has a value of 2^0, i.e. 1, with the following fractional bits having exponents of -1, -2, and so on. Under this assumption, the highest bit need not actually be stored, giving room for an additional //fractional// mantissa bit at the end.

Most people who have read this far can probably rattle off the values for //positive// base-2 exponents -- one, two, four, eight, sixteen, and so on. //Negative// exponents usually give people a moment's pause. But it's quite easy: while each positive increment //doubles// the value, each decrement //halves// it -- 0.5, 0.25, 0.125, 0.0625, and so on.

=== Exponent ===

The exponent is a funny bugger. IEEE 754 defines the exponent as an absolute (positive) offset to a (negative) bias. That bias is calculated as 1-2^(bits-1). "All bits clear" and "all bits set" are reserved (we will come to this later), so the smallest meaningful stored exponent is 1. For our 3-bit exponent, we get a bias of 1-2^2 = -3, with a practical range of -2..3:

^ Exponent ^ Value ^
^ 000 | reserved |
^ 001 | -2 |
^ 010 | -1 |
^ 011 | 0 |
^ 100 | 1 |
^ 101 | 2 |
^ 110 | 3 |
^ 111 | reserved |
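To make this concrete, here is a minimal C sketch that decodes our toy 6-bit format. Everything about it -- the function name ''decode6'', the choice of putting the sign in the highest bit -- is my illustration, not PDCLib code. The two reserved exponent patterns are covered in the next section and are only reported here.

<code c>
#include <stdio.h>
#include <math.h>

/* Decode the theoretical 6-bit format described above:
   one sign bit, three exponent bits, two (stored) mantissa bits. */
static void decode6( unsigned bits )
{
    unsigned sign     = ( bits >> 5 ) & 0x01; /* bit 5       */
    unsigned exponent = ( bits >> 2 ) & 0x07; /* bits 4 .. 2 */
    unsigned mantissa =   bits        & 0x03; /* bits 1 .. 0 */

    if ( exponent == 0 || exponent == 7 )
    {
        /* "All bits clear" / "all bits set" -- see the next section. */
        printf( "0x%02X: reserved exponent %u\n", bits, exponent );
        return;
    }

    /* Implied leading bit: the mantissa reads 1.m1 m2, i.e.
       1 + m1 * 2^(-1) + m2 * 2^(-2). */
    double m = 1.0 + ( ( mantissa >> 1 ) & 1 ) * 0.5
                   + ( mantissa & 1 ) * 0.25;

    /* Stored exponent plus the bias 1 - 2^(3-1) = -3. */
    int e = (int)exponent - 3;

    /* value = (-1)^sign * m * 2^e */
    double value = ldexp( sign ? -m : m, e );

    printf( "0x%02X: %g\n", bits, value );
}

int main( void )
{
    /* Walk all 64 possible bit patterns. */
    for ( unsigned bits = 0; bits < 64; ++bits )
    {
        decode6( bits );
    }

    return 0;
}
</code>

Running this prints, among others, 0x0C as 1 (stored exponent 011, mantissa 00) and 0x1B as 14 (the largest normalized value, 1.75 * 2^3).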
==== Exceptions ====

=== Infinity ===

If all exponent bits are set and all mantissa bits are cleared, that indicates an "infinity" value. This usually happens as a result of overflow, or of a division by zero. (**Note:** Division by zero does //not// give "infinity" as a result, mathematically. This is a quirk of the IEEE 754 floating point format!) Positive or negative infinities are indicated by the sign bit.

=== Not a Number ===

If all exponent bits are set and at least one mantissa bit is set, that indicates "not a number" (//NaN//). This can happen for certain operations like taking the square root of a negative value. A processor might support //signalling NaNs//, which lead to a processor exception when used in an operation. To indicate whether a //NaN// is "signalling" or "quiet", the highest mantissa bit is used, albeit in a processor-specific way. PA-RISC and old MIPS processors use that bit as ''is_signalling'', while most other processors use it as ''is_quiet''. The latter convention allows a //NaN// to be "silenced" (by setting the bit) without inadvertently turning the //NaN// into //infinity//. The sign bit does not matter for //NaN// values.

=== Denormalized Numbers ===

When all exponent bits are cleared, the mantissa is considered to be [[https://en.wikipedia.org/wiki/Denormal_number | denormalized]], with an exponent of 1 + bias (for our format: 1 + (-3) = -2). In other words, the smallest representable exponent -- the one belonging to the stored pattern with only the lowest bit set -- is applied, //with the usually implied pre-fractional bit unset//. The purpose, here, is to allow a bit of additional precision for the very smallest representable numbers. Using our theoretical 6-bit format:

^ Exponent ^ Mantissa (stored) ^ Mantissa (implied) ^ Value ^
| 001 | 11 | 1.11 | 1.75 * 2^(-2) = 0.4375 |
| 001 | 10 | 1.10 | 1.50 * 2^(-2) = 0.375 |
| 001 | 01 | 1.01 | 1.25 * 2^(-2) = 0.3125 |
| 001 | 00 | 1.00 | 1.00 * 2^(-2) = 0.25 |
| 000 | 11 | **0**.11 | **0**.75 * 2^(-2) = 0.1875 |
| 000 | 10 | **0**.10 | **0**.50 * 2^(-2) = 0.125 |
| 000 | 01 | **0**.01 | **0**.25 * 2^(-2) = 0.0625 |
| 000 | 00 | **0**.00 | **0** |
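The same rules can be checked on a real-world format. The following sketch classifies a value by its bit pattern, assuming ''float'' is an IEEE 754 single precision (binary32) type and that the platform follows the common ''is_quiet'' convention; the function name and output strings are mine.

<code c>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Classify an IEEE 754 single precision value by its bits:
   1 sign bit, 8 exponent bits, 23 stored mantissa bits. */
static void classify( float f )
{
    uint32_t bits;
    memcpy( &bits, &f, sizeof bits );

    uint32_t sign     = bits >> 31;
    uint32_t exponent = ( bits >> 23 ) & 0xFF;
    uint32_t mantissa = bits & 0x7FFFFF;

    printf( "%g (sign %u): ", (double)f, (unsigned)sign );

    if ( exponent == 0xFF )
    {
        if ( mantissa == 0 )
        {
            puts( "infinity" );
        }
        else
        {
            /* Highest mantissa bit as is_quiet. */
            puts( ( mantissa & 0x400000 ) ? "quiet NaN" : "signalling NaN" );
        }
    }
    else if ( exponent == 0 )
    {
        puts( mantissa == 0 ? "zero" : "denormalized" );
    }
    else
    {
        puts( "normalized" );
    }
}

int main( void )
{
    classify( 123.4f );
    classify( -0.0f );      /* the signed zero mentioned above */
    classify( 1e-45f );     /* rounds to the smallest positive denormal */
    classify( HUGE_VALF );  /* infinity */
    classify( NAN );        /* quiet NaN on most platforms */
    return 0;
}
</code>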
==== Formats ====

=== IEEE 754 ===

The [[https://en.wikipedia.org/wiki/IEEE_754 | IEEE 754]] standard defines various formats, of which basically only two are really of interest when looking at floating point support in a C standard library: single and double precision. (Quadruple precision is listed for comparison.)

^ Format ^ Single Precision ^ Double Precision ^ Quadruple Precision ^
^ Overall Width | 32 bits | 64 bits | 128 bits |
^ Significand | 24 bits | 53 bits | 113 bits |
^ Exponent | 8 bits | 11 bits | 15 bits |
^ Exponent Bias | 127 | 1023 | 16383 |
^ E min | -126 | -1022 | -16382 |
^ E max | +127 | +1023 | +16383 |

=== x86 Extended Precision Format ===

The [[https://en.wikipedia.org/wiki/Extended_precision#x86_extended_precision_format | x86 Extended Precision Format]] is an exception insofar as it stores the non-fractional part of the mantissa //explicitly//. This has little practical impact: while FPUs up to the 80286 did use that integral mantissa bit for "unnormal" encodings, from the 80386 onward such "unnormal" values are no longer generated, and are treated as invalid operands if encountered.

The integral mantissa bit is zero for denormalized numbers and for zero itself (obviously), and one for normalized numbers, infinities, and NaNs. A special case is the pattern with all exponent bits set, the integral and the highest fractional mantissa bit set, and //all other mantissa bits cleared//. This is considered a "floating point indefinite", a special case of quiet (non-signalling) NaN.

^ Format ^ x86 Extended Precision ^
^ Overall Width | 80 bits |
^ Significand | 63 (+1) bits |
^ Exponent | 15 bits |
^ Exponent Bias | 16383 |
^ E min | -16382 |
^ E max | +16383 |

====== Dragon 4 and Grisu 3 ======

The seminal works on the conversion of binary floating point to decimal string representation are:

  * Guy Steele, Jon White (1990): [[https://doi.org/10.1145/93548.93559 | How to Print Floating-Point Numbers Accurately]]
  * David Gay (1990): [[http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.4049 | Correctly Rounded Binary-Decimal and Decimal-Binary Conversions]]
  * Robert G. Burger, R. Kent Dybvig (1996): [[http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.2247 | Printing Floating-Point Numbers Quickly and Accurately]]
  * Florian Loitsch (2010): [[http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.1021.1906 | Printing Floating-Point Numbers Quickly and Accurately with Integers]]

Steele & White reference another seminal work on the opposite conversion, from decimal string to binary floating point representation:

  * William D. Clinger (1990): [[https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.4152 | How to Read Floating Point Numbers Accurately]]

That first paper by Steele & White presented the "Dragon" algorithm in its fourth iteration, to which Gay and Burger / Dybvig contributed improvements. Loitsch presented a significant performance improvement he called "Grisu", of which there are three iterations.

(Thanks to [[https://www.ryanjuckett.com/programming/printing-floating-point-numbers/ | Ryan Juckett]] for writing a very nice introduction to the subject, which was at the root of what I assembled on this page.)

===== Dragon =====

Steele & White approach the presentation of their Dragon algorithm as a series of "lesser" algorithms building on each other.

==== Fixed Point Fraction Output ====

Given:

  * A positive value //f// less than 1, with //n// digits to (input) base //b// (usually 2).

Output:

  * A value //F// of //N// digits to (output) base //B// (usually 10).

Such that:

  - |F - f| < b^(-n) / 2 -- the difference between the representations is less than half the positional value of the //n//th digit of //f//.
  - N is the smallest number of digits such that condition 1 can be true.
  - |F - f| <= B^(-N) / 2 -- the difference between the representations is no more than half the positional value of the //N//th digit of //F//.
  - Each digit of //F// is output before the next is generated; there is no "backing up" for corrections.

Algorithm (FP)^3 (Finite-Precision Fixed-Point Fraction Printout):

  * k = 0, R = f, M = b^(-n) / 2
  * while ( 1 )
    * k++
    * U = floor( R * B )
    * R = ( R * B ) % 1
    * M *= B
    * if ( R >= M AND R <= 1 - M )
      * append( F, U )
    * else
      * break
  * if ( R <= 1/2 )
    * append( F, U )
  * else
    * append( F, U + 1 )

(At exactly R == 1/2, rounding either way satisfies the conditions.) At the end, k is N, i.e. the number of digits in F.

Example (a sketch in C follows below):

  * Given the base b = 2 number f = .10110, with n = 5 digits, to be converted to base B = 10.
  * The //exact// decimal representation is 0.6875.
  * The next higher 5-digit number (.10111) would be decimal 0.71875.
  * The next lower 5-digit number (.10101) would be decimal 0.65625.
  * The mask would be M = b^(-n) / 2 = 2^(-5) / 2 = 0.015625.
  * First (k = 1) loop iteration:
    * multiply R = 0.6875 by B = 10 for integral part U = 6, fractional part 0.875
    * multiply M = 0.015625 by B = 10 for the new M = 0.15625
    * the fractional part 0.875 is larger than the mask 0.15625, but //also// larger than 1 - 0.15625 = 0.84375, so no digit is appended and the loop terminates
  * Post-loop:
    * the fractional part 0.875 is larger than 1/2, so U + 1 = 7 becomes our first (and only) fractional digit
  * We have N = k = 1 fractional digit in our result of 0.7, which is the smallest N that uniquely identifies our original f = .10110 -- of all 5-digit binary fractions, 0.6875 is the one closest to 0.7.
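Here is a minimal C sketch of (FP)^3, under the simplifying assumption that ''double'' arithmetic is exact for our short inputs -- the paper calls for exact fixed-point arithmetic, so a production conversion cannot take this shortcut. The function name ''fp3'' is mine.

<code c>
#include <stdio.h>
#include <math.h>

/* Algorithm (FP)^3 with input base b = 2 and output base B = 10.
   Prints the base-10 digits of the fraction f, which carries n
   significant base-2 digits. */
static void fp3( double f, int n )
{
    double R = f;                        /* remaining fraction */
    double M = ldexp( 1.0, -n ) / 2.0;   /* mask, b^(-n) / 2   */

    for ( ;; )
    {
        R *= 10.0;
        int U = (int)R;                  /* next digit         */
        R -= U;                          /* new fractional part */
        M *= 10.0;

        if ( R >= M && R <= 1.0 - M )
        {
            /* More digits are needed to nail down f. */
            putchar( '0' + U );
        }
        else
        {
            /* Last digit; round. (A carry -- U + 1 == 10 -- would
               have to propagate into earlier digits; the full Dragon
               algorithm deals with that, this sketch does not.) */
            putchar( '0' + ( R <= 0.5 ? U : U + 1 ) );
            break;
        }
    }
}

int main( void )
{
    printf( "0." );
    fp3( 0.6875, 5 );   /* f = .10110 -> 0.7, N = 1 */
    putchar( '\n' );

    printf( "0." );
    fp3( 0.65625, 5 );  /* f = .10101 -> 0.66, N = 2 */
    putchar( '\n' );

    return 0;
}
</code>

The second call exercises the other path through the loop: the first digit (6) is "safe", so the loop appends it and runs a second iteration before terminating with the rounded digit.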
==== Floating-Point Printout ====

Given:

  * A base //b// floating-point number v = f * b^(e - p), where //e//, //f//, and //p// are integers with p >= 0 and 0 <= f < b^p.

Output:

  * An approximate representation to base //B//, using exponential notation if the value is very large or very small.

Algorithm (Dragon2):

  * Compute v' = v * B^(-x), where x is chosen so that the result is either between 1/B and 1, or between 1 and //B// (normalization, either with a leading zero or a leading non-zero digit).
  * Print the integer part of v'.
  * Print a decimal point.
  * Take the fractional part of v', and let n = p - floor( log_b( v' ) ) - 1.
  * Apply algorithm (FP)^3 to that fractional part and //n//.
  * If x was not zero, append the letter "E" and a representation of the scale factor x.
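And here is a rough sketch of Dragon2 on top of the ''fp3'' function from above, under the same exactness assumption, with the normalization fixed to the [1, //B//) variant. Again, the names are mine, and this illustrates the flow rather than providing a production conversion.

<code c>
#include <stdio.h>
#include <math.h>

/* fp3() repeated from the previous sketch. */
static void fp3( double f, int n )
{
    double R = f;
    double M = ldexp( 1.0, -n ) / 2.0;

    for ( ;; )
    {
        R *= 10.0;
        int U = (int)R;
        R -= U;
        M *= 10.0;

        if ( R >= M && R <= 1.0 - M )
        {
            putchar( '0' + U );
        }
        else
        {
            putchar( '0' + ( R <= 0.5 ? U : U + 1 ) );
            break;
        }
    }
}

/* Dragon2, for a positive v whose source format has p significand
   bits (b = 2, B = 10). */
static void dragon2( double v, int p )
{
    int x = 0;

    /* Normalize v' = v * 10^(-x) into [1, 10). */
    while ( v >= 10.0 ) { v /= 10.0; ++x; }
    while ( v <   1.0 ) { v *= 10.0; --x; }

    /* Integer part (a single digit here), then the decimal point. */
    printf( "%d.", (int)v );

    /* n = p - floor( log_b( v' ) ) - 1 */
    int n = p - (int)floor( log2( v ) ) - 1;

    /* Fractional digits via (FP)^3. */
    fp3( v - (int)v, n );

    /* Scale factor, if any. */
    if ( x != 0 )
    {
        printf( "E%d", x );
    }
}

int main( void )
{
    dragon2( 0.6875, 5 );  /* the 5-bit example value; prints 6.9E-1 */
    putchar( '\n' );
    return 0;
}
</code>

Note that Dragon2 is only //approximate//: for the example value it prints 6.9E-1 (0.69), which correctly identifies 0.6875 but is one digit longer than the minimal 0.7 produced by (FP)^3 on the exact fraction.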