User Tools

Site Tools


Floating Point Explained

As an extension to any internal documentation of PDCLib, and because being able to explain a subject is a good test whether I myself have understood the subject, I'll give an introduction to floating point format here.

Base-10 Basics

We will take a short detour through numeric notations to lay out the terminology.

Decimal Notation

Decimal notation uses the digits 0 through 9, which carry value according to their position. Consider the decimal number <m>123.4</m>:

Digit 1 2 3 . 4
Value <m>*102</m> (100) <m>*101</m> (20) <m>*100</m> (3) <m>*10-1</m> (0.4)

Scientific Notation

Decimal notation gets more and more unwieldy the further your significant digits are away from the decimal point – very large, or very small numbers. For these, you would use scientific notation, where a number is written as <m>m*10^n</m>.

The <m>m</m> is the significand, also called “mantissa” (which is the more common term when talking about floating point numbers in computing), while <m>n</m> is the exponent and <m>10</m> is the base.

Let's return to our number <m>123.4</m>, which could be written as <m>123.4*10^0</m>. Any number to the zeroth power equals one, so multiplying by <m>10^0</m> is a no-op (and usually omitted, returning to normal decimal notation).

Increment or decrement the exponent to shift the decimal point of the significand one digit to the left or the right, respectively. So we could also write <m>12.34*10^1</m>, <m>1234*10^-1</m>, or <m>1.234*10^2</m> – the decimal point is “floating” along the significand.

This allows us to write very large numbers like <m>1.234*10^78</m>, or very small numbers like <m>1.234*10^-34</m> efficiently, by scaling the exponent appropriately.

Normalized Number

The two numbers in the previous paragraph were written so that there was only one significant digit in front of the decimal point. This is called a normalized number. It allows for quick comparison of numbers as the exponent now tells us the order of magnitude at a glance.


Now that we had a look at scientific notation and know what a normalized number is, let us have a look at how a computer actually encodes floating point values on the binary level.

Binary Encoding

First thing, the base used by a computer is (usually) 2 instead of 10, for obvious reasons. It is implied, and does not need to be stored.

One bit is used as sign bit for the mantissa (i.e. whether the whole number is positive or negative).

A number of bits is set apart for the mantissa, and a number of bits is set apart for the exponent.

For easier visualization, we will use a (theoretical) 6-bit notation: One sign bit, two mantissa bits, and three exponent bits. This allows us to have a look at every possible value.


Unset indicates positive, set negative. Yes, this format has a signed zero.


The mantissa is assumed to be normalized, i.e. the highest (non-fractional) bit is assumed to be set. It has a value of <m>2^0</m>, i.e. 1, with following fractional bits having incrementing negative exponents. Under this assumption, the highest bit need not actually be stored, giving room for an additional fractional mantissa bit at the end.

Most people who read thus far can probably rattle down the values for positive base-2 exponents – one, two, four, eight, sixteen and so on. Negative exponents usually give people a moment pause. But it's quite easy: While each positive increment doubles the value, each decrement halves it – <m>0.5</m>, <m>0.25</m>, <m>0.125</m>, <m>0.0625</m> and so on.


The exponent is a funny bugger. IEEE 754 defines the exponent as an absolute (positive) offset to a (negative) bias. That bias is calculated as <m>1-2^(bits-1)</m>. “All bits clear” and “all bits set” are reserved (we will come to this later), so the smallest meaningful exponent is 1.

For our 3-bit exponent, we get a bias of <m>1-2^2 = -3</m>, with a practical range of <m>-2..3</m>:

Exponent Value
000 reserved
001 -2
010 -1
011 0
100 1
101 2
110 3
111 reserved



If all exponent bits are set and all mantissa bits are cleared, that indicates an “infinity” value.

This usually happens as a result of overflow, or a division by zero. (Note: Division by zero does not give “infinity” as a result, mathematically. This is a quirk of the IEEE 754 floating point format!) Positive or negative infinites are indicated by the sign bit.

Not a Number

If all exponent bits are set and there are mantissa bits set, that indicates “not a number” (NaN).

This can happen for certain operations like taking the square root of a negative value.

A processor might support signaling NaNs, which lead to a processor exception when used in an operation. To indicate whether a NaN is “signalling” or “quiet”, the highest mantissa bit is used, albeit in a processor-specific way. PA_RISC and old MIPS processors use that bit as is_signalling, while most other processors use it as is_quiet. The latter procedure allows to “silence” a NaN (by setting the bit) without inadvertently turning a NaN into infinity.

The sign bit does not matter for NaN values.

Denormalized Numbers

When all exponent bits are cleared, the mantissa is considered to be denormalized, with an exponent of <m>1-(bias)</m>.

In other words, the smallest representable exponent (with only the lowest bit set) is applied, with the usually implied pre-fractional bit unset. The purpose, here, is to allow a bit of additional precision in the very smallest numbers presentable.

Using our theoretical 6-bit format:

Exponent Mantissa (stored) Mantissa (implied) Value
001 11 1.11 <m>1.75 * 2(-2) = 0.4375</m>
001 10 1.10 <m>1.50 * 2(-2) = 0.375</m>
001 01 1.01 <m>1.25 * 2(-2) = 0.3125</m>
001 00 1.00 <m>1.00 * 2(-2) = 0.25</m>
000 11 0.11 0<m>.75 * 2(-2) = 0.1875</m>
000 10 0.10 0<m>.50 * 2(-2) = 0.125</m>
000 01 0.01 0<m>.25 * 2(-2) = 0.0625</m>
000 00 0.00 0


IEEE 754

The IEEE 754 standard defines various formats, of which basically only two are really of interest when looking at floating point support in a C standard library: Single and double precision.

Format Single Precision Double Precision Quadruple Precision
Overall Width 32 bits 64 bits 128 bits
Significand 24 bits 53 bits 113 bits
Exponent 8 bits 11 bits 15 bits
Exponent Bias 127 1023 16383
E min -126 -1022 -16382
E max +127 +1023 +16383

x86 Extended Precision Format

The x86 Extended Precision Format is an exception insofar as it stores the non-fractional part of the mantissa explicitly. This has little practical impact; while FPUs up to the 80286 did use that integral mantissa bit for “unnormal” encodings, from the 80386 onward such “unnormal” values are no longer generated, and treated as invalid operand if encountered.

The integral mantissa bit is zero for denormalized numbers and zero (obviously), and one for normalized numbers, infinity, and NaNs.

A special case is the pattern for non-signalling NaNs (all exponent bits set, integral and highest fractional mantissa bit set, with all other mantissa bits cleared). This is considered a “floating point indefinite”, a special case of non-signalling NaN.

Format x86 Extended Precision
Overall Width 80 bits
Significand 63 (+1) bits
Exponent 15 bits
Exponent Bias 16383
E min -16382
E max +16383

Dragon 4 and Grisu 3

The seminal works on the conversion of binary floating point to decimal string representation are:

Steele & White reference another seminal work on the opposite conversion, from decimal string to binary floating point representation:

That first paper by Steele & White presented the “Dragon” algorithm in its fourth iteration, to which Gay and Burger / Dybvig submitted improvements. Loitsch presented a significant performance improvement he called “Grisu”, of which there are three iterations.

(Thanks to Ryan Juckett for writing a very nice introduction on the subject, which was at the root of what I assembled on this page.)


Steele & White approach the presentation of their Dragon algorithm as a series of “lesser” algorithms building on each other.

Fixed Point Fraction Output


  • A positive value f less than 1, with n digits to (input) base b (usually 2).


  • A value F of N digits to (output) base B (usually 10).

Such that:

  1. <m>delim{|}{F - f}{|} < { b^{-n} } / 2</m>
    • The difference between representations is less than half the positional value of the nth digit of f.
  2. <m>N</m> is the smallest number of digits so that 1. can be true.
  3. <m>delim{|}{F - f}{|} ⇐ { B^{-N} } / 2</m>
    • The difference between representations is no more than half the positional value of the Nth digit of F.
  4. Each digit of F is output before the next is generated; no “back up” for corrections.

Algorithm <m>(FP)^3</m> (Finite-Precision Fixed-Point Fraction Printout):

  • <m>k = 0, R = f, M = { b ^ { -n } / 2 }</m>
  • while ( 1 )
    • k++
    • U = floor( R * B )
    • R = ( R * B ) % 1
    • M *= B
    • if ( <m>R >= M AND ⇐ 1 - M</m> )
      • append( F, U )
    • else
      • break
  • if ( <m>R ⇐ 1/2</m> )
    • append( F, U )
  • if ( <m>R >= 1/2</m> )
    • append( F, U + 1 )

At the end, <m>k</m> is <m>N</m>, i.e. the number of digits in <m>F</m>.


  • Given the base <m>b = 2</m> number <m>f = .10110</m>, with <m>n = 5</m>, to be converted to base <m>B = 10</m>
  • The exact decimal representation would be <m>.6835</m>
    • The next higher number (<m>f_{+1} = .10111</m>) would be decimal <m>0.71475</m>
    • The next lower number (<m>f_{-1} = .10101</m>) would be decimal <m>0.65225</m>
  • The Mask would be <m>M = { b ^ {-n} } / 2 = { 2 ^ { -5 } } / 2 = 0.015625</m>
  • First (<m>k = 1</m>) loop
    • multiply <m>R = 0.6835</m> by <m>B = 10</m> for integral part <m>6</m>, fractional part <m>0.835</m>
    • multiply <m>M = 0.015625</m> by <m>B = 10</m> for new <m>M = 0.15625</m>
    • Fractional part <m>0.835</m> is larger than mask <m>0.15625</m>, and smaller than <m>1 - 0.15625 = 0.84375</m>, so <m>6</m> is our first (fractional) digit, and the loop continues
  • Second (<m>k = 2</m>) loop
    • multiply <m>R = 0.835</m> by <m>B = 10</m> for integral part <m>8</m>, fractional part <m>0.35</m>
    • multiply <m>M = 0.15625</m> by <m>B = 10</m> for new <m>M = 1.5625</m>
    • Fractional part <m>.35</m> is smaller than mask <m>1.5625</m>, and not smaller than <m>1 - 1.5625 = -0.5625</m>, so the loop terminates
  • Post-loop
    • Fractional part <m>.35</m> is smaller than <m>1/2</m>, so <m>8</m> is our next fractional digit
    • We have <m>N = k = 2</m> fractional digits in our result of <m>0.68</m>, which is the smallest <m>N</m> that uniquely identifies our original <m>f = .10110</m>

Floating-Point Printout


  • A base b floating-point number <m>v = f * b ^ (e - p)</m>, where e, f, and p are integers with <m>p >= 0</m> and <m>0 ⇐ f < b ^ p</m>.


  • An approximate representation to base B, using exponential notation if the value is very large or very small.

Algorithm (Dragon2):

  • Compute <m>v prime = v * B ^ {-x}</m>, where <m>x</m> is chosen so that the result either is between 1 / B and 1, or between 1 and B (normalization, either with leading zero or leading non-zero digit)
  • Print the integer part of <m>v prime</m>
  • Print a decimal point
  • Take the fractional part of <m>v prime</m>, let <m>n = p - floor ( log_ b v prime ) - 1</m>
  • Apply algorithm <m>(FP)^3</m> to f and n
  • If <m>x</m> was not zero, append the letter “E” and a representation of the scale factor <m>x</m>
pdclib/floatingpoint.txt · Last modified: 2019/10/21 10:13 by solar