As an extension to any internal documentation of PDCLib, and because being able to explain a subject is a good test whether I myself have understood the subject, I'll give an introduction to floating point format here.
We will take a short detour through numeric notations to lay out the terminology.
Decimal notation uses the digits 0 through 9, which carry value according to their position. Consider the decimal number <m>123.4</m>:
Digit | 1 | 2 | 3 | . | 4 | ||||
---|---|---|---|---|---|---|---|---|---|
Value | <m>*10 | 2</m> (100) | <m>*10 | 1</m> (20) | <m>*10 | 0</m> (3) | <m>*10 | -1</m> (0.4) |
Decimal notation gets more and more unwieldy the further your significant digits are away from the decimal point – very large, or very small numbers. For these, you would use scientific notation, where a number is written as <m>m*10^n</m>.
The <m>m</m> is the significand, also called “mantissa” (which is the more common term when talking about floating point numbers in computing), while <m>n</m> is the exponent and <m>10</m> is the base.
Let's return to our number <m>123.4</m>, which could be written as <m>123.4*10^0</m>. Any number to the zeroth power equals one, so multiplying by <m>10^0</m> is a no-op (and usually omitted, returning to normal decimal notation).
Increment or decrement the exponent to shift the decimal point of the significand one digit to the left or the right, respectively. So we could also write <m>12.34*10^1</m>, <m>1234*10^-1</m>, or <m>1.234*10^2</m> – the decimal point is “floating” along the significand.
This allows us to write very large numbers like <m>1.234*10^78</m>, or very small numbers like <m>1.234*10^-34</m> efficiently, by scaling the exponent appropriately.
The two numbers in the previous paragraph were written so that there was only one significant digit in front of the decimal point. This is called a normalized number. It allows for quick comparison of numbers as the exponent now tells us the order of magnitude at a glance.
Now that we had a look at scientific notation and know what a normalized number is, let us have a look at how a computer actually encodes floating point values on the binary level.
First thing, the base used by a computer is (usually) 2 instead of 10, for obvious reasons. It is implied, and does not need to be stored.
One bit is used as sign bit for the mantissa (i.e. whether the whole number is positive or negative).
A number of bits is set apart for the mantissa, and a number of bits is set apart for the exponent.
For easier visualization, we will use a (theoretical) 6-bit notation: One sign bit, two mantissa bits, and three exponent bits. This allows us to have a look at every possible value.
Unset indicates positive, set negative. Yes, this format has a signed zero.
The mantissa is assumed to be normalized, i.e. the highest (non-fractional) bit is assumed to be set. It has a value of <m>2^0</m>, i.e. 1, with following fractional bits having incrementing negative exponents. Under this assumption, the highest bit need not actually be stored, giving room for an additional fractional mantissa bit at the end.
Most people who read thus far can probably rattle down the values for positive base-2 exponents – one, two, four, eight, sixteen and so on. Negative exponents usually give people a moment pause. But it's quite easy: While each positive increment doubles the value, each decrement halves it – <m>0.5</m>, <m>0.25</m>, <m>0.125</m>, <m>0.0625</m> and so on.
The exponent is a funny bugger. IEEE 754 defines the exponent as an absolute (positive) offset to a (negative) bias. That bias is calculated as <m>1-2^(bits-1)</m>. “All bits clear” and “all bits set” are reserved (we will come to this later), so the smallest meaningful exponent is 1.
For our 3-bit exponent, we get a bias of <m>1-2^2 = -3</m>, with a practical range of <m>-2..3</m>:
Exponent | Value |
---|---|
000 | reserved |
001 | -2 |
010 | -1 |
011 | 0 |
100 | 1 |
101 | 2 |
110 | 3 |
111 | reserved |
If all exponent bits are set and all mantissa bits are cleared, that indicates an “infinity” value.
This usually happens as a result of overflow, or a division by zero. (Note: Division by zero does not give “infinity” as a result, mathematically. This is a quirk of the IEEE 754 floating point format!) Positive or negative infinites are indicated by the sign bit.
If all exponent bits are set and there are mantissa bits set, that indicates “not a number” (NaN).
This can happen for certain operations like taking the square root of a negative value.
A processor might support signaling NaNs, which lead to a processor exception when used in an operation. To indicate whether a NaN is “signalling” or “quiet”, the highest mantissa bit is used, albeit in a processor-specific way. PA_RISC and old MIPS processors use that bit as is_signalling
, while most other processors use it as is_quiet
. The latter procedure allows to “silence” a NaN (by setting the bit) without inadvertently turning a NaN into infinity.
The sign bit does not matter for NaN values.
When all exponent bits are cleared, the mantissa is considered to be denormalized, with an exponent of <m>1-(bias)</m>.
In other words, the smallest representable exponent (with only the lowest bit set) is applied, with the usually implied pre-fractional bit unset. The purpose, here, is to allow a bit of additional precision in the very smallest numbers presentable.
Using our theoretical 6-bit format:
Exponent | Mantissa (stored) | Mantissa (implied) | Value | |
---|---|---|---|---|
001 | 11 | 1.11 | <m>1.75 * 2 | (-2) = 0.4375</m> |
001 | 10 | 1.10 | <m>1.50 * 2 | (-2) = 0.375</m> |
001 | 01 | 1.01 | <m>1.25 * 2 | (-2) = 0.3125</m> |
001 | 00 | 1.00 | <m>1.00 * 2 | (-2) = 0.25</m> |
000 | 11 | 0.11 | 0<m>.75 * 2 | (-2) = 0.1875</m> |
000 | 10 | 0.10 | 0<m>.50 * 2 | (-2) = 0.125</m> |
000 | 01 | 0.01 | 0<m>.25 * 2 | (-2) = 0.0625</m> |
000 | 00 | 0.00 | 0 |
The IEEE 754 standard defines various formats, of which basically only two are really of interest when looking at floating point support in a C standard library: Single and double precision.
Format | Single Precision | Double Precision | Quadruple Precision |
---|---|---|---|
Overall Width | 32 bits | 64 bits | 128 bits |
Significand | 24 bits | 53 bits | 113 bits |
Exponent | 8 bits | 11 bits | 15 bits |
Exponent Bias | 127 | 1023 | 16383 |
E min | -126 | -1022 | -16382 |
E max | +127 | +1023 | +16383 |
The x86 Extended Precision Format is an exception insofar as it stores the non-fractional part of the mantissa explicitly. This has little practical impact; while FPUs up to the 80286 did use that integral mantissa bit for “unnormal” encodings, from the 80386 onward such “unnormal” values are no longer generated, and treated as invalid operand if encountered.
The integral mantissa bit is zero for denormalized numbers and zero (obviously), and one for normalized numbers, infinity, and NaNs.
A special case is the pattern for non-signalling NaNs (all exponent bits set, integral and highest fractional mantissa bit set, with all other mantissa bits cleared). This is considered a “floating point indefinite”, a special case of non-signalling NaN.
Format | x86 Extended Precision |
---|---|
Overall Width | 80 bits |
Significand | 63 (+1) bits |
Exponent | 15 bits |
Exponent Bias | 16383 |
E min | -16382 |
E max | +16383 |
The seminal works on the conversion of binary floating point to decimal string representation are:
Steele & White reference another seminal work on the opposite conversion, from decimal string to binary floating point representation:
That first paper by Steele & White presented the “Dragon” algorithm in its fourth iteration, to which Gay and Burger / Dybvig submitted improvements. Loitsch presented a significant performance improvement he called “Grisu”, of which there are three iterations.
(Thanks to Ryan Juckett for writing a very nice introduction on the subject, which was at the root of what I assembled on this page.)
Steele & White approach the presentation of their Dragon algorithm as a series of “lesser” algorithms building on each other.
Given:
Output:
Such that:
Algorithm <m>(FP)^3</m> (Finite-Precision Fixed-Point Fraction Printout):
At the end, <m>k</m> is <m>N</m>, i.e. the number of digits in <m>F</m>.
Example:
Given:
Output:
Algorithm (Dragon2):