
Printing Floating Point Numbers

To complete the printf() function of PDCLib – which so far lacks support for %e, %f, and %g – I needed to solve the problem of converting the binary, in-memory representation of double and long double to decimal.

Using the same algorithm as for integers (divide by ten, take the quotient as digit, recurse with the remainder) is not an option. Not only would repeated floating point divisions be horrendously slow; multiplying or dividing a base-2 floating point number by 10 repeatedly would also accumulate rounding errors. You would get slow, wrong results.

What is in a floating point number

The lowest bit in an integer has the value 2⁰ (1). The next bit counts for 2¹ (2), the next for 2² (4), and so on.

The basic concept of a floating point number is that the first bit has the value 2⁰ (1), the next 2⁻¹ (0.5), then 2⁻² (0.25), and so on. With this fractional part (also called mantissa, or significand) comes an exponent, which scales the fractional part. The final value of the floating point number is 𝑓×2ᵉ. Usually there is a dedicated sign bit to allow for negatives. (Yes, this means there is such a thing as a negative zero.)

IEEE 754 brought a standard for floating point formats and arithmetic, which most CPUs today adhere to. It defines a 32-bit “single precision” format, and a 64-bit “double precision” format. These are usually what you get when you declare a float or double, respectively.

The C type long double is trickier. It could be the same 64-bit format as double, or the 128-bit “quadruple precision” format defined by IEEE 754. Or it could be Intel's 80-bit “x86 extended precision” format.

There are a couple of special cases that need to be kept in mind when working with floating point numbers.

Implicit Decimal

The first bit of a floating point number, the one valued 2⁰, is always set (but see Subnormals below). For this reason, most floating point formats don't bother with actually storing it, and instead imply that it is set, with the first mantissa bit stored being the 2⁻¹ one.

Biased Exponent

Instead of assuming two's complement to allow for positive and negative exponents, IEEE 754 uses biased exponents: The exponent bits are interpreted as an unsigned integer, but to get the “real” exponent value, you need to subtract the bias value, which is FLT_MAX_EXP - 1, DBL_MAX_EXP - 1, or LDBL_MAX_EXP - 1, respectively.

Huh?

Remember that IEEE 754 is a floating point standard. It makes no assumptions about the integer logic of the machine. How should the exponent be encoded? Two's complement? You don't know if the ALU supports that! So the exponent is stored unsigned. That means that the value 1 (1×10^0, or 1×2^0) is not stored with an exponent of all zeroes, but an exponent halfway between all zeroes (signifying subnormals) and all ones (signifying INF / NaN).

Infinity

If a value is too big for the exponent to represent, the value becomes infinity. This is represented by an all-bits-set exponent and an all-zero mantissa. This allows computation to continue, with a check for overflow just once at the end instead of after each operation.

Not a Number

Some mathematical operations (like 0.0 / 0.0) are undefined, and result in the value taking on a special state called “Not a Number” (NaN). This is represented by an all-bits-set exponent and a non-zero mantissa. The bit pattern of the mantissa can hold additional information, but that is beyond the scope of this document.

Subnormals

When a floating point number becomes too close to zero to represent even with the smallest exponent, most platforms support subnormal numbers, i.e. numbers that are not normalized to have the implicit decimal bit set (see above). These are represented by a no-bits-set exponent. The exponent is considered the same as ( 1 - bias ), and there is no implicit decimal bit. That means the closer to zero the remaining mantissa gets, the fewer bits of precision it holds.

Unnormals

The 80-bit Intel x86 Extended Precision format has an explicit decimal bit, which originally allowed for a number of additional combinations, like “pseudo subnormals” with a no-bits-set exponent but a set decimal bit, which were evaluated the same as subnormals. There were also unnormal numbers, with a non-zero exponent but a zero decimal bit. These were evaluated at an exponent one lower than subnormals, providing some additional precision for very small values.

While the explicit decimal bit still exists in the architecture, the corresponding special combinations were last used by the 80286. From the 80386 onward, they are neither generated nor accepted by the CPU. While PDCLib aims at maximum portability, I decided not to aim for 80286 compatibility in the generic implementation.

In PDCLib

The general idea for PDCLib is as follows:

  • When passed to printf(), a floating point value is sent to either _PDCLIB_bigint_from_dbl() or _PDCLIB_bigint_from_ldbl() for “deconstruction” into its components: Sign, exponent, and mantissa.
    • There is no need for _PDCLIB_bigint_from_flt(), since float gets promoted to double when passed through the variable argument list of printf().
  • This deconstruction makes use of bit-twiddling macros defined in _PDCLIB_config.h as appropriate for the platform.
  • The remaining conversion code works on the _PDCLIB_bigint_t holding the deconstructed floating point value, without needing to know its original composition.
  • For now, I limited myself to base-2 floating points. Other formats (like base-8, or base-16) could probably be added as needed.

Dragon4

Now that we have learned a bit about what a floating point number looks like, we will focus on how to convert one to a decimal representation.

The seminal work in this area is a paper by Guy L. Steele Jr. and Jon L. White, How to Print Floating-Point Numbers Accurately. They had worked out an algorithm named Dragon4, which solved the problem. Years later, Robert G. Burger and R. Kent Dybvig improved the performance of Dragon4 significantly in Printing Floating-Point Numbers Quickly and Accurately.

I am aware of later optimizations, like Printing Floating-Point Numbers Quickly and Accurately with Integers by Florian Loitsch, but since that requires a fallback mechanism for cases when the improved algorithm fails, I opted for the Burger & Dybvig variant.

Overview

Dragon4 finds the shortest decimal number that uniquely identifies the binary number.

Given the limited precision of the mantissa, each representable binary value has exactly one predecessor, and one successor. If we have printed enough decimal digits to unambiguously identify this binary value, we can (and should!) stop printing further digits, as those no longer signify any usable precision.

Let's have a look at such a triple of consecutive numbers, in single precision format for the sake of brevity:

  • 0x4040 0001 – 3.0000002384185791015625
  • 0x4040 0002 – 3.000000476837158203125
  • 0x4040 0003 – 3.0000007152557373046875

You will see that everything after the seventh fractional digit is, effectively, useless. The decimal number 3.0000004 uniquely identifies the binary value 0x4040 0002, and is all that should be printed, even if the user requested a precision of 20 digits.

The initial reflex might be, “but what if I need those additional digits?”. The simple answer is, you really don't. You already have no way to distinguish whether the value you're looking at really was …047683… or, for example, …039871…, or …058812…. If you think you need those digits, you should not be using a float.

How does it do it?

Looked at from low earth orbit, Dragon4 works something like this[1]:

  • Copy the mantissa part of the value into a big integer.
  • If the value is not a subnormal, set the implicit decimal bit.
  • As you don't have a decimal point in there anywhere, correct the exponent by the number of mantissa bits (basically, move the decimal point out of the bit pattern).
  • You set up a “mask” value, one that signifies half the distance between the value and its predecessor / successor.[2]
  • { stuff not yet understood here }
  • The mantissa bigint now being the numerator / dividend, find the highest power of ten that is still less than the mantissa bigint, as denominator / divisor.
  • Begin loop:
    • Divide mantissa bigint numerator by power-of-ten denominator. Result is in the range 0..9 (1..9 for the first round), and your decimal digit.
    • Subtract denominator * quotient from the mantissa.
    • Multiply mantissa and mask value by 10.
    • Repeat until a) the mantissa is zero, b) the specified precision is reached, or c) mantissa < mask value.

[1]: I am not yet done understanding, let alone implementing, the Dragon4 algorithm itself, so what I am writing here is very much work in progress.

[2]: There is a special case when the successor value would have a higher exponent, i.e. the successor would be twice as far away as the predecessor. You need to take this into account.

Visualization

Taking the hint from Ryan Juckett's tutorial on the subject, let's visualize what we're doing with a 6-bit floating point format, with 1 sign bit, 3 exponent bits, and 2 mantissa bits.

Binary     Exponent  Mantissa  Value    Margin
0 000 00   0         0         0        0.0625
0 000 01   0         1         0.0625   0.0625
0 000 10   0         2         0.125    0.0625
0 000 11   0         3         0.1875   0.0625
0 001 00   1         0         0.25     0.0625
0 001 01   1         1         0.3125   0.0625
0 001 10   1         2         0.375    0.0625
0 001 11   1         3         0.4375   0.0625
0 010 00   2         0         0.5      0.125
0 010 01   2         1         0.625    0.125
0 010 10   2         2         0.75     0.125
0 010 11   2         3         0.875    0.125
0 011 00   3         0         1        0.25
0 011 01   3         1         1.25     0.25
0 011 10   3         2         1.5      0.25
0 011 11   3         3         1.75     0.25
0 100 00   4         0         2        0.5
0 100 01   4         1         2.5      0.5
0 100 10   4         2         3        0.5
0 100 11   4         3         3.5      0.5
0 101 00   5         0         4        1
0 101 01   5         1         5        1
0 101 10   5         2         6        1
0 101 11   5         3         7        1
0 110 00   6         0         8        2
0 110 01   6         1         10       2
0 110 10   6         2         12       2
0 110 11   6         3         14       -
0 111 00   7         0         INF      -
0 111 01   7         1         NaN(*)   -
0 111 10   7         2         NaN      -
0 111 11   7         3         NaN      -

*: Signalling NaN (highest mantissa bit zero)

The “Margin” is the difference between that number and its next higher “neighbor”. Half that distance is where a decimal representation would be “tied” between those two binary representations.

Take 0.25 and 0.3125 for example, which have a margin of 0.0625. Half that margin would be 0.03125. The decimal number (0.25 + 0.03125) = 0.28125 would be tied between 0.25 and 0.3125. But 0.28124 would unambiguously identify the binary 0 001 00, because that's closer. This is how strtod() and scanf() can make use of this margin number.

This works the other way around, too. Let's look at 0.4375. I don't need to print 0.4375 to unambiguously identify the binary 0 001 11, because either 0.43 or 0.44 would suffice (being less than 0.03125 away from the “real” value). Just 0.4 wouldn't do, because (0.4375 - 0.4) > 0.03125, making it closer to 0.375 (0 001 10). This is the way printf() looks at the issue.

pdclib/printing_floating_point_numbers.txt · Last modified: by solar

Except where otherwise noted, content on this wiki is licensed under the following license: CC0 1.0 Universal