====== Floating Point Explained ======

As an extension to any internal documentation of PDCLib, and because being able to //explain// a subject is a good test whether I myself have //understood// it, this page walks through how floating point numbers are represented in memory, and how they get converted to decimal string representation.

===== Base-10 Basics =====

We will take a short detour through numeric notations to lay out the terminology.

==== Decimal Notation ====

[[https://en.wikipedia.org/wiki/Decimal|Decimal notation]] is what we use every day: each digit has a //positional value//, a power of ten that grows to the left of the decimal point and shrinks to its right. Take the number <m>123.4</m>:

^ Digit ^ 1 ^ 2 ^ 3 ^ . ^ 4 ^
^ Value | <m>100</m> | <m>10</m> | <m>1</m> | | <m>1/10</m> |

==== Scientific Notation ====

Decimal notation gets more and more unwieldy the further your significant digits are away from the decimal point -- very large, or very small numbers. For these, you would use [[https://en.wikipedia.org/wiki/Scientific_notation|scientific notation]].

The general form is <m>significand * base ^ exponent</m>.

Let's return to our number <m>123.4</m>. In scientific notation, it would be written as <m>1.234 * 10^2</m>: the //significand// (or //mantissa//) is <m>1.234</m>, the //base// is <m>10</m>, and the //exponent// is <m>2</m>.

Increment or decrement the exponent to shift the decimal point of the significand one digit to the left or the right, respectively. So we could also write <m>0.1234 * 10^3</m>, or <m>12.34 * 10^1</m>.

This allows us to write very large numbers like <m>1.0 * 10^{12}</m> and very small numbers like <m>1.0 * 10^{-12}</m> without spelling out all the zeroes.
| + | |||
| + | ==== Normalized Number ==== | ||
| + | |||
| + | The two numbers in the previous paragraph were written so that there was only one significant digit in front of the decimal point. This is called a [[https:// | ||
| + | |||
===== Base-2 =====

Now that we had a look at //decimal// notation, let us look at how a computer stores floating point values in //binary//.

==== Binary Encoding ====

First thing, the //base// used by a computer is (usually) 2 instead of 10, for obvious reasons. It is implied, and does not need to be stored.

One bit is used as //sign bit// for the mantissa (i.e. whether the whole number is positive or negative).

A number of bits is set apart for the mantissa, and a number of bits is set apart for the exponent.

For easier visualization, we will use a theoretical 6-bit format in the examples below: one sign bit, a 3-bit exponent, and a 2-bit mantissa.

=== Sign ===

Unset indicates positive, set negative. Yes, this format has a [[https://en.wikipedia.org/wiki/Signed_zero|negative zero]].

=== Mantissa ===

The mantissa is assumed to be //normalized//, i.e. there is exactly one digit in front of the (binary) point. Since, in binary, that digit can only ever be 1, it is not actually stored; it is //implied//. The stored mantissa bits are the //fractional// digits, so a stored mantissa of <m>11</m> stands for <m>1.11</m>.

Most people who read thus far can probably rattle down the values for //positive// powers of two -- 1, 2, 4, 8, and so on. The //fractional// binary digits work the same way on the other side of the point: the first is worth <m>1/2</m>, the second <m>1/4</m>, and so on. So <m>1.11</m> has the decimal value <m>1 + 0.5 + 0.25 = 1.75</m>.

=== Exponent ===

The exponent is a funny bugger. IEEE 754 defines the exponent as an absolute (positive) offset to a (negative) bias. That bias is calculated as <m>2^{k - 1} - 1</m>, with //k// being the number of exponent bits.

For our 3-bit exponent, we get a bias of <m>2^2 - 1 = 3</m>; the actual exponent is the stored bit pattern minus that bias:

^ Exponent ^ Value ^
^ 000 | reserved |
^ 001 | -2 |
^ 010 | -1 |
^ 011 | 0 |
^ 100 | 1 |
^ 101 | 2 |
^ 110 | 3 |
^ 111 | reserved |

==== Exceptions ====

=== Infinity ===

If all exponent bits are set and all mantissa bits are cleared, that indicates an "infinity" value, positive or negative depending on the sign bit.

This usually happens as a result of overflow, or a division by zero. (**Note:** Division by zero does //not// give "not a number" as one might expect; it gives infinity, unless the dividend is zero as well.)

=== Not a Number ===

If all exponent bits are set and there are mantissa bits set, that indicates "not a number" (//NaN//).

This can happen for certain operations like taking the square root of a negative value.

A processor might support //signaling NaNs//, which lead to a processor exception when used in an operation. To indicate whether a //NaN// is "quiet" or "signaling", the most significant mantissa bit is commonly used (set for a quiet NaN, cleared for a signaling one).

The sign bit does not matter for //NaN// values.
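
In standard C, these special values can be produced and classified with the ''<math.h>'' macros ''isinf()'' and ''isnan()''; a minimal sketch:

<code c>
#include <stdio.h>
#include <math.h>

int main( void )
{
    double num = 1.0;
    double den = 0.0;

    double pos_inf = num / den;   /* x / 0.0 gives infinity...            */
    double quiet   = den / den;   /* ...unless x is zero as well: NaN     */

    printf( "%d %d\n", isinf( pos_inf ) != 0, isnan( pos_inf ) != 0 ); /* 1 0 */
    printf( "%d %d\n", isinf( quiet ) != 0, isnan( quiet ) != 0 );     /* 0 1 */
    printf( "%d\n", isnan( sqrt( -1.0 ) ) != 0 );                      /* 1 */

    /* NaN compares unequal to everything, including itself. */
    printf( "%d\n", quiet == quiet );                                  /* 0 */

    return 0;
}
</code>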
| + | |||
| + | === Denormalized Numbers === | ||
| + | |||
| + | When all exponent bits are cleared, the mantissa is considered to be [[https:// | ||
| + | |||
| + | In other words, the smallest representable exponent (with only the lowest bit set) is applied, //with the usually implied pre-fractional bit unset//. The purpose, here, is to allow a bit of additional precision in the very smallest numbers presentable. | ||
| + | |||
| + | Using our theoretical 6-bit format: | ||
| + | |||
^ Exponent ^ Mantissa (stored) ^ Mantissa (implied) ^ Value ^
| 001 | 11 | 1.11 | <m>1.75 * 2^{-2} = 0.4375</m> |
| 001 | 10 | 1.10 | <m>1.50 * 2^{-2} = 0.375</m> |
| 001 | 01 | 1.01 | <m>1.25 * 2^{-2} = 0.3125</m> |
| 001 | 00 | 1.00 | <m>1.00 * 2^{-2} = 0.25</m> |
| 000 | 11 | **0**.11 | <m>0.75 * 2^{-2} = 0.1875</m> |
| 000 | 10 | **0**.10 | <m>0.50 * 2^{-2} = 0.125</m> |
| 000 | 01 | **0**.01 | <m>0.25 * 2^{-2} = 0.0625</m> |
| 000 | 00 | **0**.00 | <m>0.00 * 2^{-2} = 0</m> (zero) |
| + | |||
| + | ==== Formats ==== | ||
| + | |||
| + | === IEEE 754 === | ||
| + | |||
| + | The [[https:// | ||
| + | |||
^ Format ^ Single Precision ^ Double Precision ^ Quadruple Precision ^
^ Overall Width | 32 bits | 64 bits | 128 bits |
^ Significand | 24 bits (23 stored) | 53 bits (52 stored) | 113 bits (112 stored) |
^ Exponent | 8 bits | 11 bits | 15 bits |
^ Exponent Bias | 127 | 1023 | 16383 |
^ E min | -126 | -1022 | -16382 |
^ E max | +127 | +1023 | +16383 |
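These parameters can be queried from ''<float.h>''. Note that ''FLT_MANT_DIG'' / ''DBL_MANT_DIG'' count the full significand including the implied bit, and that the ''*_MIN_EXP'' / ''*_MAX_EXP'' macros are one off from //E min// / //E max// above, because C defines them for a significand in the range [0.5, 1) rather than [1, 2). A quick check:

<code c>
#include <stdio.h>
#include <float.h>

int main( void )
{
    /* Significand width in base-FLT_RADIX digits, implied bit included. */
    printf( "float:  %d mantissa bits, exponent %d .. %d\n",
            FLT_MANT_DIG, FLT_MIN_EXP - 1, FLT_MAX_EXP - 1 );
    printf( "double: %d mantissa bits, exponent %d .. %d\n",
            DBL_MANT_DIG, DBL_MIN_EXP - 1, DBL_MAX_EXP - 1 );
    /* On an IEEE 754 platform this prints 24 / -126 .. 127 and
       53 / -1022 .. 1023, matching the table above. */
    return 0;
}
</code>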
| + | |||
| + | === x86 Extended Precision Format === | ||
| + | |||
| + | The [[https:// | ||
| + | |||
| + | The integral mantissa bit is zero for denormalized numbers and zero (obviously), | ||
| + | |||
| + | A special case is the pattern for non-signalling NaNs (all exponent bits set, integral and highest fractional mantissa bit set, //with all other mantissa bits cleared//). This is considered a " | ||
| + | |||
| + | ^ Format | ||
| + | ^ Overall Width | 80 bits | | ||
| + | ^ Significand | ||
| + | ^ Exponent | ||
| + | ^ Exponent Bias | 16383 | | ||
| + | ^ E min | -16382 | ||
| + | ^ E max | +16383 | ||
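Whether ''long double'' actually uses this format depends on platform and compiler -- it does on typical x86 / x86_64 Unix targets, while e.g. MSVC maps ''long double'' to plain double precision, and other architectures use double or quadruple precision. A quick probe:

<code c>
#include <stdio.h>
#include <float.h>

int main( void )
{
    /* With the 80-bit x86 format, LDBL_MANT_DIG is 64 and
       sizeof( long double ) is usually 12 or 16 due to padding.
       Elsewhere, long double may simply be IEEE double (53)
       or quadruple precision (113). */
    printf( "LDBL_MANT_DIG       = %d\n", LDBL_MANT_DIG );
    printf( "sizeof(long double) = %zu\n", sizeof( long double ) );
    return 0;
}
</code>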
| + | |||
| + | ====== Dragon 4 and Grisu 3 ====== | ||
| + | |||
| + | The seminal works on the conversion of binary floating point to decimal string representation are: | ||
| + | |||
| + | * Guy Steele, Jon White (1990): [[https:// | ||
| + | * David Gay (1990): [[http:// | ||
| + | * Robert G. Burger, R. Kent Dybvig (1996): [[http:// | ||
| + | * Florian Loitsch (2010): [[http:// | ||
| + | |||
| + | Steele & White reference another seminal work on the opposite conversion, from decimal string to binary floating point representation: | ||
| + | |||
| + | * William D. Clinger (1990): [[https:// | ||
| + | |||
| + | That first paper by Steele & White presented the " | ||
| + | |||
| + | (Thanks to [[https:// | ||
| + | |||
| + | ===== Dragon ===== | ||
| + | |||
| + | Steele & White approach the presentation of their Dragon algorithm as a series of " | ||
| + | |||
==== Fixed Point Fraction Output ====

Given:

  * A positive value //f// less than 1, with //n// digits to (input) base //b// (usually 2).

Output:

  * A value //F// of //N// digits to (output) base //B// (usually 10).

Such that:

  - <m>abs( F - f ) < { b ^ { -n } } / 2</m>
    * The difference between representations is less than half the positional value of the //n//th digit of //f//.
  - //N// is the smallest number of digits for which the first requirement can be satisfied (no superfluous digits are generated).
  - <m>abs( F - f ) <= { B ^ { -N } } / 2</m>
    * The difference between representations is no more than half the positional value of the //N//th digit of //F//.
  - Each digit of //F// is output before the next is generated; no "back up" for corrections.

Algorithm <m>(FP)^2</m>:

  * <m>k = 0, R = f, M = { b ^ { -n } / 2 }</m>
  * while ( 1 )
    * k++
    * U = floor( R * B )
    * R = ( R * B ) % 1
    * M *= B
    * if ( <m>R >= M AND R <= 1 - M</m> )
      * append( F, U )
    * else
      * break
  * if ( <m>R <= 1/2</m> )
    * append( F, U )
  * if ( <m>R >= 1/2</m> )
    * append( F, U + 1 )

At the end, <m>F</m> holds the generated digits, and <m>k</m> equals the number of output digits <m>N</m>. (If <m>R</m> is exactly <m>1/2</m> after the loop, both conditions hold and either rounding is valid; an implementation picks one.)
| + | |||
| + | Example: | ||
| + | |||
| + | * Given the base <m>b = 2</m> number <m>f = .10110</ | ||
| + | * The //exact// decimal representation would be < | ||
| + | * The next higher number (< | ||
| + | * The next lower number (< | ||
| + | * The Mask would be <m>M = { b ^ {-n} } / 2 = { 2 ^ { -5 } } / 2 = 0.015625</ | ||
| + | |||
| + | * First (<m>k = 1</ | ||
| + | * multiply <m>R = 0.6835</ | ||
| + | * multiply <m>M = 0.015625</ | ||
| + | * Fractional part < | ||
| + | * Second (<m>k = 2</ | ||
| + | * multiply <m>R = 0.835</ | ||
| + | * multiply <m>M = 0.15625</ | ||
| + | * Fractional part < | ||
| + | * Post-loop | ||
| + | * Fractional part < | ||
| + | * We have <m>N = k = 2</m> fractional digits in our result of < | ||
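For experimenting with examples like the one above, here is a direct transcription of the algorithm into C. It uses plain ''double'' arithmetic, which is good enough for a toy but not for a real conversion routine (exact arithmetic is the whole point of Dragon4); the function name ''fp2_print()'' and its interface are simply made up for this sketch, with //f//, //n// and //B// as in the algorithm:

<code c>
#include <stdio.h>
#include <math.h>

/* Print the digits of the fraction f (0 < f < 1, n significant base-2 bits)
   in base B (B <= 10). Toy transcription of the algorithm above. */
void fp2_print( double f, int n, int B )
{
    double R = f;
    double M = ldexp( 1.0, -n ) / 2.0;   /* b^-n / 2, for b = 2 */
    int U = 0;

    while ( 1 )
    {
        R *= B;
        U  = ( int )R;      /* U = floor( R * B )  */
        R -= U;             /* R = ( R * B ) % 1   */
        M *= B;

        if ( R >= M && R <= 1.0 - M )
        {
            putchar( '0' + U );    /* safe digit, emit and continue */
        }
        else
        {
            break;
        }
    }

    /* Last digit: round down or up on the remainder (ties round up here). */
    putchar( R < 0.5 ? '0' + U : '0' + U + 1 );
}

int main( void )
{
    printf( "0." ); fp2_print( 0.65625, 5, 10 ); putchar( '\n' ); /* 0.66 */
    printf( "0." ); fp2_print( 0.6875,  5, 10 ); putchar( '\n' ); /* 0.7  */
    return 0;
}
</code>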
| + | |||
| + | ==== Floating-Point Printout ==== | ||
| + | |||
| + | Given: | ||
| + | |||
| + | * A base //b// floating-point number <m>v = f * b ^ (e - p)</ | ||
| + | |||
| + | Output: | ||
| + | |||
| + | * An approximate representation to base //B//, using exponential notation if the value is very large or very small. | ||
| + | |||
| + | Algorithm (Dragon2): | ||
| + | |||
| + | * Compute <m>v prime = v * B ^ {-x}</ | ||
| + | * Print the integer part of <m>v prime</ | ||
| + | * Print a decimal point | ||
| + | * Take the fractional part of <m>v prime</ | ||
| + | * Apply algorithm < | ||
| + | * If < | ||
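Continuing in the same toy fashion, a sketch of this printout flow, reusing ''fp2_print()'' from the sketch above (compile both files together). The function name and the scaling loop are inventions for illustration; scaling by 10 in ''double'' arithmetic loses accuracy, so this demonstrates the structure only, not a correct conversion:

<code c>
#include <stdio.h>
#include <float.h>

void fp2_print( double f, int n, int B );   /* from the previous sketch */

/* Toy Dragon2-style printout for a positive, finite v: scale into [1, 10),
   print the single leading digit, the fraction via fp2_print(), then the
   decimal exponent. */
void dragon2_print( double v )
{
    int x = 0;
    int lead;

    while ( v >= 10.0 ) { v /= 10.0; ++x; }
    while ( v <   1.0 ) { v *= 10.0; --x; }

    lead = ( int )v;                          /* integer part, one digit   */
    printf( "%d.", lead );
    fp2_print( v - lead, DBL_MANT_DIG, 10 );  /* fractional digits         */
    if ( x != 0 )
    {
        printf( "e%+d", x );                  /* exponent, if scaled       */
    }
    putchar( '\n' );
}

int main( void )
{
    dragon2_print( 0.65625 );   /* prints 6.5625e-1 */
    dragon2_print( 123.4 );     /* prints roughly 1.234e+2; extra trailing
                                   digits may appear from the double scaling */
    return 0;
}
</code>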
