pdclib:floatingpoint
Differences
This shows you the differences between two versions of the page.
— | pdclib:floatingpoint [2019/10/21 10:13] (current) – created - external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== Floating Point Explained ====== | ||
+ | As an extension to any internal documentation of PDCLib, and because being able to //explain// a subject is a good test whether I myself have // | ||
+ | |||
+ | ===== Base-10 Basics ===== | ||
+ | |||
+ | We will take a short detour through numeric notations to lay out the terminology. | ||
+ | |||
+ | ==== Decimal Notation ==== | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | ^ Digit ^ 1 ^ 2 ^ 3 ^ . ^ 4 ^ | ||
+ | ^ Value | < | ||
+ | |||
+ | ==== Scientific Notation ==== | ||
+ | |||
+ | Decimal notation gets more and more unwieldy the further your significant digits are away from the decimal point -- very large, or very small numbers. For these, you would use [[https:// | ||
+ | |||
+ | The < | ||
+ | |||
+ | Let's return to our number < | ||
+ | |||
+ | Increment or decrement the exponent to shift the decimal point of the significand one digit to the left or the right, respectively. So we could also write < | ||
+ | |||
+ | This allows us to write very large numbers like < | ||
+ | |||
+ | ==== Normalized Number ==== | ||
+ | |||
+ | The two numbers in the previous paragraph were written so that there was only one significant digit in front of the decimal point. This is called a [[https:// | ||
+ | |||
+ | ===== Base-2 ===== | ||
+ | |||
+ | Now that we had a look at // | ||
+ | |||
+ | ==== Binary Encoding ==== | ||
+ | |||
+ | First thing, the //base// used by a computer is (usually) 2 instead of 10, for obvious reasons. It is implied, and does not need to be stored. | ||
+ | |||
+ | One bit is used as //sign bit// for the mantissa (i.e. whether the whole number is positive or negative). | ||
+ | |||
+ | A number of bits is set apart for the mantissa, and a number of bits is set apart for the exponent. | ||
+ | |||
+ | For easier visualization, | ||
+ | |||
+ | === Sign === | ||
+ | |||
+ | Unset indicates positive, set negative. Yes, this format has a [[https:// | ||
+ | |||
+ | === Mantissa === | ||
+ | |||
+ | The mantissa is assumed to be // | ||
+ | |||
+ | Most people who read thus far can probably rattle down the values for // | ||
+ | |||
+ | === Exponent === | ||
+ | |||
+ | The exponent is a funny bugger. IEEE 754 defines the exponent as an absolute (positive) offset to a (negative) bias. That bias is calculated as < | ||
+ | |||
+ | For our 3-bit exponent, we get a bias of < | ||
+ | |||
+ | ^ Exponent ^ Value ^ | ||
+ | ^ 000 | reserved | | ||
+ | ^ 001 | -2 | | ||
+ | ^ 010 | -1 | | ||
+ | ^ 011 | 0 | | ||
+ | ^ 100 | 1 | | ||
+ | ^ 101 | 2 | | ||
+ | ^ 110 | 3 | | ||
+ | ^ 111 | reserved | | ||
+ | |||
+ | ==== Exceptions ==== | ||
+ | |||
+ | === Infinity === | ||
+ | |||
+ | If all exponent bits are set and all mantissa bits are cleared, that indicates an " | ||
+ | |||
+ | This usually happens as a result of overflow, or a division by zero. (**Note:** Division by zero does //not// give " | ||
+ | |||
+ | === Not a Number === | ||
+ | |||
+ | If all exponent bits are set and there are mantissa bits set, that indicates "not a number" | ||
+ | |||
+ | This can happen for certain operations like taking the square root of a negative value. | ||
+ | |||
+ | A processor might support //signaling NaNs//, which lead to a processor exception when used in an operation. To indicate whether a //NaN// is " | ||
+ | |||
+ | The sign bit does not matter for //NaN// values. | ||
+ | |||
+ | === Denormalized Numbers === | ||
+ | |||
+ | When all exponent bits are cleared, the mantissa is considered to be [[https:// | ||
+ | |||
+ | In other words, the smallest representable exponent (with only the lowest bit set) is applied, //with the usually implied pre-fractional bit unset//. The purpose, here, is to allow a bit of additional precision in the very smallest numbers presentable. | ||
+ | |||
+ | Using our theoretical 6-bit format: | ||
+ | |||
+ | ^ Exponent ^ Mantissa (stored) ^ Mantissa (implied) ^ Value ^ | ||
+ | | 001 | 11 | 1.11 | < | ||
+ | | 001 | 10 | 1.10 | < | ||
+ | | 001 | 01 | 1.01 | < | ||
+ | | 001 | 00 | 1.00 | < | ||
+ | | 000 | 11 | **0**.11 | ||
+ | | 000 | 10 | **0**.10 | ||
+ | | 000 | 01 | **0**.01 | ||
+ | | 000 | 00 | **0**.00 | ||
+ | |||
+ | ==== Formats ==== | ||
+ | |||
+ | === IEEE 754 === | ||
+ | |||
+ | The [[https:// | ||
+ | |||
+ | ^ Format | ||
+ | ^ Overall Width | 32 bits | 64 bits | 128 bits | | ||
+ | ^ Significand | ||
+ | ^ Exponent | ||
+ | ^ Exponent Bias | 127 | 1023 | 16383 | | ||
+ | ^ E min | -126 | -1022 | -16382 | ||
+ | ^ E max | +127 | +1023 | +16383 | ||
+ | |||
+ | === x86 Extended Precision Format === | ||
+ | |||
+ | The [[https:// | ||
+ | |||
+ | The integral mantissa bit is zero for denormalized numbers and zero (obviously), | ||
+ | |||
+ | A special case is the pattern for non-signalling NaNs (all exponent bits set, integral and highest fractional mantissa bit set, //with all other mantissa bits cleared//). This is considered a " | ||
+ | |||
+ | ^ Format | ||
+ | ^ Overall Width | 80 bits | | ||
+ | ^ Significand | ||
+ | ^ Exponent | ||
+ | ^ Exponent Bias | 16383 | | ||
+ | ^ E min | -16382 | ||
+ | ^ E max | +16383 | ||
+ | |||
+ | ====== Dragon 4 and Grisu 3 ====== | ||
+ | |||
+ | The seminal works on the conversion of binary floating point to decimal string representation are: | ||
+ | |||
+ | * Guy Steele, Jon White (1990): [[https:// | ||
+ | * David Gay (1990): [[http:// | ||
+ | * Robert G. Burger, R. Kent Dybvig (1996): [[http:// | ||
+ | * Florian Loitsch (2010): [[http:// | ||
+ | |||
+ | Steele & White reference another seminal work on the opposite conversion, from decimal string to binary floating point representation: | ||
+ | |||
+ | * William D. Clinger (1990): [[https:// | ||
+ | |||
+ | That first paper by Steele & White presented the " | ||
+ | |||
+ | (Thanks to [[https:// | ||
+ | |||
+ | ===== Dragon ===== | ||
+ | |||
+ | Steele & White approach the presentation of their Dragon algorithm as a series of " | ||
+ | |||
+ | ==== Fixed Point Fraction Output ==== | ||
+ | |||
+ | Given: | ||
+ | |||
+ | * A positive value //f// less than 1, with //n// digits to (input) base //b// (usually 2). | ||
+ | |||
+ | Output: | ||
+ | |||
+ | * A value //F// of //N// digits to (output) base //B// (usually 10). | ||
+ | |||
+ | Such that: | ||
+ | |||
+ | - < | ||
+ | * The difference between representations is less than half the positional value of the //n//th digit of //f//. | ||
+ | - < | ||
+ | - < | ||
+ | * The difference between representations is no more than half the positional value of the //N//th digit of //F//. | ||
+ | - Each digit of //F// is output before the next is generated; no "back up" for corrections. | ||
+ | |||
+ | Algorithm < | ||
+ | |||
+ | * <m>k = 0, R = f, M = { b ^ { -n } / 2 }</m> | ||
+ | * while ( 1 ) | ||
+ | * k++ | ||
+ | * U = floor( R * B ) | ||
+ | * R = ( R * B ) % 1 | ||
+ | * M *= B | ||
+ | * if ( <m>R >= M AND <= 1 - M</m> ) | ||
+ | * append( F, U ) | ||
+ | * else | ||
+ | * break | ||
+ | * if ( <m>R <= 1/ | ||
+ | * append( F, U ) | ||
+ | * if ( <m>R >= 1/ | ||
+ | * append( F, U + 1 ) | ||
+ | |||
+ | At the end, < | ||
+ | |||
+ | Example: | ||
+ | |||
+ | * Given the base <m>b = 2</m> number <m>f = .10110</ | ||
+ | * The //exact// decimal representation would be < | ||
+ | * The next higher number (< | ||
+ | * The next lower number (< | ||
+ | * The Mask would be <m>M = { b ^ {-n} } / 2 = { 2 ^ { -5 } } / 2 = 0.015625</ | ||
+ | |||
+ | * First (<m>k = 1</ | ||
+ | * multiply <m>R = 0.6835</ | ||
+ | * multiply <m>M = 0.015625</ | ||
+ | * Fractional part < | ||
+ | * Second (<m>k = 2</ | ||
+ | * multiply <m>R = 0.835</ | ||
+ | * multiply <m>M = 0.15625</ | ||
+ | * Fractional part < | ||
+ | * Post-loop | ||
+ | * Fractional part < | ||
+ | * We have <m>N = k = 2</m> fractional digits in our result of < | ||
+ | |||
+ | ==== Floating-Point Printout ==== | ||
+ | |||
+ | Given: | ||
+ | |||
+ | * A base //b// floating-point number <m>v = f * b ^ (e - p)</ | ||
+ | |||
+ | Output: | ||
+ | |||
+ | * An approximate representation to base //B//, using exponential notation if the value is very large or very small. | ||
+ | |||
+ | Algorithm (Dragon2): | ||
+ | |||
+ | * Compute <m>v prime = v * B ^ {-x}</ | ||
+ | * Print the integer part of <m>v prime</ | ||
+ | * Print a decimal point | ||
+ | * Take the fractional part of <m>v prime</ | ||
+ | * Apply algorithm < | ||
+ | * If < |