pdclib:floatingpoint

==== Decimal Notation ====
  
[[https://en.wikipedia.org/wiki/Decimal_notation | Decimal notation]] uses the digits 0 through 9, which carry value according to their position. Consider the decimal number <m>123.4</m>:
  
^ Digit ^ 1 ^ 2 ^ 3 ^ . ^ 4 ^
^ Value | <m>*10^2</m> (100) | <m>*10^1</m> (20) | <m>*10^0</m> (3) | | <m>*10^-1</m> (0.4) |
  
==== Scientific Notation ====
  
Decimal notation gets more and more unwieldy the further your significant digits are away from the decimal point -- very large, or very small numbers. For these, you would use [[https://en.wikipedia.org/wiki/Scientific_notation | scientific notation]], where a number is written as <m>m*10^n</m>.
  
-The ''m'' is the [[https://en.wikipedia.org/wiki/Significand | significand]], also called "mantissa" (which is the more common term when talking about floating point numbers in computing), while ''n'' is the [[https://en.wikipedia.org/wiki/Exponent | exponent]] and ''10'' is the [[https://en.wikipedia.org/wiki/Base_(exponentiation) | base]].+The <m>m</m> is the [[https://en.wikipedia.org/wiki/Significand | significand]], also called "mantissa" (which is the more common term when talking about floating point numbers in computing), while <m>n</m> is the [[https://en.wikipedia.org/wiki/Exponent | exponent]] and <m>10</m> is the [[https://en.wikipedia.org/wiki/Base_(exponentiation) | base]].
  
Let's return to our number <m>123.4</m>, which could be written as <m>123.4*10^0</m>. Any number to the zeroth power equals one, so multiplying by <m>10^0</m> is a no-op (and usually omitted, returning to normal //decimal notation//).
  
Increment or decrement the exponent to shift the decimal point of the significand one digit to the left or the right, respectively. So we could also write <m>12.34*10^1</m>, <m>1234*10^-1</m>, or <m>1.234*10^2</m> -- the decimal point is "floating" along the significand.
  
This allows us to write very large numbers like <m>1.234*10^78</m>, or very small numbers like <m>1.234*10^-34</m>, efficiently by scaling the exponent appropriately.
  
==== Normalized Number ====

==== Binary Encoding ====
  
First thing, the //base// used by a computer is (usually) 2 instead of 10, for obvious reasons. It is implied, and does not need to be stored.
  
One bit is used as //sign bit// for the mantissa (i.e. whether the whole number is positive or negative).

=== Sign ===
  
An unset sign bit indicates positive, a set one negative. Yes, this format has a [[https://en.wikipedia.org/wiki/Signed_zero | signed zero]].
  
=== Mantissa ===
  
The mantissa is assumed to be //normalized//, i.e. the highest (non-fractional) bit is assumed to be set. It has a value of <m>2^0</m>, i.e. 1, with each following fractional bit carrying the next smaller (negative) exponent. Under this assumption, the highest bit need not actually be stored, giving room for an additional //fractional// mantissa bit at the end.
  
Most people who have read this far can probably rattle down the values for //positive// base-2 exponents -- one, two, four, eight, sixteen and so on. //Negative// exponents usually give people a moment's pause. But it's quite easy: While each positive increment //doubles// the value, each decrement //halves// it -- <m>0.5</m>, <m>0.25</m>, <m>0.125</m>, <m>0.0625</m> and so on.
  
=== Exponent ===
  
The exponent is a funny bugger. IEEE 754 defines the exponent as an absolute (positive) offset to a (negative) bias. That bias is calculated as <m>1-2^(bits-1)</m>. "All bits clear" and "all bits set" are reserved (we will come to this later), so the smallest meaningful exponent is 1.
  
For our 3-bit exponent, we get a bias of <m>1-2^2 = -3</m>, with a practical range of <m>-2..3</m>:
  
^ Exponent ^ Value ^

=== Denormalized Numbers ===
  
When all exponent bits are cleared, the mantissa is considered to be [[https://en.wikipedia.org/wiki/Denormal_number | denormalized]], with an exponent of <m>1 + bias</m> -- the same as a stored exponent of 1.
  
In other words, the smallest representable exponent (with only the lowest bit set) is applied, //with the usually implied pre-fractional bit unset//. The purpose, here, is to allow a bit of additional precision in the very smallest numbers representable.

^ Exponent ^ Mantissa (stored) ^ Mantissa (implied) ^ Value ^
| 001      | 11                | 1.11                   | <m>1.75 * 2^(-2) = 0.4375</m> |
| 001      | 10                | 1.10                   | <m>1.50 * 2^(-2) = 0.375</m>  |
| 001      | 01                | 1.01                   | <m>1.25 * 2^(-2) = 0.3125</m> |
| 001      | 00                | 1.00                   | <m>1.00 * 2^(-2) = 0.25</m>   |
| 000      | 11                | **0**.11               | **0**<m>.75 * 2^(-2) = 0.1875</m> |
| 000      | 10                | **0**.10               | **0**<m>.50 * 2^(-2) = 0.125</m>  |
| 000      | 01                | **0**.01               | **0**<m>.25 * 2^(-2) = 0.0625</m> |
| 000      | 00                | **0**.00               | **0** |
  
The seminal works on the conversion of binary floating point to decimal string representation are:
  
  * Guy Steele, Jon White (1990): [[https://doi.org/10.1145/93548.93559 | How to Print Floating-Point Numbers Accurately]]
  * David Gay (1990): [[http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.4049 | Correctly Rounded Binary-Decimal and Decimal-Binary Conversions]]
  * Robert G. Burger, R. Kent Dybvig (1996): [[http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.2247 | Printing Floating-Point Numbers Quickly and Accurately]]
  * Florian Loitsch (2010): [[http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.1021.1906 | Printing Floating-Point Numbers Quickly and Accurately with Integers]]
  
Steele & White reference another seminal work on the opposite conversion, from decimal string to binary floating point representation:

  * William D. Clinger (1990): [[https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.4152 | How to read floating point numbers accurately]]

That first paper by Steele & White presented the "Dragon" algorithm in its fourth iteration, to which Gay and Burger / Dybvig submitted improvements. Loitsch presented a significant performance improvement he called "Grisu", of which there are three iterations.
  
(Thanks to [[https://www.ryanjuckett.com/programming/printing-floating-point-numbers/ | Ryan Juckett]] for writing a very nice introduction on the subject, which was at the root of what I assembled on this page.)

===== Dragon =====

Steele & White approach the presentation of their Dragon algorithm as a series of "lesser" algorithms building on each other.

==== Fixed Point Fraction Output ====

Given:

  * A positive value //f// less than 1, with //n// digits to (input) base //b// (usually 2).

Output:

  * A value //F// of //N// digits to (output) base //B// (usually 10).

Such that:

  - <m>delim{|}{F - f}{|} < { b^{-n} } / 2</m>
    * The difference between representations is less than half the positional value of the //n//th digit of //f//.
  - <m>N</m> is the smallest number of digits so that condition 1. can be true.
  - <m>delim{|}{F - f}{|} <= { B^{-N} } / 2</m>
    * The difference between representations is no more than half the positional value of the //N//th digit of //F//.
  - Each digit of //F// is output before the next is generated; no "back up" for corrections.
Algorithm <m>(FP)^3</m> (Finite-Precision Fixed-Point Fraction Printout):

  * <m>k = 0, R = f, M = { b ^ { -n } } / 2</m>
  * while ( 1 )
    * k++
    * U = floor( R * B )
    * R = ( R * B ) % 1
    * M *= B
    * if ( <m>R >= M</m> AND <m>R <= 1 - M</m> )
      * append( F, U )
    * else
      * break
  * if ( <m>R <= 1/2</m> )
    * append( F, U )
  * else
    * append( F, U + 1 )

At the end, <m>k</m> is <m>N</m>, i.e. the number of digits in <m>F</m>.

Example:

  * Given the base <m>b = 2</m> number <m>f = .10110</m>, with <m>n = 5</m>, to be converted to base <m>B = 10</m>
  * The //exact// decimal representation would be <m>.6875</m>
    * The next higher number (<m>f_{+1} = .10111</m>) would be decimal <m>0.71875</m>
    * The next lower number (<m>f_{-1} = .10101</m>) would be decimal <m>0.65625</m>
  * The mask would be <m>M = { b ^ {-n} } / 2 = { 2 ^ { -5 } } / 2 = 0.015625</m>

  * First (<m>k = 1</m>) loop
    * multiply <m>R = 0.6875</m> by <m>B = 10</m> for integral part <m>6</m>, fractional part <m>0.875</m>
    * multiply <m>M = 0.015625</m> by <m>B = 10</m> for new <m>M = 0.15625</m>
    * Fractional part <m>0.875</m> is larger than <m>1 - M = 0.84375</m>, so no digit is appended and the loop terminates
  * Post-loop
    * Fractional part <m>0.875</m> is not smaller than <m>1/2</m>, so <m>U + 1 = 7</m> becomes our only fractional digit
    * We have <m>N = k = 1</m> fractional digit in our result of <m>0.7</m>, which is the smallest <m>N</m> that uniquely identifies our original <m>f = .10110</m>

==== Floating-Point Printout ====

Given:

  * A base //b// floating-point number <m>v = f * b ^ (e - p)</m>, where //e//, //f//, and //p// are integers with <m>p >= 0</m> and <m>0 <= f < b ^ p</m>.

Output:

  * An approximate representation to base //B//, using exponential notation if the value is very large or very small.

Algorithm (Dragon2):

  * Compute <m>v prime = v * B ^ {-x}</m>, where <m>x</m> is chosen so that the result either is between 1 / //B// and 1, or between 1 and //B//
  * Print the integer part of <m>v prime</m>
  * Print a decimal point
  * Take the fractional part of <m>v prime</m>, let <m>n = p - floor ( log_ b v prime ) - 1</m>
  * Apply algorithm <m>(FP)^3</m> to that fractional part and //n//
  * If <m>x</m> was not zero, append the letter "E" and a representation of the scale factor <m>x</m>
pdclib/floatingpoint.txt · Last modified: 2019/10/21 10:13 by solar