Floating point

From Computer History Wiki
Revision as of 01:56, 14 November 2016 by Jnc (talk | contribs) (A reasonable start)

Floating point is the term for the usual form of computer support for real numbers, which it represents approximately. Originally performed in software, it is now almost always done in hardware. Most implementations have fixed precision, and so must truncate or round any result that does not fit. Irrational numbers, and many rational numbers such as 1/10 that have no finite binary expansion, can therefore only be stored approximately, and the user has to be prepared to accept the slight loss of accuracy.
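The loss of accuracy can be seen directly. The following is a minimal sketch, assuming a Python interpreter whose float type is an IEEE double (true of common platforms); the function name rounding_demo is only illustrative.

```python
import math
from fractions import Fraction

def rounding_demo():
    """Return three comparisons that would all be True in exact arithmetic."""
    # 1/10 has no finite binary expansion, so the stored value is only close to it.
    # Fraction(0.1) gives the exact value of the stored float.
    tenth_is_exact = Fraction(0.1) == Fraction(1, 10)

    # Each operand and the sum are rounded, so the result misses 0.3 slightly.
    sum_matches = (0.1 + 0.2) == 0.3

    # sqrt(2) is irrational; squaring its rounded value does not give back exactly 2.
    root = math.sqrt(2.0)
    square_is_two = (root * root) == 2.0

    return tenth_is_exact, sum_matches, square_is_two

print(rounding_demo())
# Comparing with a tolerance is the usual way around this.
print(math.isclose(0.1 + 0.2, 0.3))
```

All three exact comparisons fail, while the comparison with a tolerance succeeds.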

Most hardware implementations of floating point use three fields: a sign bit, an exponent, and a mantissa (fractional part); i.e. floating point numbers are represented in the form ±.X * 2^Y, where the exponent Y can be negative or positive.
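As an illustration of the .X * 2^Y form, Python's math.frexp splits a float into a fraction with magnitude in [1/2, 1) and a power of two. This is a sketch of the idea rather than of any particular hardware format; the helper name decompose is an assumption made for the example.

```python
import math

def decompose(x):
    """Split x into (sign, fraction, exponent) with x == sign * fraction * 2**exponent."""
    fraction, exponent = math.frexp(x)   # 0.5 <= |fraction| < 1 for nonzero finite x
    sign = -1 if math.copysign(1.0, x) < 0 else 1
    return sign, abs(fraction), exponent

print(decompose(6.0))        # 6 = +.75 * 2**3
print(decompose(-0.15625))   # -0.15625 = -.625 * 2**-2
```

In binary, .75 is .11 and .625 is .101, so both fractions start with a 1 immediately to the right of the point.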

Historically, different architectures devised their own floating point specifications (e.g. FP11 floating point), but almost all current hardware follows the IEEE 754 standard for floating point. Its binary formats for storing floating point numbers include the following (each has one sign bit, plus exponent and mantissa bits as given):

  • Single-precision: 8+23
  • Double-precision: 11+52
  • Quadruple-precision: 15+112
  • Octuple-precision: 19+236
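The field widths listed above can be checked against real bit patterns. The sketch below assumes Python's struct module, whose 'f' and 'd' codes pack IEEE single and double precision; the names FORMATS, total_bits, split_fields, single_bits and double_bits are introduced only for this example.

```python
import struct

# (exponent bits, mantissa bits) as listed above; one sign bit is added to each.
FORMATS = {
    "single": (8, 23),
    "double": (11, 52),
    "quadruple": (15, 112),
    "octuple": (19, 236),
}

def total_bits(name):
    """Total storage width: sign + exponent + mantissa."""
    exp_bits, man_bits = FORMATS[name]
    return 1 + exp_bits + man_bits

def split_fields(bits, name):
    """Split an integer bit pattern into (sign, exponent field, mantissa field)."""
    exp_bits, man_bits = FORMATS[name]
    mantissa = bits & ((1 << man_bits) - 1)
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    sign = bits >> (man_bits + exp_bits)
    return sign, exponent, mantissa

def single_bits(x):
    """Bit pattern of x stored in single precision, as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def double_bits(x):
    """Bit pattern of x stored in double precision, as an integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

for name in FORMATS:
    print(name, total_bits(name))      # 32, 64, 128 and 256 bits

print(split_fields(single_bits(-6.0), "single"))
print(split_fields(double_bits(-6.0), "double"))
```

The exponent fields printed here are stored with an offset (bias) added, which is how a field with no sign bit of its own can hold negative exponents.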

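In all four IEEE formats, there is effectively one more bit of mantissa than shown, because values are normally stored in 'normalized' form: the mantissa is constrained to be at least 1/2 and less than 1, so there are no leading 0's to the right of the 'point'. The first mantissa bit is therefore always a '1', and it is not stored. (The IEEE standard describes the same bits with the 'point' placed after that leading '1', so that the mantissa lies between 1 and 2 and the exponent is one smaller; the two views are equivalent.)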
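The unstored leading '1' can be put back by hand. The sketch below decodes a single-precision value in the .X * 2^Y convention used here. The exponent offset is an IEEE 754 detail not given in the text above: the stored exponent field has a bias of 127 when the point follows the leading 1, which becomes 126 when the point precedes it. The helper names decode_single and to_single are assumptions for the example.

```python
import struct

def decode_single(x):
    """Return (sign, fraction, exponent) rebuilt from the stored single-precision fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = -1 if bits >> 31 else 1
    exp_field = (bits >> 23) & 0xFF
    stored = bits & 0x7FFFFF             # the 23 mantissa bits actually stored
    if exp_field in (0, 255):
        # Zero, denormalized numbers, infinities and NaNs are not normalized.
        raise ValueError("not a normalized number, so no hidden bit")
    # Put the hidden 1 in front of the 23 stored bits (24 bits in all) and
    # read the result as a binary fraction .1xxx..., which lies in [1/2, 1).
    fraction = ((1 << 23) | stored) / (1 << 24)
    exponent = exp_field - 126           # assumed IEEE bias, adjusted for the .X form
    return sign, fraction, exponent

def to_single(x):
    """Round a Python float to single precision and back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

sign, fraction, exponent = decode_single(6.0)
print(sign, fraction, exponent)          # the fraction is .75, i.e. binary .11

# 2**24 - 1 needs 24 significant bits and survives the round trip;
# 2**24 + 1 needs 25 and does not.
print(to_single(16777215.0) == 16777215.0)
print(to_single(16777217.0) == 16777217.0)
```

The round-trip check is the practical consequence of the hidden bit: single precision carries 24 significant bits even though only 23 are stored.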