Floating point

From Computer History Wiki
Revision as of 17:34, 17 March 2024 by Jnc (talk | contribs) (Link normalized)

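Floating point is a term used to describe computer support for real numbers, which are stored as approximations with a limited number of significant digits. Originally performed in software, it is now almost always done in hardware, either in a separate floating point processor or in a floating point unit built into the CPU.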

Most implementations have fixed precision, and thus must truncate or round any result that does not fit. Irrational numbers (and many rational ones, such as 1/10 in binary) can therefore only be stored approximately; floating point can still be used for them, provided the user is prepared to accept the slight loss of accuracy.
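The rounding described above is easy to observe. The following is a minimal sketch, assuming Python's float is an IEEE double-precision number (true on essentially all current platforms); the helper name exact_value is purely illustrative.

```python
import math
from fractions import Fraction


def exact_value(x: float) -> Fraction:
    """Return the exact rational value actually stored for the float x."""
    return Fraction(x)


# 1/10 has no finite binary expansion, so the stored value is only close to it.
stored_tenth = exact_value(0.1)
tenth_is_exact = stored_tenth == Fraction(1, 10)

# Rounding in each operand and in the sum makes the naive comparison fail,
# while a comparison with a tolerance succeeds.
sum_matches = (0.1 + 0.2 == 0.3)
sum_close = math.isclose(0.1 + 0.2, 0.3)

# The square root of 2 is irrational, so squaring the rounded root
# does not give back exactly 2.
root_two_squared = math.sqrt(2.0) ** 2

print(tenth_is_exact, sum_matches, sum_close, root_two_squared)
```

Comparing with a tolerance, as math.isclose does, is the usual way to live with this loss of accuracy.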

Most hardware implementations of floating point use three fields: a sign bit, an exponent, and a mantissa (fractional part); i.e. floating point numbers are represented in the form ±.X * 2^Y, where Y can be negative or positive.
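As an illustration of the .X * 2^Y form, Python's standard math.frexp function splits a number into a mantissa between 1/2 and 1 and a power-of-two exponent. This is only a sketch of the arithmetic, not of any particular hardware format; the function name split_float is an assumption made for the example.

```python
import math
from typing import Tuple


def split_float(x: float) -> Tuple[int, float, int]:
    """Split a finite x into (sign, mantissa, exponent).

    For nonzero x the mantissa lies in [0.5, 1) and
    x == (-1)**sign * mantissa * 2**exponent.
    """
    # copysign also detects the sign of -0.0, which a plain x < 0 test would miss.
    sign = 1 if math.copysign(1.0, x) < 0 else 0
    mantissa, exponent = math.frexp(abs(x))
    return sign, mantissa, exponent


# -6.0 is -(0.75 * 2**3): sign 1, mantissa 0.75, exponent 3.
sign, mantissa, exponent = split_float(-6.0)

# math.ldexp is the inverse of math.frexp: it computes mantissa * 2**exponent.
rebuilt = (-1) ** sign * math.ldexp(mantissa, exponent)

print(sign, mantissa, exponent, rebuilt)
```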

Historically, different architectures devised their own floating point specifications (e.g. FP11 floating point), but there is now an IEEE standard for floating point, IEEE 754. The following binary formats (each with one sign bit, and the number of exponent and mantissa bits as given) for storing floating point numbers exist:

  • Single-precision: 8+23
  • Double-precision: 11+52
  • Quadruple-precision: 15+112
  • Octuple-precision: 19+236

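In all four, there is actually one more bit of mantissa than shown, because numbers are usually stored in normalized form (i.e. the mantissa is constrained to be at least 1/2 and less than 1, without any leading 0's to the right of the 'point'); the first mantissa bit is therefore always a '1', which is not stored. (IEEE 754 itself describes the same bits as 1.X rather than .1X, with the exponent one lower.) The exceptions are zero and very small 'denormalized' numbers, which have no hidden '1'.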
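The field layout and the hidden bit can be seen by taking a single-precision number apart. The sketch below assumes the IEEE 754 single-precision layout (1+8+23 bits, with the exponent stored with a bias of 127, a detail not given above); the function names are illustrative, and only normalized numbers are handled.

```python
import struct

EXPONENT_BITS = 8
MANTISSA_BITS = 23
EXPONENT_BIAS = 127  # IEEE 754 single-precision exponent bias


def single_fields(x: float):
    """Return (sign, stored_exponent, stored_mantissa) of x as an IEEE single."""
    # Pack as a 32-bit float, then reinterpret the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> (EXPONENT_BITS + MANTISSA_BITS)
    stored_exponent = (bits >> MANTISSA_BITS) & ((1 << EXPONENT_BITS) - 1)
    stored_mantissa = bits & ((1 << MANTISSA_BITS) - 1)
    return sign, stored_exponent, stored_mantissa


def value_from_fields(sign, stored_exponent, stored_mantissa):
    """Rebuild a normalized value (valid only for 0 < stored_exponent < 255)."""
    # Put back the hidden '1' to get the full 24-bit mantissa.
    full_mantissa = (1 << MANTISSA_BITS) | stored_mantissa
    # Read it as .X, a fraction in [1/2, 1).
    fraction = full_mantissa / (1 << (MANTISSA_BITS + 1))
    # The +1 converts from the IEEE 1.X convention to the .X convention.
    exponent = stored_exponent - EXPONENT_BIAS + 1
    return (-1) ** sign * fraction * 2.0 ** exponent


fields = single_fields(6.0)
print(fields, value_from_fields(*fields))
```

For 6.0 the stored mantissa has only its top bit set (0x400000); restoring the hidden '1' gives the fraction 0.75, and 0.75 * 2^3 = 6.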

External links