Intro

A floating-point number is a computer representation of a real number in binary. Floats typically use 32 or 64 bits and are stored in a form of binary scientific notation: a sign, a significand (mantissa), and an exponent, each encoded in binary. This allows a very large range of values to be represented, but with limited precision, which often leads to rounding errors.
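A minimal sketch of that limited precision, using Python (whose `float` is a 64-bit IEEE 754 double):

```python
# 0.1 and 0.2 cannot be stored exactly in binary, so their sum
# carries a tiny rounding error visible in the last digits.
a = 0.1 + 0.2
print(a)          # 0.30000000000000004
print(a == 0.3)   # False
```

This is the classic symptom: each operand is already a rounded approximation, and the addition rounds again.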

Representation

  • Precision:
    • single precision (32-bit, ~6-7 decimal digits)
    • double precision (64-bit, ~16-17 decimal digits)
  • Structure (IEEE 754 single precision):
    • sign (1 bit)
    • exponent (8 bits): scales the mantissa by a power of 2
    • fraction/mantissa (23 bits): stores the significant digits of the number (precision)
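The three fields can be pulled out of a concrete number with Python's standard `struct` module; the helper name `float32_fields` below is just illustrative:

```python
import struct

def float32_fields(x: float):
    """Split x into IEEE 754 single-precision sign/exponent/fraction fields."""
    # Pack as a 32-bit big-endian float, then reread the same bytes as an int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 bits; the leading 1 is implicit
    return sign, exponent, fraction

# -6.25 = -1.1001 (binary) * 2^2, so the biased exponent is 127 + 2 = 129.
print(float32_fields(-6.25))  # (1, 129, 4718592)
```

Note the stored exponent is biased (the real exponent plus 127), which is why 129 appears rather than 2.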

Precision and Rounding

  • Rounding Error: Not all decimal numbers can be represented exactly in binary, so floating-point arithmetic can introduce small errors.
  • Example: 0.1 = 1/10 → in binary, this is a repeating fraction:
0.1 (decimal) = 0.00011001100110011… (binary, repeating)
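Because that binary expansion repeats forever, the stored value is only the nearest representable double. Python's `decimal` module can print exactly what was stored:

```python
from decimal import Decimal

# Decimal(0.1) converts the double's exact bits, not the string "0.1",
# revealing the slightly-too-large value actually in memory.
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```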

  • Overflow / underflow: very large magnitudes exceed the maximum representable value and become infinity; very small magnitudes fall below the minimum positive value and may round to zero.
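Both limits are easy to hit deliberately; the constants below assume 64-bit doubles (max finite value about 1.8e308, smallest positive subnormal 5e-324):

```python
import math

big = 1e308
print(big * 10)   # inf: the product exceeds the largest finite double

tiny = 5e-324     # smallest positive (subnormal) double
print(tiny / 2)   # 0.0: the quotient underflows to zero
```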

Float variation

  • Float Exposed

description