Intro

A floating-point number is a computer representation of a real number in binary. Floats typically use 32 or 64 bits and are stored in a form of binary scientific notation: a sign, a significand (mantissa), and an exponent, all encoded in binary. This format can represent a very wide range of values, but only with limited precision, which leads to effects such as rounding errors.

Representation

Precision:
- single precision (32-bit, ~6-7 decimal digits)
- double precision (64-bit, ~15-17 decimal digits)
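
The gap between the two precisions can be seen by rounding a double to single precision and back. This is a minimal sketch: it assumes Python, whose built-in float is a 64-bit double, and the helper name to_float32 is made up for illustration.

```python
import struct

def to_float32(x):
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    # Packing as "f" stores the value in 4 bytes; unpacking widens it again.
    return struct.unpack("<f", struct.pack("<f", x))[0]

value = 0.1
as_single = to_float32(value)

print(f"double: {value:.20f}")
print(f"single: {as_single:.20f}")
print(f"difference: {abs(as_single - value):.3e}")

# 2**24 + 1 needs 25 significant bits, one more than single precision holds.
print(to_float32(16777217.0) == 16777216.0)
```

The single-precision copy of 0.1 already differs from the double around the 9th decimal place, and the last line shows that single precision cannot tell 16777217 from 16777216.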
Structure (IEEE 754 single precision):
- sign (1 bit): 0 for positive, 1 for negative
- exponent (8 bits): scales the mantissa by a power of 2
- fraction/mantissa (23 bits): stores the significant digits of the number (precision)
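
The three fields can be pulled out of a 32-bit float with a few shifts and masks. The sketch below assumes Python and the standard struct module; float32_fields and rebuild are hypothetical helper names, and rebuild only handles normal numbers (not zero, subnormals, infinity, or NaN).

```python
import struct

def float32_fields(x):
    """Return (sign, exponent, fraction) of x encoded as an IEEE 754 single."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, fraction

def rebuild(sign, exponent, fraction):
    """Reconstruct a normal number from its fields (illustrative only)."""
    # Normal numbers have an implicit leading 1 in front of the fraction.
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)

sign, exponent, fraction = float32_fields(-6.25)
print(sign, exponent, fraction)
print(rebuild(sign, exponent, fraction))
```

For -6.25, which is -1.5625 × 2^2, the sign bit is 1, the stored exponent is 2 + 127 = 129, and the fraction encodes the 0.5625 after the implicit leading 1.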

Precision and Rounding

Rounding Error: Not all decimal numbers can be represented exactly in binary, so floating-point arithmetic can introduce small errors.
Example: 0.1 = 1/10 → in binary, this is a repeating fraction:

0.1 (decimal) = 0.00011001100110011… (binary, repeating)
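
The effect of that repeating fraction is easy to observe. The following is a small sketch assuming Python, where floats are double precision; it prints the value actually stored for 0.1 and shows why equality checks on floats usually need a tolerance.

```python
from decimal import Decimal
import math

# Decimal(float) converts the stored binary value exactly, with no rounding.
stored = Decimal(0.1)
print(stored)

# The small errors in 0.1 and 0.2 do not cancel when added.
total = 0.1 + 0.2
print(total == 0.3)

# Comparing with a tolerance is the usual workaround.
print(math.isclose(total, 0.3))
```

The stored value is slightly above 0.1, so the exact comparison prints False while the tolerance-based comparison prints True.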

Overflow / underflow: Numbers whose magnitude exceeds the maximum representable value overflow (to infinity under IEEE 754); numbers whose magnitude falls below the minimum positive value underflow and may become zero.
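
Both limits can be reached directly. This sketch assumes Python, so the limits shown are those of double precision rather than single precision; the variable names are illustrative.

```python
import math
import sys

largest = sys.float_info.max   # largest finite double
overflowed = largest * 2       # too large to represent: becomes infinity
print(largest, overflowed, math.isinf(overflowed))

smallest_subnormal = 5e-324    # smallest positive double (2**-1074)
underflowed = smallest_subnormal / 2   # too small to represent: becomes zero
print(smallest_subnormal, underflowed, underflowed == 0.0)
```

Multiplication overflows quietly to inf here; some other Python operations, such as `**` on floats, raise OverflowError instead.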

Float variation

- Float Exposed

