Intro

UTF-8 (Unicode Transformation Format, 8-bit) is a variable-length character encoding that uses 1 to 4 bytes per character. It is backwards-compatible with ASCII and is the preferred encoding for web pages.

UTF-8 Byte Examples

Character   Unicode    UTF-8 Bytes (hex)
A           U+0041     41
é           U+00E9     C3 A9
あ          U+3042     E3 81 82
😀          U+1F600    F0 9F 98 80

  • ASCII characters (U+0000 to U+007F) → 1 byte
  • Characters beyond ASCII → 2, 3, or 4 bytes depending on the code point
  • Unicode is a character set. It is a list in which every character has a unique number (its code point):
A = 65
B = 66
C = 67
D = 68
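
The numbers and byte sequences above can be checked with a short Python sketch. The helper name describe and the sample list are illustrative choices, not part of any standard API; ord(), chr() and str.encode() are Python built-ins.

```python
# Sample characters from the byte examples above: A, é, あ, 😀
# (written as escapes so the exact code points are unambiguous).
samples = ["A", "\u00e9", "\u3042", "\U0001F600"]


def describe(ch):
    """Return (code point, UTF-8 bytes in hex, byte count) for one character.

    Hypothetical helper for illustration only.
    """
    encoded = ch.encode("utf-8")
    return (f"U+{ord(ch):04X}", encoded.hex(" ").upper(), len(encoded))


# ord() gives the Unicode number of a character; chr() goes the other way.
print(ord("A"), chr(68))

for ch in samples:
    print(ch, *describe(ch))
```

Characters in the ASCII range come back as a single byte, while the others need 2, 3, or 4 bytes.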

The decimal numbers that represent the string "hello":

h   e   l   l   o
104 101 108 108 111
Binary (UTF-8): 01101000 01100101 01101100 01101100 01101111
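
As a quick check, the same values can be reproduced in Python. This is a minimal sketch; the variable names are arbitrary.

```python
text = "hello"

# Encode the string to UTF-8; each ASCII letter becomes exactly one byte.
data = text.encode("utf-8")

# Iterating over a bytes object yields the decimal value of each byte.
decimals = list(data)

# Format each byte as an 8-digit binary number.
binary = " ".join(f"{b:08b}" for b in data)

print(decimals)
print(binary)
```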
  • UTF-8 is an encoding. It defines how Unicode numbers (code points) are translated into bytes for storage and transmission.
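
To make the translation step concrete, here is a hand-written sketch of the UTF-8 bit layout. The function name utf8_encode is hypothetical; in practice Python's built-in str.encode("utf-8") does this work, and the sketch exists only to show where the bits go.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand (illustrative only)."""
    # Reject values outside the Unicode range and the surrogate block.
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")

    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the hand-written encoder with Python's built-in one.
for cp in (0x41, 0xE9, 0x3042, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" ").upper(),
          utf8_encode(cp) == chr(cp).encode("utf-8"))
```

The leading bits of the first byte signal how many bytes the character occupies, and every continuation byte starts with the bits 10.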