Intro
A Vision Transformer (ViT) is a Transformer model designed for computer vision tasks such as image classification. It treats an image as a sequence of “visual words” (patches) and processes that sequence with self-attention rather than the convolutional layers used in CNNs.
Figure: The architecture of the Vision Transformer. An input image is divided into patches, each of which is linearly projected by a patch embedding layer before entering a standard Transformer encoder.
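
To make the patch step concrete, here is a minimal NumPy sketch of splitting an image into patches and linearly projecting each one. The function names (`patchify`, `embed_patches`), the 224×224 input, the 16×16 patch size, and the 768-dimensional embedding are illustrative assumptions rather than values given above, and the projection weights are random instead of learned.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, C) -> (gh, gw, p, p, C) -> (gh * gw, p * p * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(patches, weight, bias):
    """Apply the same linear map to every flattened patch."""
    return patches @ weight + bias


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # assumed input size
patch_size, embed_dim = 16, 768     # assumed hyperparameters

patches = patchify(image, patch_size)
# Random stand-in for a learned projection matrix.
weight = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
bias = np.zeros(embed_dim)
tokens = embed_patches(patches, weight, bias)

print(patches.shape, tokens.shape)  # (196, 768) (196, 768)
```

Each row of `tokens` plays the role of one "visual word": a 224×224 image with 16×16 patches yields a sequence of 196 vectors, which is what the Transformer encoder would consume.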




