Intro

A Vision Transformer (ViT) is a Transformer model designed for computer vision tasks such as image classification. A ViT treats an image as a sequence of “visual words” (patches) and processes that sequence with self-attention, rather than with the convolutional layers used in CNNs.

Figure: The architecture of the Vision Transformer. An input image is divided into patches, each of which is linearly mapped through a patch embedding layer before entering a standard Transformer encoder.
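
The patch step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `patchify` and `embed_patches` are hypothetical, the image size (224), patch size (16), and embedding width (768) are assumed values chosen for the example, and random weights stand in for the learned patch embedding layer. It leaves out everything that happens after the embedding, including the Transformer encoder itself.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the image.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    grid = image.reshape(gh, patch_size, gw, patch_size, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(patches: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Linearly map each flattened patch to an embedding vector (one token per patch)."""
    return patches @ weight + bias


# Assumed, illustrative sizes: a 224x224 RGB image, 16x16 patches, 768-dim tokens.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patch_size, embed_dim = 16, 768

patches = patchify(image, patch_size)

# Random parameters stand in for the learned patch embedding layer.
weight = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
bias = np.zeros(embed_dim)

tokens = embed_patches(patches, weight, bias)
print(patches.shape, tokens.shape)
```

With these sizes the image yields 14 × 14 = 196 patches, each flattened to 16 × 16 × 3 = 768 values, so the encoder would receive a sequence of 196 tokens. The sequence length depends only on the image and patch sizes, while the token width is a free choice of the embedding layer.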