Intro

A Vision Transformer (ViT) is a Transformer model designed for computer vision tasks such as image classification. It treats an image as a sequence of “visual words” (patches) and processes that sequence with self-attention rather than the convolutional layers used in CNNs.
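To make "self-attention over patches" concrete, here is a minimal single-head sketch in NumPy. It assumes the patches have already been turned into token vectors; the sizes (16 tokens, width 64, head width 32) and the random weights `w_q`, `w_k`, `w_v` are illustrative placeholders, not values from any particular ViT.

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Every patch token scores every other patch token.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted mix of all value vectors.
    return weights @ v, weights

rng = np.random.default_rng(0)
num_patches, dim, head_dim = 16, 64, 32   # assumed sizes, for illustration only
tokens = rng.normal(size=(num_patches, dim))
w_q = rng.normal(scale=0.02, size=(dim, head_dim))
w_k = rng.normal(scale=0.02, size=(dim, head_dim))
w_v = rng.normal(scale=0.02, size=(dim, head_dim))

attended, attn_weights = self_attention(tokens, w_q, w_k, w_v)
print(attended.shape, attn_weights.shape)
```

Unlike a convolution, which only mixes information within a local window, each row of `attn_weights` spans all patches, so any patch can draw on any other patch in a single layer.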

Vision Transformer

The architecture of the Vision Transformer. The input image is divided into patches, each patch is linearly projected by a patch embedding layer, and the resulting sequence is fed into a standard Transformer encoder.
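The patch-splitting and linear mapping steps in the figure can be sketched in a few lines of NumPy. This is a rough illustration under assumed sizes (a 32×32 RGB image, 8×8 patches, embedding width 64); `patchify`, `W_embed`, and `b_embed` are hypothetical names, and the weights are random rather than learned. It leaves out the position embeddings and class token that the original ViT adds before the encoder.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # assumed toy input
patch_size, embed_dim = 8, 64     # assumed hyperparameters

patches = patchify(image, patch_size)            # one row per patch
# Patch embedding layer: a single linear map applied to every patch.
W_embed = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
b_embed = np.zeros(embed_dim)
tokens = patches @ W_embed + b_embed             # sequence fed to the encoder

print(patches.shape, tokens.shape)
```

With these sizes the image becomes a sequence of 16 tokens. Halving the patch size would quadruple the sequence length, which matters because self-attention cost grows quadratically with the number of tokens.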
