Intro
DINOv2 is a self-supervised vision model developed by Meta AI (FAIR) that learns strong image embeddings without labels. It is a family of foundation models producing universal features suitable for image-level visual tasks (image classification, instance retrieval, video understanding) as well as pixel-level visual tasks (depth estimation, semantic segmentation).
DINOv2 Variants
All DINOv2 variants use a 14×14-pixel patch size (hence names like ViT-B/14). Approximate specifications for the main variants:
| Model Variant | Parameters | Layers (Depth) | Embedding Dim (D) | Attention Heads (H) |
|---------------|------------|----------------|-------------------|---------------------|
| ViT-S/14      | 21M        | 12             | 384               | 6                   |
| ViT-B/14      | 86M        | 12             | 768               | 12                  |
| ViT-L/14      | 304M       | 24             | 1024              | 16                  |
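The patch size and embedding dimension determine the shape of the features a DINOv2 backbone produces. A minimal sketch of that arithmetic, using the variant specs from the table above (the `VARIANTS` dict and `token_shape` helper are illustrative, not part of any DINOv2 API):

```python
# Approximate DINOv2 variant specs from the table above
# (illustrative values; not an official API).
VARIANTS = {
    "ViT-S/14": {"params_m": 21, "depth": 12, "dim": 384, "heads": 6},
    "ViT-B/14": {"params_m": 86, "depth": 12, "dim": 768, "heads": 12},
    "ViT-L/14": {"params_m": 304, "depth": 24, "dim": 1024, "heads": 16},
}

PATCH = 14  # all variants share a 14x14-pixel patch size


def token_shape(variant: str, height: int, width: int) -> tuple[int, int]:
    """Return (num_tokens, embedding_dim) for an input image of the given size.

    The image is split into an (H/14) x (W/14) grid of patches; one CLS
    token is prepended for image-level tasks.
    """
    if height % PATCH or width % PATCH:
        raise ValueError(f"image sides must be multiples of {PATCH}")
    grid_tokens = (height // PATCH) * (width // PATCH)
    return grid_tokens + 1, VARIANTS[variant]["dim"]


# A 224x224 crop yields a 16x16 patch grid -> 256 patch tokens + 1 CLS token.
print(token_shape("ViT-B/14", 224, 224))  # (257, 768)
```

The patch-token grid (everything except the CLS token) is what pixel-level tasks such as segmentation or depth estimation consume, while the CLS token serves image-level tasks.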