Intro

DINOv2 is a self-supervised vision model developed by Meta AI (FAIR) that learns strong image embeddings without labels.


DINOv2 is a family of foundation models producing universal features suitable for image-level visual tasks (image classification, instance retrieval, video understanding) as well as pixel-level visual tasks (depth estimation, semantic segmentation).
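As a sketch of what "universal features for image-level tasks" means in practice, the snippet below performs toy instance retrieval by cosine similarity. The embeddings here are random NumPy placeholders standing in for real DINOv2 outputs (a real ViT-B/14 embedding is 768-dimensional); only the retrieval logic is illustrated.

```python
import numpy as np

# Hypothetical gallery of L2-normalized image embeddings standing in for
# DINOv2 features; real ViT-B/14 embeddings are 768-dimensional.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 768))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

# Query: a slightly perturbed copy of gallery image 2 (a "near-duplicate")
query = gallery[2] + 0.01 * rng.normal(size=768)
query /= np.linalg.norm(query)

# Instance retrieval: rank gallery images by cosine similarity to the query
scores = gallery @ query
best = int(np.argmax(scores))
print(best)  # retrieves index 2, the near-duplicate
```

Because the embeddings are unit-normalized, the dot product equals cosine similarity, which is the standard ranking score for retrieval with frozen self-supervised features.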

DINOv2 Variants

DINOv2 models use a 14×14-pixel patch size (e.g., ViT-B/14); the main variants have the following approximate specifications:

| Model Variant | Parameters | Layers (Depth) | Embedding Dim (D) | Attention Heads (H) |
|---------------|------------|----------------|-------------------|---------------------|
| ViT-S/14      | 21M        | 12             | 384               | 6                   |
| ViT-B/14      | 86M        | 12             | 768               | 12                  |
| ViT-L/14      | 304M       | 24             | 1024              | 16                  |