DINOv2 Image Features Models in the Browser

Meta's self-supervised image feature extraction models: powerful visual representations without task-specific training.

What is the smallest DINOv2 model available?

The smallest variant listed here is the DINOv3 ViT-S/16 model: approximately 86 MB in FP32, with 21M parameters and 384-dimensional feature vectors. Quantized versions of the two listed models are substantially smaller than their FP32 counterparts, ranging from 21 to 91 MB.

Does DINOv2 require WebGPU to run in the browser?

No. Both listed variants run through Transformers.js on WASM and work in all modern browsers, including Firefox and Safari, without requiring WebGPU.

What practical applications can I build with DINOv2 in the browser?

DINOv2 enables visual duplicate detection, style-based product recommendations, image clustering for photo organization, and quality assessment. Combined with LocalMode's VectorDB, you can build a complete visual search engine running entirely in the browser.

What is the difference between DINOv2 and CLIP for image search?

DINOv2 features are purely visual, capturing what an image looks like without anchoring to text descriptions. This makes them better for finding visually similar images regardless of semantic content, such as matching lighting, texture patterns, or visual style. CLIP maps images and text into a shared space for cross-modal search.

Overview

The DINOv2 Image Features family is available through Transformers.js in LocalMode, with FP32 model sizes ranging from ~86–347 MB (quantized variants are substantially smaller). The primary task for these models is image-features, and they can be used with any application built on the LocalMode SDK.

Running DINOv2 Image Features models locally in the browser eliminates API costs, removes network latency, and keeps all user data on-device. After the initial model download, inference runs entirely on the device with no network round trip and works offline. Each model variant targets a different trade-off between size, speed, and quality, so choose based on your users' device capabilities and your application's requirements.

Architecture and History

DINOv2 (Distillation with No Labels v2) is Meta's self-supervised vision model that learns powerful visual representations from images alone - no labeled training data required. The resulting feature vectors capture rich semantic information about image content, composition, texture, and structure. These features can be used for image similarity, visual search, clustering, and as inputs to downstream classifiers.

The key advantage of DINOv2 over CLIP/SigLIP is that DINOv2 features are purely visual - they capture what an image looks like without anchoring to text descriptions. This makes them better for tasks like finding visually similar images regardless of semantic content: "show me photos with similar lighting," "find images with this texture pattern," or "cluster product photos by visual style."

DINOv2-base (ViT-B/14, patch size 14) produces 768-dimensional feature vectors from 224×224 input images, with 86M parameters. The dinov3 variant uses the ViT-S/16 architecture from Meta's officially released DINOv3 model (arXiv 2508.10104), offering a lighter alternative at 384 dimensions and 21M parameters. Both models run through Transformers.js on WASM and work in all modern browsers.
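To make the difference between 768 and 384 dimensions concrete, the sketch below estimates the raw storage needed for an index of feature vectors. It assumes vectors are stored unquantized as Float32 (4 bytes per component) and ignores index and metadata overhead; `indexBytes` is a hypothetical helper for illustration, not part of LocalMode.

```typescript
// Hypothetical helper: raw bytes needed to store `count` Float32 feature vectors.
function indexBytes(count: number, dims: number): number {
  return count * dims * Float32Array.BYTES_PER_ELEMENT; // 4 bytes per component
}

const images = 10_000;
const baseBytes = indexBytes(images, 768);  // dinov2-base feature size
const smallBytes = indexBytes(images, 384); // dinov3 ViT-S/16 feature size

console.log((baseBytes / 1e6).toFixed(2), 'MB');  // 30.72 MB
console.log((smallBytes / 1e6).toFixed(2), 'MB'); // 15.36 MB
```

Under these assumptions, the smaller variant halves the storage per image as well as the model download.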

Practical applications include: visual duplicate detection (finding near-identical images in a collection), style-based recommendations (suggesting products with similar visual aesthetics), image clustering for photo organization, and quality assessment (comparing feature distributions between high and low quality images). Combined with LocalMode's VectorDB, you can build a complete visual search engine running entirely in the browser.
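Most of the applications above reduce to comparing feature vectors. A minimal sketch of that comparison, assuming cosine similarity as the metric and a plain in-memory array in place of a real vector database, might look like this. The names `cosineSimilarity`, `topK`, and `IndexedImage` are illustrative and are not LocalMode APIs.

```typescript
// Cosine similarity between two feature vectors of equal length.
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) throw new Error('Vector lengths differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // treat zero vectors as dissimilar
}

interface IndexedImage {
  id: string;
  features: ArrayLike<number>;
}

// Return the k stored images most similar to the query, best match first.
function topK(query: ArrayLike<number>, index: IndexedImage[], k: number) {
  return index
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.features) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 3-dimensional vectors stand in for real 384d or 768d features.
const index: IndexedImage[] = [
  { id: 'a', features: [1, 0, 0] },
  { id: 'b', features: [0.9, 0.1, 0] },
  { id: 'c', features: [0, 1, 0] },
];
console.log(topK([1, 0, 0], index, 2).map((r) => r.id)); // [ 'a', 'b' ]
```

Near-duplicate detection can reuse the same function with a similarity threshold; the right threshold depends on the data and should be tuned rather than assumed.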

Variant Comparison

The following table lists every DINOv2 Image Features variant available through LocalMode, across all supported providers. Click a model ID to view its HuggingFace model card.

Model ID	Provider	Size (FP32)	Speed	Quality	Feature Dim	Device
onnx-community/dinov2-base-ONNX	Transformers.js	347MB	Medium	High	768d	WASM
onnx-community/dinov3-vits16-pretrain-lvd1689m-ONNX	Transformers.js	86MB	Fast	Good	384d	WASM

Size Distribution

Size Range	Count
200MB–500MB	1 variant
Under 200MB	1 variant

How to choose a variant: Start with the smallest model that meets your quality requirements. For prototyping and development, use the fastest variant (smallest size, "Fast" speed tier). For production, test your specific use case against both variants and measure the quality difference against user expectations. In many applications, users cannot distinguish between the "Good" and "High" quality tiers, in which case the smaller model saves download time and memory.
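The "smallest model that meets your quality requirements" rule can be written down directly. The sketch below copies the sizes and tiers from the comparison table; `pickVariant`, `Variant`, and the numeric ranking of tiers are hypothetical and exist only for illustration.

```typescript
type Quality = 'Good' | 'High';

interface Variant {
  id: string;
  sizeMB: number; // FP32 size from the comparison table
  quality: Quality;
  dims: number;
}

// Values copied from the comparison table above.
const VARIANTS: Variant[] = [
  { id: 'onnx-community/dinov2-base-ONNX', sizeMB: 347, quality: 'High', dims: 768 },
  { id: 'onnx-community/dinov3-vits16-pretrain-lvd1689m-ONNX', sizeMB: 86, quality: 'Good', dims: 384 },
];

// Assumed ordering of the quality tiers, lowest to highest.
const QUALITY_RANK: Record<Quality, number> = { Good: 1, High: 2 };

// Hypothetical selector: the smallest variant whose tier meets the minimum.
function pickVariant(minQuality: Quality, variants: Variant[] = VARIANTS): Variant | undefined {
  return variants
    .filter((v) => QUALITY_RANK[v.quality] >= QUALITY_RANK[minQuality])
    .sort((a, b) => a.sizeMB - b.sizeMB)[0];
}

console.log(pickVariant('Good')?.id); // onnx-community/dinov3-vits16-pretrain-lvd1689m-ONNX
console.log(pickVariant('High')?.id); // onnx-community/dinov2-base-ONNX
```

The tiers are LocalMode's curated labels, so a selector like this is a starting point for testing, not a substitute for measuring quality on your own images.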

Provider-Specific Code Examples

All DINOv2 Image Features variants use the same ImageFeatureModel interface from @localmode/core. Transformers.js is currently the only provider for this family, so switching between variants requires changing only the model ID, with no application logic changes.

Transformers.js

Transformers.js runs ONNX-optimized models via ONNX Runtime Web, using WebGPU acceleration where available and falling back to WASM otherwise. Both variants in the table above are listed with WASM as their device.

import { transformers } from '@localmode/transformers';

const model = transformers.imageFeatures('onnx-community/dinov2-base-ONNX');
// Use the model with the corresponding @localmode/core function

Fallback Pattern

For maximum browser compatibility, wrap model loading in a try/catch: attempt the preferred model first, and fall back to a smaller variant if it fails to load.

import { transformers } from '@localmode/transformers';

// Try the preferred model, fall back to a smaller one on failure
let model;
try {
  model = transformers.imageFeatures('onnx-community/dinov2-base-ONNX');
} catch (error) {
  console.warn('Primary model failed, using fallback:', error);
  model = transformers.imageFeatures('onnx-community/dinov3-vits16-pretrain-lvd1689m-ONNX');
}
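The try/catch above only observes errors thrown by the factory call itself. If the factory defers downloading and initialization until the model is first used (an assumption; this page does not say either way), load failures would surface later, in an awaited call. The sketch below shows a generic fallback around an async step. `firstWorking` and `fakeLoad` are hypothetical names, and `load` stands for whatever call actually forces the model to load in your application.

```typescript
// Hypothetical helper: try each model ID in order, return the first that loads.
async function firstWorking<T>(
  ids: string[],
  load: (id: string) => Promise<T>,
): Promise<{ id: string; value: T }> {
  for (const id of ids) {
    try {
      return { id, value: await load(id) };
    } catch (error) {
      console.warn(`Model ${id} failed to load:`, error);
    }
  }
  throw new Error(`No model variant could be loaded (${ids.length} tried)`);
}

// Stand-in loader for demonstration: pretends the larger model cannot load.
const fakeLoad = async (id: string): Promise<string> => {
  if (id.includes('dinov2-base')) throw new Error('out of memory');
  return `loaded:${id}`;
};

firstWorking(
  ['onnx-community/dinov2-base-ONNX', 'onnx-community/dinov3-vits16-pretrain-lvd1689m-ONNX'],
  fakeLoad,
).then((result) => console.log(result.id));
```

Listing candidates from largest to smallest keeps the preferred model first while still degrading to the lighter variant on constrained devices.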

When to Use DINOv2 Image Features

DINOv2 Image Features models are a strong choice when:

You need image features - both variants are built for the image-features task and cover two size tiers.
Browser compatibility matters - the family runs through Transformers.js on WASM, which covers Chrome, Firefox, Safari, and Edge.
Size flexibility is important - the ~86–347 MB range (FP32) lets you target everything from mobile devices to high-end desktops with the same model family.

Methodology

The model data on this page - sizes, embedding dimensions, and provider availability - was verified against LocalMode's source code (packages/transformers/src/models.ts, packages/transformers/src/implementations/image-feature.ts) and directly against the HuggingFace ONNX model repositories. Model sizes reflect the FP32 model.onnx / model.onnx_data files as published; quantized variants (int8, q4, fp16) are substantially smaller (21–91 MB for these models). DINOv2 architecture details (ViT-B/14, 86M parameters, 768d) were verified against the DINOv2 paper (arXiv:2304.07193) and the facebookresearch/dinov2 MODEL_CARD. DINOv3 details (ViT-S/16, 21M parameters, 384d) were verified against the official Meta DINOv3 paper (arXiv:2508.10104) and the facebook/dinov3-vits16-pretrain-lvd1689m model card. Performance characteristics (speed and quality tiers) are LocalMode's curated assessments based on parameter count and architecture. Always benchmark on your target devices before production deployment.
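For the on-device benchmarking recommended above, a small timing harness is usually enough. The sketch below is one possible shape, assuming `performance.now()` is available (it is in modern browsers and recent Node.js) and that `run` wraps a single feature extraction call in your application. `median` and `benchmark` are hypothetical helpers, not LocalMode APIs.

```typescript
// Median of a list of samples; less sensitive to outliers than the mean.
function median(values: number[]): number {
  if (values.length === 0) throw new Error('No samples');
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical harness: time `run` several times after one untimed warm-up call.
async function benchmark(run: () => Promise<unknown>, iterations = 5): Promise<number> {
  await run(); // warm-up: the first call may include one-time setup cost
  const samples: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await run();
    samples.push(performance.now() - start);
  }
  return median(samples); // milliseconds per call
}
```

Running the same harness with each variant on a representative low-end device gives a direct basis for the size, speed, and quality trade-off.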

Sources

onnx-community/dinov2-base-ONNX - HuggingFace model card (FP32 model size, task, Transformers.js support)
facebook/dinov2-base - HuggingFace model card (86.6M parameters, source model)
facebookresearch/dinov2 MODEL_CARD.md - GitHub (all variant specs: ViT-S/14 21M/384d, ViT-B/14 86M/768d, ViT-L/14 300M/1024d, ViT-g/14 1.1B/1536d)
DINOv2: Learning Robust Visual Features without Supervision - arXiv:2304.07193 (architecture, patch size 14, embedding dimensions)
onnx-community/dinov3-vits16-pretrain-lvd1689m-ONNX - HuggingFace model card (FP32 model size ~86 MB)
facebook/dinov3-vits16-pretrain-lvd1689m - HuggingFace model card (official Meta DINOv3 ViT-S/16, 21M parameters, 384d)
DINOv3 - arXiv:2508.10104 (official Meta DINOv3 paper confirming ViT-S/16 variant)
DINOv3 - Meta AI Research (official Meta publication page)
LocalMode transformers provider source (imageFeatures API, IMAGE_FEATURE_MODELS catalog)
Transformers.js documentation

Frequently Asked Questions