Image Classification, Detection & Segmentation Models in the Browser

ViT for image classification, D-FINE for object detection, and SegFormer for semantic segmentation - all in the browser.

Overview

The Image Classification, Detection & Segmentation family is available through Transformers.js in LocalMode, with model sizes ranging from 5MB to 167MB. This family covers three primary tasks - image classification, object detection, and semantic segmentation - and all models can be used with any application built on the LocalMode SDK.

Running Image Classification, Detection & Segmentation models locally in the browser eliminates API costs, removes network latency, and keeps all user data on-device. After the initial model download, inference needs no network round trip and works offline. Each model variant targets a different trade-off between size, speed, and quality - choose based on your users' device capabilities and your application's requirements.

Architecture and History

This section describes three complementary computer vision models from the family that together cover the core image understanding tasks. Each is optimized for browser deployment - small enough to load quickly, accurate enough for production use.

ViT-Base-Patch16-224 (87MB) classifies images into 1,000 ImageNet categories. It is the vision counterpart of text classification: give it an image and get back ranked labels with confidence scores. Useful for auto-tagging photo libraries, content moderation, and image organization features.
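Ranked labels with confidence scores are usually obtained by applying a softmax to the classifier's raw logits and sorting the result. The sketch below illustrates that step in isolation. The helper name `topK`, the `RankedLabel` shape, and the three example labels are hypothetical and not part of the LocalMode API, which may already return ranked labels directly.

```typescript
// Hypothetical helper (not part of @localmode/core): turn raw class logits
// into ranked labels with softmax confidence scores.
interface RankedLabel {
  label: string;
  score: number;
}

function topK(logits: number[], labels: string[], k: number): RankedLabel[] {
  if (logits.length !== labels.length) {
    throw new Error('logits and labels must have the same length');
  }
  // Subtract the largest logit before exponentiating for numerical stability.
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps
    .map((e, i) => ({ label: labels[i], score: e / sum }))
    .sort((a, b) => b.score - a.score) // highest confidence first
    .slice(0, k);
}

// Made-up logits for three made-up classes; keeps the two most likely labels.
const ranked = topK([2.0, 0.5, 1.0], ['tabby cat', 'golden retriever', 'red fox'], 2);
```

Because the scores come from a softmax over all classes, they sum to 1 across the full label set, so a low top score signals that the model is unsure rather than that the image is empty.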

D-FINE-N-COCO (quantized to ~4.5MB - the smallest model in the entire catalog) performs object detection, identifying and locating multiple objects within an image with bounding boxes. Trained on the COCO dataset (80 object categories including people, vehicles, animals, and household objects), it achieves 42.8 AP on COCO with just 4M parameters. The tiny model size makes it viable for real-time detection in video frames or camera feeds.
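Detection output is typically a list of labelled boxes with scores, and COCO AP is built on intersection-over-union (IoU) between predicted and ground-truth boxes. The following sketch shows both ideas under assumed data shapes. The `Box` and `Detection` interfaces and both function names are illustrative assumptions, not the LocalMode result types.

```typescript
// Assumed shapes for illustration only; the real result type may differ.
interface Box {
  xmin: number;
  ymin: number;
  xmax: number;
  ymax: number;
}

interface Detection {
  label: string;
  score: number;
  box: Box;
}

// Intersection-over-union of two axis-aligned boxes, in the range [0, 1].
function iou(a: Box, b: Box): number {
  const w = Math.max(0, Math.min(a.xmax, b.xmax) - Math.max(a.xmin, b.xmin));
  const h = Math.max(0, Math.min(a.ymax, b.ymax) - Math.max(a.ymin, b.ymin));
  const intersection = w * h;
  const areaA = (a.xmax - a.xmin) * (a.ymax - a.ymin);
  const areaB = (b.xmax - b.xmin) * (b.ymax - b.ymin);
  const union = areaA + areaB - intersection;
  return union > 0 ? intersection / union : 0;
}

// Drop detections whose confidence is below a caller-chosen threshold.
function keepConfident(detections: Detection[], minScore: number): Detection[] {
  return detections.filter((d) => d.score >= minScore);
}
```

For camera feeds, a score threshold is a cheap way to suppress flickering low-confidence boxes; the right value depends on the application and should be tuned on real frames.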

SegFormer-B0 (3.7M parameters, 37.4 mIoU on ADE20K) performs semantic segmentation - classifying every pixel in an image into categories like road, building, sky, person, and vegetation. This enables applications like background removal, scene understanding, and augmented reality overlays. All three models process images entirely client-side through Transformers.js.
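Background removal from a segmentation result amounts to making every pixel outside the class of interest transparent. The sketch below assumes the segmentation output has already been converted to a per-pixel class-ID map at the same resolution as the image; that conversion, the function name `keepOnlyClass`, and the class ID passed in are all assumptions for illustration.

```typescript
// Hypothetical helper: given RGBA pixels and a per-pixel class-ID map of the
// same resolution, return a copy where every pixel NOT in `keepClassId`
// is fully transparent.
function keepOnlyClass(
  rgba: Uint8ClampedArray,
  classMap: Uint8Array,
  keepClassId: number,
): Uint8ClampedArray {
  if (rgba.length !== classMap.length * 4) {
    throw new Error('classMap must have exactly one entry per RGBA pixel');
  }
  const out = new Uint8ClampedArray(rgba); // copy, so the input stays untouched
  for (let i = 0; i < classMap.length; i++) {
    if (classMap[i] !== keepClassId) {
      out[i * 4 + 3] = 0; // alpha channel of pixel i
    }
  }
  return out;
}
```

In a browser, the input would typically come from `ImageData.data` on a canvas, and the result can be written back with `putImageData`. If the model's output resolution differs from the image, the class map needs resizing first.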

Variant Comparison

The following table lists every Image Classification, Detection & Segmentation variant available through LocalMode, across all supported providers. Each model ID is also the name of its HuggingFace repository, where the model card is published.

Model ID | Provider | Size | Speed | Quality | Context | Device
Xenova/vit-base-patch16-224 | Transformers.js | 87MB | Medium | High | - | WASM
Xenova/deit-small-distilled-patch16-224 | Transformers.js | 88MB | Fast | Good | - | WASM
Xenova/resnet-50 | Transformers.js | 98MB | Fast | Good | - | WASM
onnx-community/dfine_n_coco-ONNX | Transformers.js | 5MB | Fast | Good | - | WASM
Xenova/detr-resnet-50 | Transformers.js | 167MB | Medium | High | - | WASM
Xenova/segformer-b0-finetuned-ade-512-512 | Transformers.js | 14MB | Fast | Good | - | WASM

Size Distribution

Size Range | Count
Under 200MB | 6 variants

How to choose a variant: Start with the smallest model that meets your quality requirements. For prototyping and development, use the fastest variant (smallest size, "Fast" speed tier). For production, test your specific use case against 2–3 variants and measure the quality difference against user expectations. In many applications, users cannot distinguish between "Good" and "High" quality tiers - the smaller model saves download time and memory.
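The "smallest model that meets your quality requirements" rule can be expressed as a small lookup over the variant table. The sketch below copies sizes and quality tiers from the table above; the `Variant` type, the task labels, and the function name `smallestVariant` are illustrative and not part of the LocalMode SDK.

```typescript
type Task = 'classification' | 'detection' | 'segmentation';
type Quality = 'Good' | 'High';

interface Variant {
  id: string;
  task: Task;
  sizeMB: number;
  quality: Quality;
}

// Sizes and quality tiers transcribed from the Variant Comparison table.
const VARIANTS: Variant[] = [
  { id: 'Xenova/vit-base-patch16-224', task: 'classification', sizeMB: 87, quality: 'High' },
  { id: 'Xenova/deit-small-distilled-patch16-224', task: 'classification', sizeMB: 88, quality: 'Good' },
  { id: 'Xenova/resnet-50', task: 'classification', sizeMB: 98, quality: 'Good' },
  { id: 'onnx-community/dfine_n_coco-ONNX', task: 'detection', sizeMB: 5, quality: 'Good' },
  { id: 'Xenova/detr-resnet-50', task: 'detection', sizeMB: 167, quality: 'High' },
  { id: 'Xenova/segformer-b0-finetuned-ade-512-512', task: 'segmentation', sizeMB: 14, quality: 'Good' },
];

// Return the smallest variant for a task that meets the minimum quality tier,
// or undefined when no variant qualifies.
function smallestVariant(task: Task, minQuality: Quality): Variant | undefined {
  const rank: Record<Quality, number> = { Good: 0, High: 1 };
  return VARIANTS
    .filter((v) => v.task === task && rank[v.quality] >= rank[minQuality])
    .sort((a, b) => a.sizeMB - b.sizeMB)[0];
}
```

A lookup like this only narrows the candidates; the tiers are coarse, so the final choice should still come from measuring the shortlisted variants on real inputs and target devices.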

Provider-Specific Code Examples

Each task in this family uses its own interface from @localmode/core: ImageClassificationModel, ObjectDetectionModel, and SegmentationModel. Switching between providers requires changing only the import and model ID - no application logic changes.

Transformers.js

Transformers.js runs ONNX-optimized models via ONNX Runtime Web. It uses WebGPU acceleration where available and falls back to WASM otherwise.

import { transformers } from '@localmode/transformers';

// Image classification
const classifier = transformers.imageClassifier('Xenova/vit-base-patch16-224');

// Object detection
const detector = transformers.objectDetector('onnx-community/dfine_n_coco-ONNX');

// Semantic segmentation
const segmenter = transformers.segmenter('Xenova/segformer-b0-finetuned-ade-512-512');
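The "WebGPU where available, WASM otherwise" decision can be illustrated with plain feature detection. This is only a sketch of the general idea, not a description of how Transformers.js or LocalMode selects a backend internally. The `NavigatorLike` type and the `detectBackend` name are hypothetical; `navigator.gpu.requestAdapter()` is the standard WebGPU entry point and resolves to `null` when no suitable adapter exists.

```typescript
type Backend = 'webgpu' | 'wasm';

// Minimal structural type so the function can be exercised outside a browser.
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<unknown> };
}

// Report 'webgpu' only when the API is exposed AND an adapter is actually
// available; any missing API, null adapter, or error means 'wasm'.
async function detectBackend(nav: NavigatorLike): Promise<Backend> {
  if (!nav.gpu) {
    return 'wasm';
  }
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? 'webgpu' : 'wasm';
  } catch {
    return 'wasm';
  }
}

// In a browser this would be called with the global navigator object, e.g.
//   detectBackend(navigator as NavigatorLike)
```

Checking for an adapter, rather than only for the presence of `navigator.gpu`, matters because some browsers expose the API on hardware where no adapter can be created.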

Fallback Pattern

For maximum browser compatibility, wrap model loading in a try/catch: attempt the preferred model first, and fall back to a smaller variant if it fails to load.

import { transformers } from '@localmode/transformers';

// Try the preferred model, fall back to a smaller one on failure
let model;
try {
  model = transformers.imageClassifier('Xenova/vit-base-patch16-224');
} catch (error) {
  console.warn('Primary model failed, using fallback:', error);
  model = transformers.imageClassifier('Xenova/resnet-50');
}
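The try/catch above only catches errors thrown synchronously by the factory call. If model weights are fetched lazily or asynchronously (an assumption; the source does not say when loading happens), a failed download would surface later as a rejected promise. A generic way to handle that case is to try a list of async loaders in order. The helper `firstThatLoads` below is a hypothetical utility, and the loaders passed to it are stand-ins for whatever call actually triggers loading in your application.

```typescript
// Hypothetical utility: run async loaders in order and return the first
// result that resolves. Rejects only if every loader fails.
async function firstThatLoads<T>(loaders: Array<() => Promise<T>>): Promise<T> {
  let lastError: unknown = undefined;
  for (const load of loaders) {
    try {
      return await load();
    } catch (error) {
      lastError = error; // remember the failure and try the next candidate
    }
  }
  throw new Error(
    `All ${loaders.length} candidates failed to load; last error: ${String(lastError)}`,
  );
}

// Usage sketch, where loadModel is a placeholder for your own async loader:
//   const model = await firstThatLoads([
//     () => loadModel('Xenova/vit-base-patch16-224'),
//     () => loadModel('Xenova/resnet-50'),
//   ]);
```

Ordering the list from preferred to smallest keeps the common path fast while still giving constrained devices a working model.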

When to Use Image Classification, Detection & Segmentation

Image Classification, Detection & Segmentation models are a strong choice when:

  • You need browser-side computer vision - This family covers image classification (ViT, DeiT, ResNet), object detection (D-FINE-N, DETR), and semantic segmentation (SegFormer), all via the same @localmode/transformers provider.
  • Browser compatibility matters - Available through one provider (transformers), which covers Chrome, Firefox, Safari, and Edge.
  • Size flexibility is important - The 5MB–167MB range means you can target everything from mobile devices to high-end desktops. D-FINE-N at ~4.5MB is viable even on low-end hardware; DETR at 167MB (or ~43MB quantized) is the detection variant rated "High" in LocalMode's quality tiers.

Methodology

Model availability and provider assignments were verified against LocalMode's source code at packages/transformers/src/models.ts (IMAGE_CLASSIFICATION_MODELS, OBJECT_DETECTION_MODELS, SEGMENTATION_MODELS). ONNX file sizes were verified directly from the HuggingFace repository file trees for each Xenova/onnx-community model. Parameter counts and benchmark scores (ViT ImageNet accuracy, D-FINE-N COCO AP, SegFormer ADE20K mIoU, DETR COCO AP) were verified against original model cards and peer-reviewed papers. Performance tiers (Fast/Medium, Good/High) are LocalMode's curated assessments based on architecture and quantized model size. Always benchmark on your target devices before production deployment.

Sources