What is the best model for image classification in the browser?

Xenova/vit-base-patch16-224 (~88MB) offers the best quality with 1,000 ImageNet categories. For the smallest download, MediaPipe EfficientNet-Lite0 (~19MB) provides good accuracy at a fraction of the size.

How large is the model download for browser image classification?

Models range from ~19MB (MediaPipe EfficientNet-Lite0) to ~98MB (ResNet-50). ViT-base and DeiT-small are both ~88MB.

What browsers support image classification with LocalMode?

Transformers.js models work in all modern browsers via WebGPU or WASM backends (Chrome 80+, Edge 80+, Firefox 75+, Safari 14+). MediaPipe uses WASM + WebGL and does not require WebGPU.

Does browser image classification work offline?

Yes. After the initial one-time model download, image classification runs entirely in the browser with no internet connection required. Images never leave the device.

How does local image classification cost compare to cloud APIs?

Google Cloud Vision costs $1.50 per 1,000 images and AWS Rekognition costs $1.00 per 1,000 images. LocalMode costs $0 after the initial model download of roughly 19-98MB, depending on model choice.

Image Classification in the Browser

Classify images into categories using ViT - identify objects, scenes, and content types in the browser.

What Is Image Classification?

Image classification assigns one or more labels to an image from a predefined set of categories. ViT (Vision Transformer) processes images by dividing them into patches, encoding each patch as a token, and applying transformer attention to classify the entire image. The model recognizes 1,000 ImageNet categories covering animals, objects, scenes, and activities.
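To make the patch step concrete, here is a minimal sketch of the arithmetic behind a model name like vit-base-patch16-224. It assumes the standard ViT layout (a square input, non-overlapping square patches, and one extra classification token); the function name vitTokenCount is hypothetical and not part of LocalMode.

```typescript
// Hypothetical helper: how many patches and tokens a ViT processes for one image.
// Assumes a square image, non-overlapping square patches, and a single [CLS] token.
function vitTokenCount(
  imageSize: number,
  patchSize: number,
): { patches: number; tokens: number } {
  if (imageSize % patchSize !== 0) {
    throw new Error('imageSize must be a multiple of patchSize');
  }
  const perSide = imageSize / patchSize; // patches along one edge
  const patches = perSide * perSide;     // total patches in the grid
  return { patches, tokens: patches + 1 }; // +1 for the classification token
}

// vit-base-patch16-224: 224x224 input, 16x16 patches
const vitBase = vitTokenCount(224, 16); // { patches: 196, tokens: 197 }
```

Each of the 196 patches becomes one token, so the attention layers operate on a short sequence rather than on individual pixels.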

This capability is exposed through the classifyImage() function in @localmode/core. All processing runs entirely in the browser - no server, no API key, no data leaves the device. After the initial model download, image classification works completely offline.

Real-World Applications

Photo library auto-tagging and organization
Content moderation (NSFW detection)
Product image categorization for e-commerce
Quality control in manufacturing (defect detection)
Wildlife camera trap classification
Medical image screening

These use cases all benefit from local, on-device processing: user data stays private, there are no per-request API costs, and the application works without internet after initial setup.

Getting Started

Install the required packages:

npm install @localmode/core @localmode/transformers

Import the core function and provider:

import { classifyImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

The recommended starting model is Xenova/vit-base-patch16-224 - it provides the best balance of quality, speed, and download size for most applications.

Code Example

import { classifyImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.imageClassifier('Xenova/vit-base-patch16-224');

const { predictions } = await classifyImage({
  model,
  image: imageFile, // File, Blob, or URL
});

// predictions: [
//   { label: 'golden retriever', score: 0.92 },
//   { label: 'Labrador retriever', score: 0.05 },
//   { label: 'tennis ball', score: 0.01 },
// ]

This example demonstrates the core workflow: create a model instance from the provider, call the classifyImage() function with your input, and receive structured results. The same pattern works identically across both available providers: Transformers.js and MediaPipe.
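In practice you will often keep only the confident labels before showing them to a user. The sketch below assumes the { label, score } shape shown in the example output above; the helper name topPredictions and its default thresholds are illustrative choices, not part of the LocalMode API. It sorts defensively because the source does not state whether predictions arrive ordered.

```typescript
// Shape taken from the example output above.
interface Prediction {
  label: string;
  score: number;
}

// Hypothetical helper: keep predictions at or above minScore, best first, at most `limit`.
function topPredictions(
  predictions: Prediction[],
  minScore = 0.1,
  limit = 3,
): Prediction[] {
  return predictions
    .filter((p) => p.score >= minScore) // filter() returns a new array, so the input is not mutated
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

// Sample data copied from the example output above.
const sample: Prediction[] = [
  { label: 'golden retriever', score: 0.92 },
  { label: 'Labrador retriever', score: 0.05 },
  { label: 'tennis ball', score: 0.01 },
];

const confident = topPredictions(sample); // only 'golden retriever' passes the 0.1 threshold
```

A threshold around 0.1 is only a starting point; tune it against your own images.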

Available Models

The following models support image classification through LocalMode. Choose based on your target device, acceptable download size, and quality requirements. Sizes shown are for the default ONNX variant or the provider's default file.

Model	Provider	Size	Speed	Quality
Xenova/vit-base-patch16-224	Transformers.js	~88MB	Medium	High
Xenova/deit-small-distilled-patch16-224	Transformers.js	~88MB	Fast	Good
Xenova/resnet-50	Transformers.js	~98MB	Fast	Good
EfficientNet-Lite0 (image_classifier)	MediaPipe	~19MB	Fast	Good

Choosing a model: For most applications, start with the recommended model (Xenova/vit-base-patch16-224). If download size is the primary constraint (e.g., mobile PWA, browser extension), use MediaPipe EfficientNet-Lite0 (~19MB), which is by far the smallest option. If inference speed matters more than top accuracy, DeiT-small (~88MB) and ResNet-50 (~98MB) are both rated Fast. If quality is the priority (e.g., enterprise search, content analysis), stay with ViT-base, the only model in the table rated High.
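The selection logic can be sketched as a small lookup over the table above. The sizes and ratings are copied from that table; the ModelOption type and the pickModel function are hypothetical names for illustration and do not exist in LocalMode.

```typescript
interface ModelOption {
  name: string;
  provider: 'Transformers.js' | 'MediaPipe';
  sizeMB: number; // approximate download size from the table above
  quality: 'High' | 'Good';
}

const MODELS: ModelOption[] = [
  { name: 'Xenova/vit-base-patch16-224', provider: 'Transformers.js', sizeMB: 88, quality: 'High' },
  { name: 'Xenova/deit-small-distilled-patch16-224', provider: 'Transformers.js', sizeMB: 88, quality: 'Good' },
  { name: 'Xenova/resnet-50', provider: 'Transformers.js', sizeMB: 98, quality: 'Good' },
  { name: 'EfficientNet-Lite0', provider: 'MediaPipe', sizeMB: 19, quality: 'Good' },
];

// Hypothetical helper: best-quality model that fits the download budget,
// preferring the smaller download when quality ties.
function pickModel(maxDownloadMB: number): ModelOption | undefined {
  const rank = { High: 2, Good: 1 };
  return MODELS
    .filter((m) => m.sizeMB <= maxDownloadMB)
    .sort((a, b) => rank[b.quality] - rank[a.quality] || a.sizeMB - b.sizeMB)[0];
}

const forMobile = pickModel(25);   // EfficientNet-Lite0
const forDesktop = pickModel(100); // Xenova/vit-base-patch16-224
```

A real application would probably also weigh the Speed column and the device's available memory, which this sketch ignores.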

Cloud vs Local: Cost and Privacy Comparison

Running image classification locally eliminates per-request API costs and keeps all data on-device. Here is how the economics compare:

Service	Cost / Notes
Google Cloud Vision	$1.50 per 1,000 images
AWS Rekognition	$1.00 per 1,000 images
LocalMode image classification	$0 after one-time model download (~19–98MB depending on model)

Google Cloud Vision costs $1.50 per 1,000 images (for units 1,001–5,000,000/month; first 1,000/month free). AWS Rekognition DetectLabels costs $1.00 per 1,000 images for the first million images/month, with volume discounts beyond that. LocalMode image classification costs $0 after the initial model download. Images never leave the device - critical for medical, personal, and sensitive content.

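Because local inference has no per-request fee, there is no real break-even volume: any usage beyond a cloud provider's free tier costs more in the cloud than on-device. At 500 images per day (about 15,000 per month), the list prices above come to roughly $21 per month on Google Cloud Vision (after the free first 1,000) and $15 per month on AWS Rekognition, versus $0 locally. For privacy-sensitive applications (medical records, legal documents, financial data), the cost comparison is secondary - the ability to process data without it ever leaving the device is the primary value.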
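For a rough sense of scale, the list prices quoted above can be turned into a monthly estimate. This is a simplified sketch that assumes a single price tier and an optional free allowance; it ignores volume discounts, and the function name monthlyCloudCostUSD is hypothetical.

```typescript
// Hypothetical estimator: flat per-1,000 pricing with an optional free monthly allowance.
// Ignores volume-discount tiers, so it overestimates at very high volumes.
function monthlyCloudCostUSD(
  imagesPerMonth: number,
  pricePerThousandUSD: number,
  freeImagesPerMonth = 0,
): number {
  const billable = Math.max(0, imagesPerMonth - freeImagesPerMonth);
  return (billable / 1000) * pricePerThousandUSD;
}

const imagesPerMonth = 500 * 30; // 500 images/day for 30 days = 15,000

// Prices from the comparison above.
const visionCost = monthlyCloudCostUSD(imagesPerMonth, 1.5, 1000); // 21
const rekognitionCost = monthlyCloudCostUSD(imagesPerMonth, 1.0);  // 15
const localCost = 0; // no per-request fee after the one-time model download
```

Verify current pricing with each provider before relying on numbers like these.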

Available Providers

Transformers.js - ONNX-optimized models via ONNX Runtime Web. Supports both WebGPU and WASM backends. Broadest model catalog for non-LLM tasks, including ViT, DeiT, and ResNet.
MediaPipe - Google's on-device ML runtime using WASM + WebGL (no WebGPU required). Ships EfficientNet-Lite0 (~19MB) for general ImageNet classification with 1,000 classes. Use mediapipe.imageClassifier() in place of transformers.imageClassifier().
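If you want to know up front whether WebGPU is available on a device, the standard check is whether navigator.gpu exists. The sketch below shows only that feature detection; how LocalMode itself selects a backend is not covered in this guide, and the function name preferredBackend is hypothetical.

```typescript
type Backend = 'webgpu' | 'wasm';

// Hypothetical helper: report 'webgpu' when the WebGPU entry point (navigator.gpu) exists.
// The navigator-like object is a parameter so the function can be exercised outside a browser.
function preferredBackend(
  nav: object | undefined = (globalThis as { navigator?: object }).navigator,
): Backend {
  return nav !== undefined && 'gpu' in nav ? 'webgpu' : 'wasm';
}

const backend = preferredBackend(); // depends on the current environment
```

The presence of navigator.gpu does not guarantee a usable GPU, since navigator.gpu.requestAdapter() can still resolve to null, so treat the result as a hint and keep the WASM path available.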

AbortSignal Support

All classifyImage() calls support cancellation through the standard AbortSignal API:

const controller = new AbortController();

const promise = classifyImage({
  model,
  image: imageFile,
  abortSignal: controller.signal,
});

// Cancel if needed (e.g., user navigates away)
controller.abort();

This is essential for responsive UIs - cancel in-flight operations when the user navigates away, submits a new query, or closes a dialog. The underlying model inference stops immediately, freeing memory and compute resources.
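A common pattern built on this is "latest request wins": starting a new classification aborts the previous one. The wrapper below is a generic sketch that works with any function accepting an AbortSignal; createLatestOnly is a hypothetical helper, not a LocalMode export.

```typescript
// Hypothetical helper: wraps an abortable async task so each new call
// aborts the previous in-flight call before starting.
function createLatestOnly<A, T>(
  task: (arg: A, signal: AbortSignal) => Promise<T>,
): (arg: A) => Promise<T> {
  let current: AbortController | null = null;
  return (arg: A): Promise<T> => {
    current?.abort(); // cancel the previous call, if any
    const controller = new AbortController();
    current = controller;
    return task(arg, controller.signal);
  };
}
```

With the API shown above, the task would be something like (image, abortSignal) => classifyImage({ model, image, abortSignal }). Callers should expect the superseded promise to reject and handle that rejection; the exact error type LocalMode throws on abort is not stated in this guide.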

React Integration

If you are building a React application, @localmode/react provides hooks that manage loading states, error handling, and cancellation automatically:

npm install @localmode/react

import { useClassifyImage } from '@localmode/react';

The hook returns { data, error, isLoading, execute, cancel } - providing everything a UI component needs to display progress, handle errors, and offer cancellation.

Vision Models - model guide
Text Generation - task guide
Text Embeddings - task guide

Methodology

This guide is based on LocalMode's source code (packages/core/src/vision/classify-image.ts, packages/transformers/src/implementations/image-classifier.ts, packages/mediapipe/src/implementations/image-classifier.ts). Function names, hook names, and model IDs were verified directly against the codebase. Model sizes reflect the default quantized ONNX variant (model_quantized.onnx) as listed in each model's HuggingFace file tree; MediaPipe model sizes are from the verified catalog in packages/mediapipe/src/models.ts. Cloud pricing figures were taken directly from the Google Cloud Vision and AWS Rekognition pricing pages at the time of writing and are subject to change - verify current pricing with the provider before making cost decisions.
