Vision

Image captioning, classification, detection, segmentation, and more.

LocalMode provides seven vision functions that run entirely in the browser — no server required. All accept an ImageInput (Blob, ImageData, string URL, or ArrayBuffer) and return structured results with usage metrics.

See it in action

Try Object Detector and Background Remover for working demos of these APIs.

  • captionImage() — Generate natural language descriptions
  • classifyImage() — Classify into pre-trained categories
  • classifyImageZeroShot() — Classify into arbitrary labels (CLIP/SigLIP)
  • detectObjects() — Locate and label objects with bounding boxes
  • segmentImage() — Produce pixel-level masks per region
  • extractImageFeatures() — Extract feature vectors for similarity search
  • imageToImage() — Super-resolution / image transformation

captionImage()

Generate a natural language caption for an image:

import { captionImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.captioner('onnx-community/Florence-2-base-ft');

const { caption, usage, response } = await captionImage({
  model,
  image: imageBlob,
});

console.log(caption); // "a golden retriever playing with a ball in a park"
console.log(`Processed in ${usage.durationMs}ms`);

Pass an abortSignal to cancel a request that runs too long:

const controller = new AbortController();
setTimeout(() => controller.abort(), 10000); // Cancel after 10s

const { caption } = await captionImage({
  model,
  image: imageBlob,
  abortSignal: controller.signal,
});
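AbortSignal.timeout() (available in modern browsers and Node 17.3+) expresses the same ten-second cancellation more compactly:

```typescript
// AbortSignal.timeout() builds a signal that aborts automatically
// after the given number of milliseconds, replacing the manual
// AbortController + setTimeout pairing.
const signal = AbortSignal.timeout(10_000);

// Passed the same way as a controller's signal:
// await captionImage({ model, image: imageBlob, abortSignal: signal });
```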

CaptionImageOptions

CaptionImageResult

classifyImage()

Classify an image into pre-trained categories:

import { classifyImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.imageClassifier('Xenova/vit-base-patch16-224');

const { predictions, usage } = await classifyImage({
  model,
  image: imageBlob,
  topK: 5,
});

predictions.forEach((p) => {
  console.log(`${p.label}: ${(p.score * 100).toFixed(1)}%`);
});
// golden retriever: 92.3%
// Labrador retriever: 4.1%
// ...
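Since predictions is a plain array of { label, score } items, post-filtering is ordinary TypeScript. A sketch with an arbitrary 0.5 cutoff (tune it per model and use case):

```typescript
interface Prediction {
  label: string;
  score: number;
}

// Keep only labels whose score clears a confidence cutoff.
function confidentLabels(predictions: Prediction[], minScore = 0.5): string[] {
  return predictions.filter((p) => p.score >= minScore).map((p) => p.label);
}
```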

ClassifyImageOptions

ClassifyImageResult

ImageClassificationResultItem

classifyImageZeroShot()

Classify an image into arbitrary labels without fine-tuning, using models like CLIP or SigLIP:

import { classifyImageZeroShot } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.zeroShotImageClassifier('Xenova/siglip-base-patch16-224');

const { labels, scores } = await classifyImageZeroShot({
  model,
  image: imageBlob,
  candidateLabels: ['cat', 'dog', 'bird', 'car', 'tree'],
});

console.log(`Top prediction: ${labels[0]} (${(scores[0] * 100).toFixed(1)}%)`);
// Top prediction: dog (87.2%)
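labels and scores come back as parallel arrays (the snippet above reads the top result from index 0), so zipping them gives a single ranked list for display:

```typescript
// Pair each label with its score. The arrays are assumed to be
// index-aligned and already sorted best-first, as in the example above.
function rankLabels(
  labels: string[],
  scores: number[],
): { label: string; score: number }[] {
  return labels.map((label, i) => ({ label, score: scores[i] }));
}
```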

ClassifyImageZeroShotOptions

ClassifyImageZeroShotResult

detectObjects()

Detect and locate objects in an image with bounding boxes:

import { detectObjects } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.objectDetector('onnx-community/dfine_n_coco-ONNX');

const { objects, usage } = await detectObjects({
  model,
  image: imageBlob,
  threshold: 0.7,
});

for (const obj of objects) {
  console.log(`${obj.label} (${(obj.score * 100).toFixed(1)}%)`);
  console.log(`  Box: x=${obj.box.x}, y=${obj.box.y}, ${obj.box.width}x${obj.box.height}`);
}
// person (95.2%)
//   Box: x=120, y=45, 200x380
// dog (88.7%)
//   Box: x=350, y=210, 150x170
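The box fields above are absolute pixels in the source image's coordinate space. To draw them on a display surface of a different size (a responsive canvas, for instance), they need rescaling; a small, hypothetical helper:

```typescript
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Scale a detection box from the source image's pixel space to a
// display surface of a different size.
function scaleBox(box: Box, srcW: number, srcH: number, dstW: number, dstH: number): Box {
  const sx = dstW / srcW;
  const sy = dstH / srcH;
  return {
    x: box.x * sx,
    y: box.y * sy,
    width: box.width * sx,
    height: box.height * sy,
  };
}
```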

DetectObjectsOptions

DetectObjectsResult

DetectedObject

segmentImage()

Segment an image into pixel-level regions:

import { segmentImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.segmenter('briaai/RMBG-1.4');

const { masks, usage } = await segmentImage({
  model,
  image: imageBlob,
});

for (const mask of masks) {
  console.log(`${mask.label}: ${(mask.score * 100).toFixed(1)}%`);
}
// foreground: 97.8%
// background: 96.2%
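Background-removal models such as RMBG-1.4 typically emit a foreground mask. A helper for picking the best mask by label, using only the label/score fields shown above (the full SegmentMask shape may carry pixel data as well):

```typescript
interface SegmentMaskLike {
  label: string;
  score: number;
}

// Return the highest-scoring mask matching a label, or undefined
// if no mask carries that label.
function pickMask<T extends SegmentMaskLike>(masks: T[], label: string): T | undefined {
  return masks
    .filter((m) => m.label === label)
    .sort((a, b) => b.score - a.score)[0];
}
```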

SegmentImageOptions

SegmentImageResult

SegmentMask

extractImageFeatures()

Extract a feature vector from an image for similarity search, clustering, or reverse image lookup:

import { extractImageFeatures, cosineSimilarity } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.imageFeatures('Xenova/siglip-base-patch16-224');

const { features: features1 } = await extractImageFeatures({
  model,
  image: image1,
});

const { features: features2 } = await extractImageFeatures({
  model,
  image: image2,
});

const similarity = cosineSimilarity(features1, features2);
console.log(`Image similarity: ${(similarity * 100).toFixed(1)}%`);
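A brute-force nearest-image lookup can be built directly on cosine similarity. This sketch inlines its own cosine implementation so it is self-contained; the cosineSimilarity export from @localmode/core should behave equivalently for plain numeric vectors:

```typescript
// Plain cosine similarity over number arrays (assumes non-zero vectors).
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Brute-force nearest neighbour over a small in-memory index of
// previously extracted feature vectors.
function mostSimilar(
  query: number[],
  index: { id: string; features: number[] }[],
): string {
  let bestId = '';
  let bestScore = -Infinity;
  for (const entry of index) {
    const s = cosine(query, entry.features);
    if (s > bestScore) {
      bestScore = s;
      bestId = entry.id;
    }
  }
  return bestId;
}
```

For large collections, an approximate index would replace the linear scan, but a loop like this is fine for a few thousand images.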

ExtractImageFeaturesOptions

ExtractImageFeaturesResult

imageToImage()

Transform an image using super-resolution or other image-to-image models. This is an alias for upscaleImage().

import { imageToImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.imageToImage('Xenova/swin2SR-lightweight-x2-64');

const { image, usage } = await imageToImage({
  model,
  image: lowResImage,
  scale: 2,
});

console.log(`Upscaled in ${usage.durationMs}ms`);

UpscaleImageOptions

UpscaleImageResult

Image Input Types

Supported image formats

All vision functions accept ImageInput, which is a union of four types:

  • Blob — File uploads, fetch() responses, canvas exports
  • ImageData — Raw pixel data from <canvas> via getImageData()
  • string — A URL (data URI, object URL, or remote URL)
  • ArrayBuffer — Raw binary image data

// From a file input
const blob: Blob = fileInput.files![0];

// From a canvas
const imageData: ImageData = ctx.getImageData(0, 0, width, height);

// From a URL
const url: string = 'https://example.com/photo.jpg';

// From fetch
const buffer: ArrayBuffer = await fetch(url).then((r) => r.arrayBuffer());
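Purely for illustration (the library normalizes these inputs internally), here is how the four kinds can be distinguished at runtime:

```typescript
// Illustrative discriminator over the ImageInput union; you never need
// this yourself, but it shows how the four input kinds differ at runtime.
function describeInput(input: unknown): 'url' | 'arraybuffer' | 'blob' | 'imagedata' {
  if (typeof input === 'string') return 'url';
  if (input instanceof ArrayBuffer) return 'arraybuffer';
  if (typeof Blob !== 'undefined' && input instanceof Blob) return 'blob';
  // ImageData-like: anything with raw pixel data and dimensions.
  return 'imagedata';
}
```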

Custom Provider

Implement the ImageCaptionModel interface to create a custom captioning provider. The other six vision interfaces (ImageClassificationModel, ZeroShotImageClassificationModel, ObjectDetectionModel, SegmentationModel, ImageFeatureModel, ImageToImageModel) follow the same pattern.

import type { ImageCaptionModel, DoCaptionImageOptions, DoCaptionImageResult } from '@localmode/core';

class MyCustomCaptioner implements ImageCaptionModel {
  readonly modelId = 'custom:my-captioner';
  readonly provider = 'custom';

  async doCaption(options: DoCaptionImageOptions): Promise<DoCaptionImageResult> {
    const { images, maxLength, abortSignal } = options;
    const start = performance.now();

    // Your captioning logic here
    const captions = images.map(() => 'A description of the image');

    return {
      captions,
      usage: { durationMs: performance.now() - start },
    };
  }
  }
}

// Use with core functions
const model = new MyCustomCaptioner();
const { caption } = await captionImage({ model, image: imageBlob });

For recommended models, provider-specific options, and practical recipes, see the Transformers.js provider pages: Image Captioning, Image Classification, Zero-Shot Image, Object Detection, Image Segmentation, Image Features, and Image-to-Image.

Next Steps

Showcase Apps

  • Object Detector — Detect and label objects in images (Demo · Source)
  • Image Captioner — Generate natural language image descriptions (Demo · Source)
  • Background Remover — Segment and remove image backgrounds (Demo · Source)
  • Photo Enhancer — Upscale and enhance photos with image-to-image models (Demo · Source)
  • Duplicate Finder — Extract image features for duplicate detection (Demo · Source)
  • Smart Gallery — Classify and organize photos by content (Demo · Source)
  • Product Search — Visual product classification and search (Demo · Source)
