Vision
Image captioning, classification, detection, segmentation, and more.
LocalMode provides seven vision functions that run entirely in the browser — no server required. All accept an ImageInput (Blob, ImageData, string URL, or ArrayBuffer) and return structured results with usage metrics.
See it in action
Try Object Detector and Background Remover for working demos of these APIs.
| Function | Purpose |
|---|---|
| `captionImage()` | Generate natural language descriptions |
| `classifyImage()` | Classify into pre-trained categories |
| `classifyImageZeroShot()` | Classify into arbitrary labels (CLIP/SigLIP) |
| `detectObjects()` | Locate and label objects with bounding boxes |
| `segmentImage()` | Produce pixel-level masks per region |
| `extractImageFeatures()` | Extract feature vectors for similarity search |
| `imageToImage()` | Super-resolution / image transformation |
captionImage()
Generate a natural language caption for an image:
```ts
import { captionImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.captioner('onnx-community/Florence-2-base-ft');

const { caption, usage, response } = await captionImage({
  model,
  image: imageBlob,
});

console.log(caption); // "a golden retriever playing with a ball in a park"
console.log(`Processed in ${usage.durationMs}ms`);
```

Pass an `AbortSignal` to cancel a long-running request:

```ts
const controller = new AbortController();
setTimeout(() => controller.abort(), 10000); // Cancel after 10s

const { caption } = await captionImage({
  model,
  image: imageBlob,
  abortSignal: controller.signal,
});
```

CaptionImageOptions
CaptionImageResult
classifyImage()
Classify an image into pre-trained categories:
```ts
import { classifyImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.imageClassifier('Xenova/vit-base-patch16-224');

const { predictions, usage } = await classifyImage({
  model,
  image: imageBlob,
  topK: 5,
});

predictions.forEach((p) => {
  console.log(`${p.label}: ${(p.score * 100).toFixed(1)}%`);
});
// golden retriever: 92.3%
// Labrador retriever: 4.1%
// ...
```

ClassifyImageOptions
ClassifyImageResult
ImageClassificationResultItem
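A common follow-up is keeping only predictions that clear a confidence threshold before showing them to users. A minimal sketch — the `label`/`score` fields mirror the `classifyImage()` output above, but the helper itself is illustrative, not part of `@localmode/core`:

```ts
// Keep only predictions whose score clears a confidence threshold.
interface Prediction {
  label: string;
  score: number;
}

function confidentPredictions(predictions: Prediction[], minScore = 0.5): Prediction[] {
  return predictions.filter((p) => p.score >= minScore);
}

const predictions: Prediction[] = [
  { label: 'golden retriever', score: 0.923 },
  { label: 'Labrador retriever', score: 0.041 },
];

console.log(confidentPredictions(predictions).map((p) => p.label));
// [ 'golden retriever' ]
```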
classifyImageZeroShot()
Classify an image into arbitrary labels without fine-tuning, using models like CLIP or SigLIP:
```ts
import { classifyImageZeroShot } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.zeroShotImageClassifier('Xenova/siglip-base-patch16-224');

const { labels, scores } = await classifyImageZeroShot({
  model,
  image: imageBlob,
  candidateLabels: ['cat', 'dog', 'bird', 'car', 'tree'],
});

console.log(`Top prediction: ${labels[0]} (${(scores[0] * 100).toFixed(1)}%)`);
// Top prediction: dog (87.2%)
```

ClassifyImageZeroShotOptions
ClassifyImageZeroShotResult
detectObjects()
Detect and locate objects in an image with bounding boxes:
```ts
import { detectObjects } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.objectDetector('onnx-community/dfine_n_coco-ONNX');

const { objects, usage } = await detectObjects({
  model,
  image: imageBlob,
  threshold: 0.7,
});

for (const obj of objects) {
  console.log(`${obj.label} (${(obj.score * 100).toFixed(1)}%)`);
  console.log(`  Box: x=${obj.box.x}, y=${obj.box.y}, ${obj.box.width}x${obj.box.height}`);
}
// person (95.2%)
//   Box: x=120, y=45, 200x380
// dog (88.7%)
//   Box: x=350, y=210, 150x170
```

DetectObjectsOptions
DetectObjectsResult
DetectedObject
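To draw detected boxes over a preview image, the box coordinates need to be mapped to the display size. A minimal sketch, assuming boxes are in the source image's pixel coordinates and the display preserves aspect ratio; the `Box` shape mirrors the `detectObjects()` output above:

```ts
// Map a bounding box from source-image pixels to a scaled display size.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

function scaleBox(box: Box, srcWidth: number, dstWidth: number): Box {
  const ratio = dstWidth / srcWidth;
  return {
    x: box.x * ratio,
    y: box.y * ratio,
    width: box.width * ratio,
    height: box.height * ratio,
  };
}

// A 640px-wide source rendered into a 320px-wide canvas: halve everything.
console.log(scaleBox({ x: 120, y: 45, width: 200, height: 380 }, 640, 320));
// { x: 60, y: 22.5, width: 100, height: 190 }
```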
segmentImage()
Segment an image into pixel-level regions:
```ts
import { segmentImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.segmenter('briaai/RMBG-1.4');

const { masks, usage } = await segmentImage({
  model,
  image: imageBlob,
});

for (const mask of masks) {
  console.log(`${mask.label}: ${(mask.score * 100).toFixed(1)}%`);
}
// foreground: 97.8%
// background: 96.2%
```

SegmentImageOptions
SegmentImageResult
SegmentMask
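When a single region is needed (for example the foreground for background removal), pick the highest-scoring mask. A minimal sketch — the `label`/`score` fields mirror the `segmentImage()` output above, and the helper is illustrative rather than part of the library:

```ts
// Pick the mask with the highest confidence score.
interface ScoredMask {
  label: string;
  score: number;
}

function bestMask(masks: ScoredMask[]): ScoredMask | undefined {
  return masks.reduce<ScoredMask | undefined>(
    (best, m) => (best === undefined || m.score > best.score ? m : best),
    undefined,
  );
}

console.log(bestMask([
  { label: 'foreground', score: 0.978 },
  { label: 'background', score: 0.962 },
]));
// { label: 'foreground', score: 0.978 }
```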
extractImageFeatures()
Extract a feature vector from an image for similarity search, clustering, or reverse image lookup:
```ts
import { extractImageFeatures, cosineSimilarity } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.imageFeatures('Xenova/siglip-base-patch16-224');

const { features: features1 } = await extractImageFeatures({
  model,
  image: image1,
});
const { features: features2 } = await extractImageFeatures({
  model,
  image: image2,
});

const similarity = cosineSimilarity(features1, features2);
console.log(`Image similarity: ${(similarity * 100).toFixed(1)}%`);
```

ExtractImageFeaturesOptions
ExtractImageFeaturesResult
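Conceptually, `cosineSimilarity` computes the dot product of the two vectors divided by the product of their magnitudes, yielding a value in [-1, 1]. A minimal stand-in for plain number arrays, shown for illustration only (the library export is the one to use in practice):

```ts
// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1].
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have equal length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosine([1, 0], [1, 0])); // 1 (identical direction)
console.log(cosine([1, 0], [0, 1])); // 0 (orthogonal)
```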
imageToImage()
Transform an image using super-resolution or other image-to-image models. This is an alias for upscaleImage().
```ts
import { imageToImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.imageToImage('Xenova/swin2SR-lightweight-x2-64');

const { image, usage } = await imageToImage({
  model,
  image: lowResImage,
  scale: 2,
});

console.log(`Upscaled in ${usage.durationMs}ms`);
```

UpscaleImageOptions
UpscaleImageResult
Image Input Types
Supported image formats
All vision functions accept ImageInput, which is a union of four types:
- `Blob`: file uploads, `fetch()` responses, canvas exports
- `ImageData`: raw pixel data from `<canvas>` via `getImageData()`
- `string`: a URL (data URI, object URL, or remote URL)
- `ArrayBuffer`: raw binary image data
```ts
// From a file input
const blob: Blob = fileInput.files[0];

// From a canvas
const imageData: ImageData = ctx.getImageData(0, 0, width, height);

// From a URL
const url: string = 'https://example.com/photo.jpg';

// From fetch
const buffer: ArrayBuffer = await fetch(url).then((r) => r.arrayBuffer());
```

Custom Provider
Implement the ImageCaptionModel interface to create a custom captioning provider. The other six vision interfaces (ImageClassificationModel, ZeroShotImageClassificationModel, ObjectDetectionModel, SegmentationModel, ImageFeatureModel, ImageToImageModel) follow the same pattern.
```ts
import type { ImageCaptionModel, DoCaptionImageOptions, DoCaptionImageResult } from '@localmode/core';
import { captionImage } from '@localmode/core';

class MyCustomCaptioner implements ImageCaptionModel {
  readonly modelId = 'custom:my-captioner';
  readonly provider = 'custom';

  async doCaption(options: DoCaptionImageOptions): Promise<DoCaptionImageResult> {
    const { images, maxLength, abortSignal } = options;

    // Your captioning logic here
    const captions = images.map(() => 'A description of the image');

    return {
      captions,
      usage: { durationMs: 0 },
    };
  }
}

// Use with core functions
const model = new MyCustomCaptioner();
const { caption } = await captionImage({ model, image: imageBlob });
```

For recommended models, provider-specific options, and practical recipes, see the Transformers.js provider pages: Image Captioning, Image Classification, Zero-Shot Image, Object Detection, Image Segmentation, Image Features, and Image-to-Image.
Next Steps
- Embeddings: generate text embeddings for semantic search.
- Vector Database: store and search image feature vectors.
- Transformers Provider: browse all available vision models.
Showcase Apps
| App | Description | Links |
|---|---|---|
| Object Detector | Detect and label objects in images | Demo · Source |
| Image Captioner | Generate natural language image descriptions | Demo · Source |
| Background Remover | Segment and remove image backgrounds | Demo · Source |
| Photo Enhancer | Upscale and enhance photos with image-to-image models | Demo · Source |
| Duplicate Finder | Extract image features for duplicate detection | Demo · Source |
| Smart Gallery | Classify and organize photos by content | Demo · Source |
| Product Search | Visual product classification and search | Demo · Source |