
Image Captioning

Generate natural language descriptions of images using vision-language models like Florence-2.

For full API reference (captionImage(), options, result types, and custom providers), see the Core Vision guide.

See it in action

Try Image Captioner for a working demo.

| Model | Size | Speed | Use Case |
|---|---|---|---|
| onnx-community/Florence-2-base-ft | ~460MB | ⚡⚡ | Captioning, OCR, detection, and document QA |

File Upload Example

Based on the Image Captioner showcase app:

import { transformers } from '@localmode/transformers';
import { captionImage } from '@localmode/core';

const model = transformers.captioner('onnx-community/Florence-2-base-ft');
const controller = new AbortController();

async function handleImageUpload(file: File) {
  // Read the file as a base64 data URL
  const dataUrl = await new Promise<string>((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result as string);
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(file);
  });

  const { caption } = await captionImage({
    model,
    image: dataUrl,
    abortSignal: controller.signal,
  });

  return caption;
}

Image Input Formats

The image parameter accepts:

  • string — Data URL (data:image/jpeg;base64,...) or regular URL
  • Blob — Image blob from file input or fetch
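If you already have a Blob but want to pass a data URL (or need to normalize inputs in one place), it can be converted without FileReader. The helper below is a sketch, not part of the library API:

```typescript
// Hypothetical helper: convert a Blob to a data URL suitable for the
// `image` parameter of captionImage(). Works in browsers and Node 18+.
async function blobToDataUrl(blob: Blob): Promise<string> {
  const bytes = new Uint8Array(await blob.arrayBuffer());
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  // btoa encodes the binary string as base64
  return `data:${blob.type};base64,${btoa(binary)}`;
}
```

For example, `blobToDataUrl(await (await fetch(url)).blob())` turns a fetched image into a data URL; passing the Blob directly is equally valid.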

Best Practices

Captioning Tips

  1. Use JPEG/PNG/WebP — These formats are well-supported
  2. Resize large images — Smaller images process faster with similar quality
  3. Cache the model — Load once, caption many images
  4. Handle errors — Invalid or corrupted images will throw

Showcase Apps

| App | Description | Links |
|---|---|---|
| Image Captioner | Generate natural language descriptions of images | Demo · Source |
