LocalMode
Transformers

Overview

HuggingFace Transformers.js provider for browser-based ML inference.

@localmode/transformers

HuggingFace Transformers.js provider for LocalMode. Run ML models locally in the browser with WebGPU/WASM acceleration.

Features

  • 🚀 Browser-Native — Run ML models directly in the browser
  • 🔒 Privacy-First — All processing happens locally
  • 📦 Model Caching — Models cached in IndexedDB for instant subsequent loads
  • ⚡ Optimized — Uses quantized models for smaller size and faster inference

Installation

```bash
# pnpm
pnpm install @localmode/transformers @localmode/core

# npm
npm install @localmode/transformers @localmode/core

# yarn
yarn add @localmode/transformers @localmode/core

# bun
bun add @localmode/transformers @localmode/core
```

Quick Start

```ts
import { transformers } from '@localmode/transformers';
import { embed, rerank } from '@localmode/core';

// Text Embeddings
const embeddingModel = transformers.embedding('Xenova/bge-small-en-v1.5');
const { embedding } = await embed({ model: embeddingModel, value: 'Hello world' });

// Reranking for RAG
const rerankerModel = transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2');
const { results } = await rerank({
  model: rerankerModel,
  query: 'What is machine learning?',
  documents: ['ML is a subset of AI...', 'Python is a language...'],
  topK: 5,
});
```

✅ Live Features

All features below are production-ready with implementations available.

Embeddings & Reranking

| Method | Interface | Description |
| --- | --- | --- |
| `transformers.embedding(modelId)` | `EmbeddingModel` | Text embeddings |
| `transformers.reranker(modelId)` | `RerankerModel` | Document reranking |
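
Because `embed()` returns a plain `number[]`, downstream similarity math needs no extra library. A minimal cosine-similarity helper for comparing two embedding vectors (an illustration, not part of the package):

```typescript
// Cosine similarity between two embedding vectors.
// Returns a value in [-1, 1]; 1 means the vectors point the same way.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vector length mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Typical use: embed a query and a set of documents, then rank documents by their cosine similarity to the query vector.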

Classification & NLP

| Feature | Method | Interface | Docs |
| --- | --- | --- | --- |
| Text Classification | `transformers.classifier(modelId)` | `ClassificationModel` | Guide |
| Zero-Shot Classification | `transformers.zeroShot(modelId)` | `ZeroShotClassificationModel` | Guide |
| Named Entity Recognition | `transformers.ner(modelId)` | `NERModel` | Guide |

Translation & Text Processing

| Feature | Method | Interface | Docs |
| --- | --- | --- | --- |
| Translation | `transformers.translator(modelId)` | `TranslationModel` | Guide |
| Summarization | `transformers.summarizer(modelId)` | `SummarizationModel` | Guide |
| Fill-Mask | `transformers.fillMask(modelId)` | `FillMaskModel` | Guide |
| Question Answering | `transformers.questionAnswering(modelId)` | `QuestionAnsweringModel` | Guide |

Audio

| Feature | Method | Interface | Docs |
| --- | --- | --- | --- |
| Speech-to-Text | `transformers.speechToText(modelId)` | `SpeechToTextModel` | Guide |
| Text-to-Speech | `transformers.textToSpeech(modelId)` | `TextToSpeechModel` | Guide |

Vision

| Feature | Method | Interface | Docs |
| --- | --- | --- | --- |
| Image Captioning | `transformers.captioner(modelId)` | `ImageCaptionModel` | Guide |
| Object Detection | `transformers.objectDetector(modelId)` | `ObjectDetectionModel` | Guide |
| Image Segmentation | `transformers.segmenter(modelId)` | `SegmentationModel` | Guide |
| Image Features | `transformers.imageFeatures(modelId)` | `ImageFeatureModel` | Guide |
| Image-to-Image | `transformers.imageToImage(modelId)` | `ImageToImageModel` | Guide |
| Image Classification | `transformers.imageClassifier(modelId)` | `ImageClassificationModel` | Guide |
| Zero-Shot Image Classification | `transformers.zeroShotImageClassifier(modelId)` | `ZeroShotImageClassificationModel` | Guide |
| OCR | `transformers.ocr(modelId)` | `OCRModel` | Guide |
| Document QA | `transformers.documentQA(modelId)` | `DocumentQAModel` | Guide |

Click the Guide links in the tables above for detailed documentation, recommended models, and usage examples for each feature.

Model Options

Configure model loading:

```ts
const model = transformers.embedding('Xenova/bge-small-en-v1.5', {
  quantized: true, // Use quantized model (smaller, faster)
  revision: 'main', // Model revision
  onProgress: (p) => {
    console.log(`Loading: ${(p.progress * 100).toFixed(1)}%`);
  },
});
```

Model Utilities

Manage model loading and caching:

```ts
import { preloadModel, isModelCached, getModelStorageUsage } from '@localmode/transformers';

// Check if model is cached
const cached = await isModelCached('Xenova/bge-small-en-v1.5');

// Preload model with progress
await preloadModel('Xenova/bge-small-en-v1.5', {
  onProgress: (p) => console.log(`${p.progress}% loaded`),
});

// Check storage usage
const usage = await getModelStorageUsage();
```
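
Storage usage is often surfaced to users in a settings panel. Assuming `getModelStorageUsage()` reports a byte count, a small display formatter (illustrative, not part of the package):

```typescript
// Format a byte count (e.g. a storage-usage figure) for display.
function formatBytes(bytes: number): string {
  if (bytes < 1024) return `${bytes} B`;
  const units = ['KB', 'MB', 'GB', 'TB'];
  let value = bytes;
  let unit = -1;
  do {
    value /= 1024;
    unit++;
  } while (value >= 1024 && unit < units.length - 1);
  return `${value.toFixed(1)} ${units[unit]}`;
}
```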

Custom Provider Instances

Use createTransformers() to create a provider instance with custom settings instead of the default singleton:

```ts
import { createTransformers } from '@localmode/transformers';

// Force WebGPU device
const gpuTransformers = createTransformers({
  device: 'webgpu',
  onProgress: (p) => console.log(`Loading: ${p.progress}%`),
});

const model = gpuTransformers.embedding('Xenova/bge-small-en-v1.5');

// Offload inference to a Web Worker
const workerTransformers = createTransformers({
  useWorker: true,
});
```
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `device` | `'webgpu' \| 'wasm' \| 'cpu' \| 'auto'` | `'auto'` | Inference device |
| `quantized` | `boolean` | `false` | Use quantized models |
| `onProgress` | `(progress) => void` | | Model loading progress callback |
| `useWorker` | `boolean` | `false` | Run inference in a Web Worker |

WebGPU Detection

Detect WebGPU availability for optimal device selection:

```ts
import { transformers, isWebGPUAvailable, getOptimalDevice } from '@localmode/transformers';

// Check if WebGPU is available
const webgpuAvailable = await isWebGPUAvailable();

if (webgpuAvailable) {
  console.log('WebGPU available, using GPU acceleration');
} else {
  console.log('Falling back to WASM');
}

// Get optimal device automatically
const device = await getOptimalDevice(); // 'webgpu' or 'wasm'

const model = transformers.embedding('Xenova/bge-small-en-v1.5', {
  device, // Uses WebGPU if available, otherwise WASM
});
```
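
At its core, this kind of check comes down to asking the browser for a GPU adapter. A dependency-free sketch of the selection logic (the `gpu` parameter stands in for `navigator.gpu` so the function is testable; this is not the package's actual implementation):

```typescript
type Device = 'webgpu' | 'wasm';

// Prefer WebGPU when an adapter can be acquired; otherwise fall back to WASM.
async function pickDevice(
  gpu?: { requestAdapter(): Promise<unknown | null> }
): Promise<Device> {
  if (!gpu) return 'wasm'; // API not present (e.g. Firefox)
  try {
    const adapter = await gpu.requestAdapter();
    return adapter ? 'webgpu' : 'wasm'; // adapter may be null on unsupported hardware
  } catch {
    return 'wasm'; // requestAdapter can reject
  }
}
```

In a browser you would call it as `pickDevice(navigator.gpu)`.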

isWebGPUAvailable() vs isWebGPUSupported()

isWebGPUAvailable() from @localmode/transformers is a provider-specific check for this package.

isWebGPUSupported() from @localmode/core is a general capability detection function.

Both are async and check for a GPU adapter. Use the one from whichever package you're working with. See Capabilities for the full feature detection reference.

Browser Compatibility

| Browser | WebGPU | WASM | Notes |
| --- | --- | --- | --- |
| Chrome 113+ | ✅ | ✅ | Best performance with WebGPU |
| Edge 113+ | ✅ | ✅ | Same as Chrome |
| Firefox | ❌ | ✅ | WASM only |
| Safari 18+ | ✅ | ✅ | WebGPU available |
| iOS Safari | ✅ | ✅ | WebGPU available (iOS 26+) |

Best Practices

Model Lifecycle — Singleton Caching

Model creation in @localmode/transformers triggers a download (first load) or cache read (subsequent loads). Always reuse model instances rather than creating new ones on every call:

```ts
import { transformers } from '@localmode/transformers';
import { embed } from '@localmode/core';
import type { EmbeddingModel } from '@localmode/core';

// ✅ CORRECT: Create once, reuse everywhere
let embeddingModel: EmbeddingModel | null = null;

function getEmbeddingModel() {
  if (!embeddingModel) {
    embeddingModel = transformers.embedding('Xenova/bge-small-en-v1.5');
  }
  return embeddingModel;
}

// In your service functions
export async function embedText(text: string) {
  const model = getEmbeddingModel();
  return embed({ model, value: text });
}
```

```ts
// ❌ WRONG: Creating a new instance every call
export async function embedText(text: string) {
  const model = transformers.embedding('Xenova/bge-small-en-v1.5'); // Wasteful!
  return embed({ model, value: text });
}
```

Model creation is lightweight (it returns a lazy proxy), but keeping a single reference avoids redundant setup. This pattern is used across all 21 showcase apps that use @localmode/transformers.
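
The lazy-init pattern above generalizes to any model type. A tiny memoizing factory (an illustrative helper, not part of the package) captures it once and for all:

```typescript
// Wraps a factory so the first call creates the instance and every
// later call returns the cached one. Handy for model singletons.
function lazySingleton<T>(create: () => T): () => T {
  let instance: T | undefined;
  return () => (instance ??= create());
}
```

Usage would look like `const getEmbeddingModel = lazySingleton(() => transformers.embedding('Xenova/bge-small-en-v1.5'));`.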

WebGPU Device Detection

WebGPU provides GPU acceleration for 3-5x faster inference compared to WASM. Use device detection to automatically select the best backend:

```ts
import { transformers, isWebGPUAvailable } from '@localmode/transformers';

// Detect optimal device at app startup
const device = (await isWebGPUAvailable()) ? 'webgpu' : 'wasm';

// Pass device to model creation
const model = transformers.embedding('Xenova/bge-small-en-v1.5', {
  device,
  quantized: true,
});
```

This is especially valuable for compute-heavy tasks like embeddings, reranking, and speech processing. For lightweight tasks (classification, fill-mask), WASM performance is often sufficient.

Abort Error Handling

All @localmode functions support AbortSignal for cancellation. A clean abort pattern involves a custom error class in your service layer and proper handling in your hooks:

Service layer — Create and throw a recognizable abort error:

```ts
// _services/embedding.service.ts
import { embedMany } from '@localmode/core';

export class EmbeddingAbortError extends Error {
  constructor() {
    super('Embedding was cancelled');
    this.name = 'EmbeddingAbortError';
  }
}

export async function generateEmbeddings(
  texts: string[],
  signal?: AbortSignal
) {
  try {
    // getModel() returns the cached model instance (see Model Lifecycle above)
    return await embedMany({ model: getModel(), values: texts, abortSignal: signal });
  } catch (error) {
    if (error instanceof Error && error.name === 'AbortError') {
      throw new EmbeddingAbortError();
    }
    throw error;
  }
}
```

Hook layer — Manage the AbortController lifecycle and distinguish abort from real errors:

```ts
// _hooks/use-embedding.ts
import { useRef } from 'react';
import { EmbeddingAbortError, generateEmbeddings } from '../_services/embedding.service';
// useEmbeddingStore is your app's state store (import path depends on your setup)

export function useEmbedding() {
  const store = useEmbeddingStore();
  const controllerRef = useRef<AbortController | null>(null);

  const generate = async (texts: string[]) => {
    // Cancel any in-flight request
    controllerRef.current?.abort();
    controllerRef.current = new AbortController();

    store.setLoading(true);
    store.clearError();

    try {
      const result = await generateEmbeddings(texts, controllerRef.current.signal);
      store.setResult(result);
    } catch (error) {
      if (error instanceof EmbeddingAbortError) {
        return; // Silently ignore — user cancelled
      }
      store.setError(error instanceof Error ? error.message : 'Unknown error');
    } finally {
      store.setLoading(false);
    }
  };

  const cancel = () => controllerRef.current?.abort();

  return { generate, cancel };
}
```

Always abort the previous request before starting a new one. This prevents race conditions where an old response overwrites a newer one.
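
The abort-previous-then-start logic is framework-independent. A minimal sketch of the same pattern without React (illustrative only):

```typescript
// Keeps at most one operation in flight: starting a new run aborts
// the previous one via its AbortSignal.
function makeLatestOnly() {
  let controller: AbortController | null = null;
  return async function run<T>(
    task: (signal: AbortSignal) => Promise<T>
  ): Promise<T> {
    controller?.abort(); // cancel the previous task, if any
    controller = new AbortController();
    return task(controller.signal);
  };
}
```

Each call site simply does `const run = makeLatestOnly();` once, then wraps every request in `run(signal => ...)`.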

Vision (Image Input) — Experimental

Qwen3.5 ONNX models support vision input via their built-in vision encoder. Images are processed through AutoProcessor and fed to the model alongside text.

```ts
import { streamText } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');

// Qwen3.5 models have supportsVision: true
console.log(model.supportsVision); // true

const result = await streamText({
  model,
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this image.' },
      { type: 'image', data: base64Data, mimeType: 'image/png' },
    ],
  }],
});
```
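
The `data` field above is a base64 string. If you start from raw bytes (for example a fetched `ArrayBuffer`), a small conversion helper (illustrative; uses the browser's `btoa`, which is also a global in modern Node):

```typescript
// Encode raw image bytes as the base64 string used by the image content part.
function bytesToBase64(bytes: Uint8Array): string {
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}
```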

Vision-Capable ONNX Models

ModelSizeContextNotes
Qwen3.5 0.8B~500MB32KBest quality sub-1B multimodal
Qwen3.5 2B~1.5GB32KHigher quality, 4GB+ RAM
Qwen3.5 4B~2.5GB32KBest quality, 8GB+ RAM, WebGPU required

Experimental

Vision support uses Transformers.js v4 (preview release). The API may change in future TJS versions.

For full multimodal API reference including ContentPart types and utilities, see the Core Generation guide.

Performance Tips

  1. Use quantized models - Smaller and faster with minimal quality loss
  2. Preload models - Load during app init for instant inference
  3. Use WebGPU when available - 3-5x faster than WASM
  4. Batch operations - Process multiple inputs together
  5. Cache model instances - Use the singleton pattern above to avoid redundant setup
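
For tip 4, very large input lists can be split into fixed-size batches before calling `embedMany`. A simple chunking helper (illustrative, not part of the package):

```typescript
// Split a list into batches of at most `size` items, preserving order.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
```

You would then embed each batch in turn, which keeps memory usage bounded while still amortizing per-call overhead.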
