Overview

HuggingFace Transformers.js provider for browser-based ML inference.

@localmode/transformers

HuggingFace Transformers.js provider for LocalMode. Run ML models locally in the browser with WebGPU/WASM acceleration.

Features

  • 🚀 Browser-Native — Run ML models directly in the browser
  • 🔒 Privacy-First — All processing happens locally
  • 📦 Model Caching — Models cached in IndexedDB for instant subsequent loads
  • Optimized — Uses quantized models for smaller size and faster inference

Installation

pnpm install @localmode/transformers @localmode/core
npm install @localmode/transformers @localmode/core
yarn add @localmode/transformers @localmode/core
bun add @localmode/transformers @localmode/core

Quick Start

import { transformers } from '@localmode/transformers';
import { embed, rerank } from '@localmode/core';

// Text Embeddings
const embeddingModel = transformers.embedding('Xenova/bge-small-en-v1.5');
const { embedding } = await embed({ model: embeddingModel, value: 'Hello world' });

// Reranking for RAG
const rerankerModel = transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2');
const { results } = await rerank({
  model: rerankerModel,
  query: 'What is machine learning?',
  documents: ['ML is a subset of AI...', 'Python is a language...'],
  topK: 5,
});
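
Once embeddings are available, a common next step is comparing them. The sketch below computes cosine similarity between two vectors. It assumes `embedding` is a numeric array-like (for example a `Float32Array`), which this page does not state, and `cosineSimilarity` is a hypothetical local helper written for illustration, not an export taken from `@localmode/core`.

```typescript
// Hypothetical helper for illustration; not taken from @localmode/core.
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // Zero vectors have no direction; return 0 instead of NaN.
  return denom === 0 ? 0 : dot / denom;
}
```

Values close to 1 indicate vectors pointing in nearly the same direction; values near 0 indicate unrelated directions.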

✅ Live Features

All features below are production-ready with implementations available.

Embeddings & Reranking

| Method | Interface | Description |
| --- | --- | --- |
| transformers.embedding(modelId) | EmbeddingModel | Text embeddings |
| transformers.reranker(modelId) | RerankerModel | Document reranking |

Classification & NLP

| Feature | Method | Interface | Docs |
| --- | --- | --- | --- |
| Text Classification | transformers.classifier(modelId) | ClassificationModel | Guide |
| Zero-Shot Classification | transformers.zeroShot(modelId) | ZeroShotClassificationModel | Guide |
| Named Entity Recognition | transformers.ner(modelId) | NERModel | Guide |

Translation & Text Processing

| Feature | Method | Interface | Docs |
| --- | --- | --- | --- |
| Translation | transformers.translator(modelId) | TranslationModel | Guide |
| Summarization | transformers.summarizer(modelId) | SummarizationModel | Guide |
| Fill-Mask | transformers.fillMask(modelId) | FillMaskModel | Guide |
| Question Answering | transformers.questionAnswering(modelId) | QuestionAnsweringModel | Guide |

Audio

| Feature | Method | Interface | Docs |
| --- | --- | --- | --- |
| Speech-to-Text | transformers.speechToText(modelId) | SpeechToTextModel | Guide |
| Text-to-Speech | transformers.textToSpeech(modelId) | TextToSpeechModel | Guide (phonemizer-backed, 29 voices, English US & GB) |

Vision

| Feature | Method | Interface | Docs |
| --- | --- | --- | --- |
| Image Captioning | transformers.captioner(modelId) | ImageCaptionModel | Guide |
| Object Detection | transformers.objectDetector(modelId) | ObjectDetectionModel | Guide |
| Image Segmentation | transformers.segmenter(modelId) | SegmentationModel | Guide |
| Image Features | transformers.imageFeatures(modelId) | ImageFeatureModel | Guide |
| Image-to-Image | transformers.imageToImage(modelId) | ImageToImageModel | Guide |
| Image Classification | transformers.imageClassifier(modelId) | ImageClassificationModel | Guide |
| Zero-Shot Image Classification | transformers.zeroShotImageClassifier(modelId) | ZeroShotImageClassificationModel | Guide |
| OCR | transformers.ocr(modelId) | OCRModel | Guide |
| Document QA | transformers.documentQA(modelId) | DocumentQAModel | Guide |

Click the Guide links in the tables above for detailed documentation, recommended models, and usage examples for each feature.

Model Options

Configure model loading:

const model = transformers.embedding('Xenova/bge-small-en-v1.5', {
  quantized: true, // Use quantized model (smaller, faster)
  revision: 'main', // Model revision
  onProgress: (p) => {
    console.log(`Loading: ${(p.progress * 100).toFixed(1)}%`);
  },
});

Model Utilities

Manage model loading and caching:

import { preloadModel, isModelCached, getModelStorageUsage } from '@localmode/transformers';

// Check if model is cached
const cached = await isModelCached('Xenova/bge-small-en-v1.5');

// Preload model with progress
await preloadModel('Xenova/bge-small-en-v1.5', {
  onProgress: (p) => console.log(`${p.progress}% loaded`),
});

// Check storage usage
const usage = await getModelStorageUsage();
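
The utilities above can be combined into a "preload only when needed" step at app startup. The sketch below injects the two functions rather than importing them, so it stays self-contained. `ensureModelReady` and the `ModelUtils` interface are hypothetical names, and the signatures are assumed from the usage shown above rather than taken from the package's published types.

```typescript
// Shapes assumed from the usage shown above; not the library's published types.
interface LoadProgress {
  progress: number;
}

interface ModelUtils {
  isModelCached(modelId: string): Promise<boolean>;
  preloadModel(
    modelId: string,
    options?: { onProgress?: (p: LoadProgress) => void },
  ): Promise<unknown>;
}

// Hypothetical helper: preload the model only if it is not already cached.
// Returns true when a preload was performed, false when the cache was hit.
async function ensureModelReady(
  utils: ModelUtils,
  modelId: string,
  onProgress?: (p: LoadProgress) => void,
): Promise<boolean> {
  if (await utils.isModelCached(modelId)) {
    return false;
  }
  await utils.preloadModel(modelId, onProgress ? { onProgress } : undefined);
  return true;
}
```

In an app, `utils` would be `{ isModelCached, preloadModel }` as imported from `@localmode/transformers`; the boolean result can drive whether a download indicator is shown.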

Custom Provider Instances

Use createTransformers() to create a provider instance with custom settings instead of the default singleton:

import { createTransformers } from '@localmode/transformers';

// Force WebGPU device
const gpuTransformers = createTransformers({
  device: 'webgpu',
  onProgress: (p) => console.log(`Loading: ${p.progress}%`),
});

const model = gpuTransformers.embedding('Xenova/bge-small-en-v1.5');

// Offload inference to a Web Worker
const workerTransformers = createTransformers({
  useWorker: true,
});

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| device | 'webgpu' \| 'wasm' \| 'cpu' \| 'auto' | 'auto' | Inference device |
| quantized | boolean | false | Use quantized models |
| onProgress | (progress) => void | | Model loading progress callback |
| useWorker | boolean | false | Run inference in a Web Worker |

WebGPU Detection

Detect WebGPU availability for optimal device selection:

import { isWebGPUAvailable, getOptimalDevice } from '@localmode/transformers';

// Check if WebGPU is available
const webgpuAvailable = await isWebGPUAvailable();

if (webgpuAvailable) {
  console.log('WebGPU available, using GPU acceleration');
} else {
  console.log('Falling back to WASM');
}

// Get optimal device automatically
const device = await getOptimalDevice(); // 'webgpu' or 'wasm'

const model = transformers.embedding('Xenova/bge-small-en-v1.5', {
  device, // Uses WebGPU if available, otherwise WASM
});

isWebGPUAvailable() vs isWebGPUSupported()

isWebGPUAvailable() from @localmode/transformers is a provider-specific check for this package.

isWebGPUSupported() from @localmode/core is a general capability detection function.

Both are async and check for a GPU adapter. Use the one from whichever package you're working with. See Capabilities for the full feature detection reference.
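
Both helpers are described as checking for a GPU adapter. A minimal sketch of what such a check might look like, built on the standard WebGPU call `navigator.gpu.requestAdapter()`, is shown below. `hasWebGPUAdapter` and `pickDevice` are hypothetical names, and this is not the implementation used by either package.

```typescript
// Minimal structural types so the sketch compiles without WebGPU type packages.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Hypothetical check: true only if WebGPU is exposed and an adapter is returned.
async function hasWebGPUAdapter(nav: NavigatorLike | undefined): Promise<boolean> {
  if (!nav?.gpu) return false; // WebGPU API not exposed in this environment
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter != null; // requestAdapter() resolves to null when no adapter is available
  } catch {
    return false; // Treat any failure as "not available"
  }
}

// Hypothetical device picker mirroring the 'webgpu' / 'wasm' values used above.
async function pickDevice(nav: NavigatorLike | undefined): Promise<'webgpu' | 'wasm'> {
  return (await hasWebGPUAdapter(nav)) ? 'webgpu' : 'wasm';
}
```

In a browser, the global `navigator` would be passed as `nav`. Checking for an adapter, and not only for the presence of `navigator.gpu`, matters because a browser can expose the API while no usable GPU is available.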

Browser Compatibility

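| Browser | WebGPU | WASM | Notes |
| --- | --- | --- | --- |
| Chrome 113+ | ✅ | ✅ | Best performance with WebGPU |
| Edge 113+ | ✅ | ✅ | Same as Chrome |
| Firefox | ❌ | ✅ | WASM only |
| Safari 26+ | ✅ | ✅ | WebGPU available |
| iOS Safari | ✅ | ✅ | WebGPU available (iOS 26+) |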

Best Practices

Model Lifecycle — Singleton Caching

Loading a model in @localmode/transformers triggers a download (first load) or a cache read (subsequent loads). Reuse model instances rather than creating new ones on every call:

import { transformers } from '@localmode/transformers';
import { embed } from '@localmode/core';
import type { EmbeddingModel } from '@localmode/core';

// ✅ CORRECT: Create once, reuse everywhere
let embeddingModel: EmbeddingModel | null = null;

function getEmbeddingModel() {
  if (!embeddingModel) {
    embeddingModel = transformers.embedding('Xenova/bge-small-en-v1.5');
  }
  return embeddingModel;
}

// In your service functions
export async function embedText(text: string) {
  const model = getEmbeddingModel();
  return embed({ model, value: text });
}

// ❌ WRONG: Creating a new instance every call
export async function embedText(text: string) {
  const model = transformers.embedding('Xenova/bge-small-en-v1.5'); // Wasteful!
  return embed({ model, value: text });
}

Model creation is lightweight (it returns a lazy proxy), but keeping a single reference avoids redundant setup. This pattern is used across all 21 showcase apps that use @localmode/transformers.
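
When an app uses several models, the singleton pattern above generalizes to a small cache keyed by model ID. A minimal sketch, assuming a hypothetical `createModelCache` helper that wraps any factory function:

```typescript
// Hypothetical helper: returns a getter that creates each model at most once per ID.
function createModelCache<T>(factory: (modelId: string) => T): (modelId: string) => T {
  const cache = new Map<string, T>();
  return (modelId: string): T => {
    if (cache.has(modelId)) {
      return cache.get(modelId) as T;
    }
    const created = factory(modelId);
    cache.set(modelId, created);
    return created;
  };
}
```

With the provider from this page, the factory could be `(id) => transformers.embedding(id)`, so repeated calls with the same ID return the same instance.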

WebGPU Device Detection

WebGPU provides GPU acceleration for 3-5x faster inference compared to WASM. Use device detection to automatically select the best backend:

import { transformers, isWebGPUAvailable } from '@localmode/transformers';

// Detect optimal device at app startup
const device = (await isWebGPUAvailable()) ? 'webgpu' : 'wasm';

// Pass device to model creation
const model = transformers.embedding('Xenova/bge-small-en-v1.5', {
  device,
  quantized: true,
});

This is especially valuable for compute-heavy tasks like embeddings, reranking, and speech processing. For lightweight tasks (classification, fill-mask), WASM performance is often sufficient.

Abort Error Handling

All @localmode functions support AbortSignal for cancellation. A clean abort pattern involves a custom error class in your service layer and proper handling in your hooks:

Service layer — Create and throw a recognizable abort error:

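// _services/embedding.service.ts
import { embedMany } from '@localmode/core';
// getModel() is assumed to be a singleton accessor like getEmbeddingModel() above.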
export class EmbeddingAbortError extends Error {
  constructor() {
    super('Embedding was cancelled');
    this.name = 'EmbeddingAbortError';
  }
}

export async function generateEmbeddings(
  texts: string[],
  signal?: AbortSignal
) {
  try {
    return await embedMany({ model: getModel(), values: texts, abortSignal: signal });
  } catch (error) {
    if (error instanceof Error && error.name === 'AbortError') {
      throw new EmbeddingAbortError();
    }
    throw error;
  }
}

Hook layer — Manage the AbortController lifecycle and distinguish abort from real errors:

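// _hooks/use-embedding.ts
import { useRef } from 'react';
// Import path assumed from the file name used in the service example.
import { generateEmbeddings, EmbeddingAbortError } from '../_services/embedding.service';

// useEmbeddingStore is assumed to be your app's own state store hook.
export function useEmbedding() {
  const store = useEmbeddingStore();
  const controllerRef = useRef<AbortController | null>(null);

  const generate = async (texts: string[]) => {
    // Cancel any in-flight request
    controllerRef.current?.abort();
    const controller = new AbortController();
    controllerRef.current = controller;

    store.setLoading(true);
    store.clearError();

    try {
      const result = await generateEmbeddings(texts, controller.signal);
      store.setResult(result);
    } catch (error) {
      if (error instanceof EmbeddingAbortError) {
        return; // Silently ignore: user cancelled
      }
      store.setError(error instanceof Error ? error.message : 'Unknown error');
    } finally {
      // Only clear the loading flag if no newer request has replaced this one
      if (controllerRef.current === controller) {
        store.setLoading(false);
      }
    }
  };

  const cancel = () => controllerRef.current?.abort();

  return { generate, cancel };
}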

Always abort the previous request before starting a new one. This prevents race conditions where an old response overwrites a newer one.
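
Outside of React, the same abort-previous-then-start pattern can be captured in a small framework-agnostic helper. In the sketch below, `createLatestOnly` is a hypothetical name, and the task is any function that accepts an `AbortSignal`, such as a wrapper around the service function above.

```typescript
// Hypothetical helper: only the most recent call is allowed to deliver a result.
type Task<A, R> = (arg: A, signal: AbortSignal) => Promise<R>;

interface LatestOnly<A, R> {
  // Resolves with the result, or undefined if this call was superseded or cancelled.
  run(arg: A): Promise<R | undefined>;
  cancel(): void;
}

function createLatestOnly<A, R>(task: Task<A, R>): LatestOnly<A, R> {
  let current: AbortController | null = null;

  return {
    async run(arg: A): Promise<R | undefined> {
      current?.abort(); // Cancel any in-flight call first
      const controller = new AbortController();
      current = controller;
      try {
        const result = await task(arg, controller.signal);
        // Drop the result if this call was aborted while the task was running.
        return controller.signal.aborted ? undefined : result;
      } catch (error) {
        if (controller.signal.aborted) return undefined; // Cancelled: not a real error
        throw error;
      } finally {
        if (current === controller) current = null;
      }
    },
    cancel(): void {
      current?.abort();
    },
  };
}
```

Because each call keeps its own controller, a superseded call can never overwrite the result of a newer one, while genuine failures still propagate to the caller.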

Vision (Image Input)

Qwen3.5 ONNX models support vision input via their built-in vision encoder. Images are processed through AutoProcessor and fed to the model alongside text.

import { streamText } from '@localmode/core';
import { transformers } from '@localmode/transformers';

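// Assumed to hold the base64-encoded image bytes, produced elsewhere in your app.
declare const base64Data: string;

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');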

// Qwen3.5 models have supportsVision: true
console.log(model.supportsVision); // true

const result = await streamText({
  model,
  prompt: '',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this image.' },
      { type: 'image', data: base64Data, mimeType: 'image/png' },
    ],
  }],
});
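
The example above uses `base64Data` without showing where it comes from. One way to produce a base64 string from raw image bytes is sketched below. `bytesToBase64` is a hypothetical helper, and whether the API expects bare base64 or a `data:` URL is an assumption to verify against the Core Generation guide.

```typescript
// Hypothetical helper: encode raw bytes as a base64 string (no "data:" prefix).
function bytesToBase64(bytes: Uint8Array): string {
  let binary = '';
  // Build a binary string with one character per byte, as btoa expects.
  bytes.forEach((b) => {
    binary += String.fromCharCode(b);
  });
  return btoa(binary); // btoa is available in browsers and in recent Node.js versions
}
```

In a browser, the bytes could come from a user-selected file, for example `new Uint8Array(await file.arrayBuffer())`.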

Vision-Capable ONNX Models

| Model | Size | Context | Notes |
| --- | --- | --- | --- |
| Qwen3.5 0.8B | ~500MB | 32K | Best quality sub-1B multimodal |
| Qwen3.5 2B | ~1.5GB | 32K | Higher quality, 4GB+ RAM |
| Qwen3.5 4B | ~2.5GB | 32K | Best quality, 8GB+ RAM, WebGPU required |

For full multimodal API reference including ContentPart types and utilities, see the Core Generation guide.

Performance Tips

  1. Use quantized models - Smaller and faster with minimal quality loss
  2. Preload models - Load during app init for instant inference
  3. Use WebGPU when available - 3-5x faster than WASM
  4. Batch operations - Process multiple inputs together
  5. Cache model instances - Use the singleton pattern above to avoid redundant setup
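
Tip 4 can be made concrete with a small batching helper. The sketch below splits inputs into fixed-size batches and runs an injected batch function on each one. `chunk`, `embedInBatches`, and the `EmbedBatch` type are hypothetical; the batch function stands in for a call such as `embedMany`, and a suitable batch size depends on the model and device.

```typescript
// Hypothetical stand-in for a batch call such as embedMany({ model, values }).
type EmbedBatch<V> = (values: string[]) => Promise<V[]>;

// Split an array into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Process inputs batch by batch, preserving input order in the output.
async function embedInBatches<V>(
  texts: readonly string[],
  batchSize: number,
  embedBatch: EmbedBatch<V>,
): Promise<V[]> {
  const out: V[] = [];
  for (const batch of chunk(texts, batchSize)) {
    out.push(...(await embedBatch(batch)));
  }
  return out;
}
```

Running batches sequentially keeps peak memory bounded, which matters in the browser where large inputs compete with the model for memory.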
