Overview
HuggingFace Transformers.js provider for browser-based ML inference.
@localmode/transformers
HuggingFace Transformers.js provider for LocalMode. Run ML models locally in the browser with WebGPU/WASM acceleration.
Features
- 🚀 Browser-Native — Run ML models directly in the browser
- 🔒 Privacy-First — All processing happens locally
- 📦 Model Caching — Models cached in IndexedDB for instant subsequent loads
- ⚡ Optimized — Uses quantized models for smaller size and faster inference
Installation
pnpm install @localmode/transformers @localmode/core
npm install @localmode/transformers @localmode/core
yarn add @localmode/transformers @localmode/core
bun add @localmode/transformers @localmode/core
Quick Start
import { transformers } from '@localmode/transformers';
import { embed, rerank } from '@localmode/core';
// Text Embeddings
const embeddingModel = transformers.embedding('Xenova/bge-small-en-v1.5');
const { embedding } = await embed({ model: embeddingModel, value: 'Hello world' });
// Reranking for RAG
const rerankerModel = transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2');
const { results } = await rerank({
model: rerankerModel,
query: 'What is machine learning?',
documents: ['ML is a subset of AI...', 'Python is a language...'],
topK: 5,
});
✅ Live Features
All features below are production-ready with implementations available.
Embeddings
Generate text embeddings for semantic search and RAG.
Reranking
Improve RAG accuracy with cross-encoder reranking.
Embeddings & Reranking
| Method | Interface | Description |
|---|---|---|
| transformers.embedding(modelId) | EmbeddingModel | Text embeddings |
| transformers.reranker(modelId) | RerankerModel | Document reranking |
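To make the semantic-search use case concrete, here is a minimal sketch of ranking documents by cosine similarity between embedding vectors. It is not part of @localmode/transformers: the helper names `cosineSimilarity` and `rankBySimilarity` are illustrative, and the tiny 3-dimensional vectors stand in for real embeddings returned by `embed()`.

```typescript
// Illustrative helpers (not part of @localmode): compare embedding vectors.
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the topK documents most similar to the query, best match first.
function rankBySimilarity(
  query: ArrayLike<number>,
  docs: ArrayLike<number>[],
  topK: number
): { index: number; score: number }[] {
  return docs
    .map((doc, index) => ({ index, score: cosineSimilarity(query, doc) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy vectors standing in for real embeddings.
const ranked = rankBySimilarity(
  [1, 0, 0],
  [[0, 1, 0], [1, 0.1, 0], [0.5, 0.5, 0]],
  2
);
console.log(ranked.map((r) => r.index)); // indices of the two closest documents
```

A cross-encoder reranker scores each query and document pair directly instead of comparing precomputed vectors, which is why it is typically applied only to the short candidate list that a similarity search like this returns.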
Classification & NLP
| Feature | Method | Interface | Docs |
|---|---|---|---|
| Text Classification | transformers.classifier(modelId) | ClassificationModel | Guide |
| Zero-Shot Classification | transformers.zeroShot(modelId) | ZeroShotClassificationModel | Guide |
| Named Entity Recognition | transformers.ner(modelId) | NERModel | Guide |
Translation & Text Processing
| Feature | Method | Interface | Docs |
|---|---|---|---|
| Translation | transformers.translator(modelId) | TranslationModel | Guide |
| Summarization | transformers.summarizer(modelId) | SummarizationModel | Guide |
| Fill-Mask | transformers.fillMask(modelId) | FillMaskModel | Guide |
| Question Answering | transformers.questionAnswering(modelId) | QuestionAnsweringModel | Guide |
Audio
| Feature | Method | Interface | Docs |
|---|---|---|---|
| Speech-to-Text | transformers.speechToText(modelId) | SpeechToTextModel | Guide |
| Text-to-Speech | transformers.textToSpeech(modelId) | TextToSpeechModel | Guide — phonemizer-backed, 29 voices, English (US & GB) |
Vision
| Feature | Method | Interface | Docs |
|---|---|---|---|
| Image Captioning | transformers.captioner(modelId) | ImageCaptionModel | Guide |
| Object Detection | transformers.objectDetector(modelId) | ObjectDetectionModel | Guide |
| Image Segmentation | transformers.segmenter(modelId) | SegmentationModel | Guide |
| Image Features | transformers.imageFeatures(modelId) | ImageFeatureModel | Guide |
| Image-to-Image | transformers.imageToImage(modelId) | ImageToImageModel | Guide |
| Image Classification | transformers.imageClassifier(modelId) | ImageClassificationModel | Guide |
| Zero-Shot Image Classification | transformers.zeroShotImageClassifier(modelId) | ZeroShotImageClassificationModel | Guide |
| OCR | transformers.ocr(modelId) | OCRModel | Guide |
| Document QA | transformers.documentQA(modelId) | DocumentQAModel | Guide |
Recommended Models
Click the Guide links in the tables above for detailed documentation, recommended models, and usage examples for each feature.
Model Options
Configure model loading:
const model = transformers.embedding('Xenova/bge-small-en-v1.5', {
quantized: true, // Use quantized model (smaller, faster)
revision: 'main', // Model revision
onProgress: (p) => {
console.log(`Loading: ${(p.progress * 100).toFixed(1)}%`);
},
});
Model Utilities
Manage model loading and caching:
import { preloadModel, isModelCached, getModelStorageUsage } from '@localmode/transformers';
// Check if model is cached
const cached = await isModelCached('Xenova/bge-small-en-v1.5');
// Preload model with progress
await preloadModel('Xenova/bge-small-en-v1.5', {
onProgress: (p) => console.log(`${p.progress}% loaded`),
});
// Check storage usage
const usage = await getModelStorageUsage();
Custom Provider Instances
Use createTransformers() to create a provider instance with custom settings instead of the default singleton:
import { createTransformers } from '@localmode/transformers';
// Force WebGPU device
const gpuTransformers = createTransformers({
device: 'webgpu',
onProgress: (p) => console.log(`Loading: ${p.progress}%`),
});
const model = gpuTransformers.embedding('Xenova/bge-small-en-v1.5');
// Offload inference to a Web Worker
const workerTransformers = createTransformers({
useWorker: true,
});
| Option | Type | Default | Description |
|---|---|---|---|
| device | 'webgpu' \| 'wasm' \| 'cpu' \| 'auto' | 'auto' | Inference device |
| quantized | boolean | false | Use quantized models |
| onProgress | (progress) => void | — | Model loading progress callback |
| useWorker | boolean | false | Run inference in a Web Worker |
WebGPU Detection
Detect WebGPU availability for optimal device selection:
import { transformers, isWebGPUAvailable, getOptimalDevice } from '@localmode/transformers';
// Check if WebGPU is available
const webgpuAvailable = await isWebGPUAvailable();
if (webgpuAvailable) {
console.log('WebGPU available, using GPU acceleration');
} else {
console.log('Falling back to WASM');
}
// Get optimal device automatically
const device = await getOptimalDevice(); // 'webgpu' or 'wasm'
const model = transformers.embedding('Xenova/bge-small-en-v1.5', {
device, // Uses WebGPU if available, otherwise WASM
});
isWebGPUAvailable() vs isWebGPUSupported()
isWebGPUAvailable() from @localmode/transformers is a provider-specific check for this package.
isWebGPUSupported() from @localmode/core is a general capability detection function.
Both are async and check for a GPU adapter. Use the one from whichever package you're working with. See Capabilities for the full feature detection reference.
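For readers curious what "checking for a GPU adapter" involves, here is a minimal sketch built on the standard `navigator.gpu.requestAdapter()` browser API. It is not the actual implementation of either function: `detectDevice` and the `NavigatorLike` type are hypothetical names introduced for illustration.

```typescript
// Hypothetical sketch: not the actual implementation of isWebGPUAvailable().
// NavigatorLike models only the part of the browser `navigator` we need,
// which also lets the function be exercised outside a browser.
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<unknown | null> };
}

async function detectDevice(
  nav: NavigatorLike | undefined
): Promise<'webgpu' | 'wasm'> {
  // No WebGPU entry point at all: the browser does not expose the API.
  if (!nav?.gpu) return 'wasm';
  try {
    // requestAdapter() resolves to null when no suitable GPU is found.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? 'webgpu' : 'wasm';
  } catch {
    // Treat any failure during detection as "not available".
    return 'wasm';
  }
}

// In a browser: const device = await detectDevice(navigator);
```

Checking for an adapter rather than just the presence of `navigator.gpu` matters because a browser can expose the API while the machine has no usable GPU.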
Browser Compatibility
| Browser | WebGPU | WASM | Notes |
|---|---|---|---|
| Chrome 113+ | ✅ | ✅ | Best performance with WebGPU |
| Edge 113+ | ✅ | ✅ | Same as Chrome |
| Firefox | ❌ | ✅ | WASM only |
| Safari 26+ | ✅ | ✅ | WebGPU available |
| iOS Safari | ✅ | ✅ | WebGPU available (iOS 26+) |
Best Practices
Model Lifecycle — Singleton Caching
Model creation in @localmode/transformers triggers a download (first load) or cache read (subsequent loads). Always reuse model instances rather than creating new ones on every call:
import { transformers } from '@localmode/transformers';
import { embed } from '@localmode/core';
import type { EmbeddingModel } from '@localmode/core';
// ✅ CORRECT: Create once, reuse everywhere
let embeddingModel: EmbeddingModel | null = null;
function getEmbeddingModel() {
if (!embeddingModel) {
embeddingModel = transformers.embedding('Xenova/bge-small-en-v1.5');
}
return embeddingModel;
}
// In your service functions
export async function embedText(text: string) {
const model = getEmbeddingModel();
return embed({ model, value: text });
}
// ❌ WRONG: Creating a new instance every call
export async function embedText(text: string) {
const model = transformers.embedding('Xenova/bge-small-en-v1.5'); // Wasteful!
return embed({ model, value: text });
}
Model creation is lightweight (it returns a lazy proxy), but keeping a single reference avoids redundant setup. This pattern is used across all 21 showcase apps that use @localmode/transformers.
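When an app uses several models, the single-variable pattern above can be generalized to a small cache keyed by model ID. This is a sketch under assumptions: `createModelCache` is a hypothetical helper, not a @localmode API, and the factory passed in stands in for a call such as `transformers.embedding`.

```typescript
// Hypothetical helper (not a @localmode API): memoize a model factory by ID
// so each model is created once and reused on later calls.
function createModelCache<T>(
  factory: (modelId: string) => T
): (modelId: string) => T {
  const cache = new Map<string, T>();
  return (modelId: string): T => {
    let model = cache.get(modelId);
    if (model === undefined) {
      model = factory(modelId);
      cache.set(modelId, model);
    }
    return model;
  };
}

// Stand-in factory that counts how often it is invoked.
let created = 0;
const getModel = createModelCache((id) => {
  created += 1;
  return { id };
});

const first = getModel('Xenova/bge-small-en-v1.5');
const second = getModel('Xenova/bge-small-en-v1.5');
console.log(first === second, created); // true 1
```

In real code the factory would be something like `(id) => transformers.embedding(id)`, with one cache per model type.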
WebGPU Device Detection
WebGPU provides GPU acceleration for 3-5x faster inference compared to WASM. Use device detection to automatically select the best backend:
import { transformers, isWebGPUAvailable } from '@localmode/transformers';
// Detect optimal device at app startup
const device = (await isWebGPUAvailable()) ? 'webgpu' : 'wasm';
// Pass device to model creation
const model = transformers.embedding('Xenova/bge-small-en-v1.5', {
device,
quantized: true,
});
This is especially valuable for compute-heavy tasks like embeddings, reranking, and speech processing. For lightweight tasks (classification, fill-mask), WASM performance is often sufficient.
Abort Error Handling
All @localmode functions support AbortSignal for cancellation. A clean abort pattern involves a custom error class in your service layer and proper handling in your hooks:
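Service layer: create and throw a recognizable abort error.
// _services/embedding.service.ts
import { embedMany } from '@localmode/core';
// getModel() is assumed to be a singleton accessor like getEmbeddingModel() above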
export class EmbeddingAbortError extends Error {
constructor() {
super('Embedding was cancelled');
this.name = 'EmbeddingAbortError';
}
}
export async function generateEmbeddings(
texts: string[],
signal?: AbortSignal
) {
try {
return await embedMany({ model: getModel(), values: texts, abortSignal: signal });
} catch (error) {
if (error instanceof Error && error.name === 'AbortError') {
throw new EmbeddingAbortError();
}
throw error;
}
}
Hook layer: manage the AbortController lifecycle and distinguish abort from real errors.
// _hooks/use-embedding.ts
import { useRef } from 'react';
export function useEmbedding() {
const store = useEmbeddingStore();
const controllerRef = useRef<AbortController | null>(null);
const generate = async (texts: string[]) => {
// Cancel any in-flight request
controllerRef.current?.abort();
const controller = new AbortController();
controllerRef.current = controller;
store.setLoading(true);
store.clearError();
try {
const result = await generateEmbeddings(texts, controller.signal);
store.setResult(result);
} catch (error) {
if (error instanceof EmbeddingAbortError) {
return; // Silently ignore: user cancelled or request superseded
}
store.setError(error instanceof Error ? error.message : 'Unknown error');
} finally {
// Clear the loading flag only if no newer request has replaced this one
if (controllerRef.current === controller) {
store.setLoading(false);
}
}
};
const cancel = () => controllerRef.current?.abort();
return { generate, cancel };
}
Always abort the previous request before starting a new one. This prevents race conditions where an old response overwrites a newer one.
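The race described above can be reproduced without any ML model. The sketch below uses a hypothetical `fakeInference` function (a timer that rejects with an `AbortError` when its signal fires) and a hypothetical `createLatestOnly` wrapper that aborts the previous call before starting a new one. Neither is part of @localmode.

```typescript
// Hypothetical stand-in for a slow inference call that honors AbortSignal.
function fakeInference(
  input: string,
  delayMs: number,
  signal: AbortSignal
): Promise<string> {
  return new Promise((resolve, reject) => {
    const abortError = () => {
      const err = new Error('The operation was aborted');
      err.name = 'AbortError';
      return err;
    };
    if (signal.aborted) {
      reject(abortError());
      return;
    }
    const timer = setTimeout(() => resolve(`result:${input}`), delayMs);
    signal.addEventListener(
      'abort',
      () => {
        clearTimeout(timer);
        reject(abortError());
      },
      { once: true }
    );
  });
}

// Wrap a task so that starting a new call aborts the one still in flight.
function createLatestOnly<A extends unknown[], R>(
  task: (signal: AbortSignal, ...args: A) => Promise<R>
) {
  let controller: AbortController | null = null;
  return (...args: A): Promise<R> => {
    controller?.abort();
    controller = new AbortController();
    return task(controller.signal, ...args);
  };
}

const run = createLatestOnly(
  (signal: AbortSignal, input: string, delayMs: number) =>
    fakeInference(input, delayMs, signal)
);

// Start a slow request, then immediately supersede it with a fast one.
const outcomes: string[] = [];
const slow = run('old', 50).then(
  (value) => outcomes.push(value),
  (error: Error) => outcomes.push(error.name)
);
const fast = run('new', 5).then((value) => outcomes.push(value));

// Resolves to ['AbortError', 'result:new']: the old request never delivers a result.
const done = Promise.all([slow, fast]).then(() => outcomes);
```

Without the `controller?.abort()` call, the slow request would resolve after the fast one and its stale result would be the last one written.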
Vision (Image Input)
Qwen3.5 ONNX models support vision input via their built-in vision encoder. Images are processed through AutoProcessor and fed to the model alongside text.
import { streamText } from '@localmode/core';
import { transformers } from '@localmode/transformers';
const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');
// Qwen3.5 models have supportsVision: true
console.log(model.supportsVision); // true
const result = await streamText({
model,
prompt: '',
messages: [{
role: 'user',
content: [
{ type: 'text', text: 'Describe this image.' },
{ type: 'image', data: base64Data, mimeType: 'image/png' },
],
}],
});
Vision-Capable ONNX Models
| Model | Size | Context | Notes |
|---|---|---|---|
| Qwen3.5 0.8B | ~500MB | 32K | Best quality sub-1B multimodal |
| Qwen3.5 2B | ~1.5GB | 32K | Higher quality, 4GB+ RAM |
| Qwen3.5 4B | ~2.5GB | 32K | Best quality, 8GB+ RAM, WebGPU required |
For full multimodal API reference including ContentPart types and utilities, see the Core Generation guide.
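The size and RAM notes in the table above can be turned into a simple selection rule. The sketch below is illustrative only: `pickVisionModel` is a hypothetical helper, the thresholds come from the Notes column, and how you obtain the RAM figure (for example `navigator.deviceMemory`, which not every browser provides) is left to the caller.

```typescript
type VisionModelChoice = 'Qwen3.5 0.8B' | 'Qwen3.5 2B' | 'Qwen3.5 4B';

// Hypothetical helper: choose the largest model the device can plausibly run,
// using the RAM and WebGPU notes from the table above.
function pickVisionModel(ramGB: number, hasWebGPU: boolean): VisionModelChoice {
  if (ramGB >= 8 && hasWebGPU) return 'Qwen3.5 4B'; // 8GB+ RAM, WebGPU required
  if (ramGB >= 4) return 'Qwen3.5 2B'; // 4GB+ RAM
  return 'Qwen3.5 0.8B'; // smallest download (~500MB)
}

console.log(pickVisionModel(16, true)); // Qwen3.5 4B
console.log(pickVisionModel(16, false)); // Qwen3.5 2B
console.log(pickVisionModel(2, true)); // Qwen3.5 0.8B
```

The helper returns display names rather than model IDs because only the 0.8B ID (`onnx-community/Qwen3.5-0.8B-ONNX`) appears in this guide.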
Performance Tips
Performance
- Use quantized models - Smaller and faster with minimal quality loss
- Preload models - Load during app init for instant inference
- Use WebGPU when available - 3-5x faster than WASM
- Batch operations - Process multiple inputs together
- Cache model instances - Use the singleton pattern above to avoid redundant setup
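For the batching tip, inputs usually need to be split into fixed-size groups before each group is passed to a batch call such as the `embedMany` call used earlier. A minimal sketch, where `chunk` is an illustrative helper and the batch size of 2 is arbitrary:

```typescript
// Illustrative helper: split an array into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const batches = chunk(['a', 'b', 'c', 'd', 'e'], 2);
console.log(JSON.stringify(batches)); // [["a","b"],["c","d"],["e"]]
```

A suitable batch size depends on the model and device, so treat it as a value to tune rather than a fixed constant.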