# @localmode/transformers

Hugging Face Transformers.js provider for LocalMode. Run ML models locally in the browser with WebGPU/WASM acceleration.
## Features
- 🚀 Browser-Native — Run ML models directly in the browser
- 🔒 Privacy-First — All processing happens locally
- 📦 Model Caching — Models cached in IndexedDB for instant subsequent loads
- ⚡ Optimized — Uses quantized models for smaller size and faster inference
## Installation

```bash
pnpm install @localmode/transformers @localmode/core
```

```bash
npm install @localmode/transformers @localmode/core
```

```bash
yarn add @localmode/transformers @localmode/core
```

```bash
bun add @localmode/transformers @localmode/core
```

## Quick Start
```ts
import { transformers } from '@localmode/transformers';
import { embed, rerank } from '@localmode/core';

// Text embeddings
const embeddingModel = transformers.embedding('Xenova/bge-small-en-v1.5');
const { embedding } = await embed({ model: embeddingModel, value: 'Hello world' });

// Reranking for RAG
const rerankerModel = transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2');
const { results } = await rerank({
  model: rerankerModel,
  query: 'What is machine learning?',
  documents: ['ML is a subset of AI...', 'Python is a language...'],
  topK: 5,
});
```

## ✅ Live Features
All features below are production-ready with implementations available.
### Embeddings

Generate text embeddings for semantic search and RAG.

### Reranking

Improve RAG accuracy with cross-encoder reranking.

### Embeddings & Reranking

| Method | Interface | Description |
|---|---|---|
| transformers.embedding(modelId) | EmbeddingModel | Text embeddings |
| transformers.reranker(modelId) | RerankerModel | Document reranking |
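Once you have embedding vectors, semantic search is mostly vector math. Below is a minimal, library-independent sketch of cosine-similarity scoring; these helpers are illustrative and not part of the @localmode API:

```ts
// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|)
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vector dimensions must match');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank document embeddings by similarity to a query embedding (descending);
// returns the document indices in ranked order
function rankBySimilarity(query: number[], docs: number[][]): number[] {
  return docs
    .map((doc, index) => ({ index, score: cosineSimilarity(query, doc) }))
    .sort((x, y) => y.score - x.score)
    .map((r) => r.index);
}
```

For production RAG pipelines, prefer the reranker above for the final ordering; cosine similarity over embeddings is the cheap first-pass retrieval step.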
### Classification & NLP
| Feature | Method | Interface | Docs |
|---|---|---|---|
| Text Classification | transformers.classifier(modelId) | ClassificationModel | Guide |
| Zero-Shot Classification | transformers.zeroShot(modelId) | ZeroShotClassificationModel | Guide |
| Named Entity Recognition | transformers.ner(modelId) | NERModel | Guide |
### Translation & Text Processing
| Feature | Method | Interface | Docs |
|---|---|---|---|
| Translation | transformers.translator(modelId) | TranslationModel | Guide |
| Summarization | transformers.summarizer(modelId) | SummarizationModel | Guide |
| Fill-Mask | transformers.fillMask(modelId) | FillMaskModel | Guide |
| Question Answering | transformers.questionAnswering(modelId) | QuestionAnsweringModel | Guide |
### Audio
| Feature | Method | Interface | Docs |
|---|---|---|---|
| Speech-to-Text | transformers.speechToText(modelId) | SpeechToTextModel | Guide |
| Text-to-Speech | transformers.textToSpeech(modelId) | TextToSpeechModel | Guide |
### Vision
| Feature | Method | Interface | Docs |
|---|---|---|---|
| Image Captioning | transformers.captioner(modelId) | ImageCaptionModel | Guide |
| Object Detection | transformers.objectDetector(modelId) | ObjectDetectionModel | Guide |
| Image Segmentation | transformers.segmenter(modelId) | SegmentationModel | Guide |
| Image Features | transformers.imageFeatures(modelId) | ImageFeatureModel | Guide |
| Image-to-Image | transformers.imageToImage(modelId) | ImageToImageModel | Guide |
| Image Classification | transformers.imageClassifier(modelId) | ImageClassificationModel | Guide |
| Zero-Shot Image Classification | transformers.zeroShotImageClassifier(modelId) | ZeroShotImageClassificationModel | Guide |
| OCR | transformers.ocr(modelId) | OCRModel | Guide |
| Document QA | transformers.documentQA(modelId) | DocumentQAModel | Guide |
## Recommended Models
Click the Guide links in the tables above for detailed documentation, recommended models, and usage examples for each feature.
## Model Options
Configure model loading:
```ts
const model = transformers.embedding('Xenova/bge-small-en-v1.5', {
  quantized: true, // Use quantized model (smaller, faster)
  revision: 'main', // Model revision
  onProgress: (p) => {
    console.log(`Loading: ${(p.progress * 100).toFixed(1)}%`);
  },
});
```

## Model Utilities
Manage model loading and caching:
```ts
import { preloadModel, isModelCached, getModelStorageUsage } from '@localmode/transformers';

// Check if a model is cached
const cached = await isModelCached('Xenova/bge-small-en-v1.5');

// Preload a model with progress reporting
await preloadModel('Xenova/bge-small-en-v1.5', {
  onProgress: (p) => console.log(`${p.progress}% loaded`),
});

// Check storage usage
const usage = await getModelStorageUsage();
```

## Custom Provider Instances
Use createTransformers() to create a provider instance with custom settings instead of the default singleton:
```ts
import { createTransformers } from '@localmode/transformers';

// Force the WebGPU device
const gpuTransformers = createTransformers({
  device: 'webgpu',
  onProgress: (p) => console.log(`Loading: ${p.progress}%`),
});

const model = gpuTransformers.embedding('Xenova/bge-small-en-v1.5');

// Offload inference to a Web Worker
const workerTransformers = createTransformers({
  useWorker: true,
});
```

| Option | Type | Default | Description |
|---|---|---|---|
| device | 'webgpu' \| 'wasm' \| 'cpu' \| 'auto' | 'auto' | Inference device |
| quantized | boolean | false | Use quantized models |
| onProgress | (progress) => void | — | Model loading progress callback |
| useWorker | boolean | false | Run inference in a Web Worker |
## WebGPU Detection
Detect WebGPU availability for optimal device selection:
```ts
import { isWebGPUAvailable, getOptimalDevice } from '@localmode/transformers';

// Check if WebGPU is available
const webgpuAvailable = await isWebGPUAvailable();
if (webgpuAvailable) {
  console.log('WebGPU available, using GPU acceleration');
} else {
  console.log('Falling back to WASM');
}

// Get the optimal device automatically
const device = await getOptimalDevice(); // 'webgpu' or 'wasm'
const model = transformers.embedding('Xenova/bge-small-en-v1.5', {
  device, // Uses WebGPU if available, otherwise WASM
});
```

### isWebGPUAvailable() vs isWebGPUSupported()
isWebGPUAvailable() from @localmode/transformers is a provider-specific check for this package.
isWebGPUSupported() from @localmode/core is a general capability detection function.
Both are async and check for a GPU adapter. Use the one from whichever package you're working with. See Capabilities for the full feature detection reference.
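If you are curious what such a check boils down to, it is essentially a probe for a usable GPU adapter. Below is a standalone sketch against the standard `navigator.gpu` API; it is illustrative only, and the library's own implementation may differ. The `nav` parameter is injectable so the logic can be exercised outside a browser:

```ts
// Minimal shape of the WebGPU entry point we rely on
type GPULike = { gpu?: { requestAdapter: () => Promise<unknown | null> } };

// Returns true only if the WebGPU API is exposed AND an adapter is obtainable.
async function detectWebGPU(
  nav: GPULike = (globalThis as { navigator?: GPULike }).navigator ?? {}
): Promise<boolean> {
  if (!nav.gpu) return false; // API not exposed (e.g. Firefox, older browsers)
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null; // API present but no usable adapter: fall back to WASM
  } catch {
    return false; // Treat any adapter-request failure as "not available"
  }
}
```

Note that `navigator.gpu` existing is not enough on its own: `requestAdapter()` can resolve to `null` (for example under software rendering restrictions), which is why both checks are async.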
## Browser Compatibility
| Browser | WebGPU | WASM | Notes |
|---|---|---|---|
| Chrome 113+ | ✅ | ✅ | Best performance with WebGPU |
| Edge 113+ | ✅ | ✅ | Same as Chrome |
| Firefox | ❌ | ✅ | WASM only |
| Safari 18+ | ✅ | ✅ | WebGPU available |
| iOS Safari | ✅ | ✅ | WebGPU available (iOS 26+) |
## Best Practices
### Model Lifecycle — Singleton Caching
Model creation in @localmode/transformers triggers a download (first load) or cache read (subsequent loads). Always reuse model instances rather than creating new ones on every call:
```ts
import { transformers } from '@localmode/transformers';
import { embed } from '@localmode/core';
import type { EmbeddingModel } from '@localmode/core';

// ✅ CORRECT: Create once, reuse everywhere
let embeddingModel: EmbeddingModel | null = null;

function getEmbeddingModel() {
  if (!embeddingModel) {
    embeddingModel = transformers.embedding('Xenova/bge-small-en-v1.5');
  }
  return embeddingModel;
}

// In your service functions
export async function embedText(text: string) {
  const model = getEmbeddingModel();
  return embed({ model, value: text });
}
```

```ts
// ❌ WRONG: Creating a new instance on every call
export async function embedText(text: string) {
  const model = transformers.embedding('Xenova/bge-small-en-v1.5'); // Wasteful!
  return embed({ model, value: text });
}
```

Model creation is lightweight (it returns a lazy proxy), but keeping a single reference avoids redundant setup. This pattern is used across all 21 showcase apps that use @localmode/transformers.
### WebGPU Device Detection
WebGPU provides GPU acceleration for 3-5x faster inference compared to WASM. Use device detection to automatically select the best backend:

```ts
import { transformers, isWebGPUAvailable } from '@localmode/transformers';

// Detect the optimal device at app startup
const device = (await isWebGPUAvailable()) ? 'webgpu' : 'wasm';

// Pass the device to model creation
const model = transformers.embedding('Xenova/bge-small-en-v1.5', {
  device,
  quantized: true,
});
```

This is especially valuable for compute-heavy tasks like embeddings, reranking, and speech processing. For lightweight tasks (classification, fill-mask), WASM performance is often sufficient.
### Abort Error Handling
All @localmode functions support AbortSignal for cancellation. A clean abort pattern involves a custom error class in your service layer and proper handling in your hooks:
Service layer — Create and throw a recognizable abort error:
```ts
// _services/embedding.service.ts
import { embedMany } from '@localmode/core';

export class EmbeddingAbortError extends Error {
  constructor() {
    super('Embedding was cancelled');
    this.name = 'EmbeddingAbortError';
  }
}

export async function generateEmbeddings(
  texts: string[],
  signal?: AbortSignal
) {
  try {
    return await embedMany({ model: getModel(), values: texts, abortSignal: signal });
  } catch (error) {
    if (error instanceof Error && error.name === 'AbortError') {
      throw new EmbeddingAbortError();
    }
    throw error;
  }
}
```

Hook layer — Manage the AbortController lifecycle and distinguish abort from real errors:
```ts
// _hooks/use-embedding.ts
export function useEmbedding() {
  const store = useEmbeddingStore();
  const controllerRef = useRef<AbortController | null>(null);

  const generate = async (texts: string[]) => {
    // Cancel any in-flight request
    controllerRef.current?.abort();
    controllerRef.current = new AbortController();

    store.setLoading(true);
    store.clearError();

    try {
      const result = await generateEmbeddings(texts, controllerRef.current.signal);
      store.setResult(result);
    } catch (error) {
      if (error instanceof EmbeddingAbortError) {
        return; // Silently ignore — user cancelled
      }
      store.setError(error instanceof Error ? error.message : 'Unknown error');
    } finally {
      store.setLoading(false);
    }
  };

  const cancel = () => controllerRef.current?.abort();

  return { generate, cancel };
}
```

Always abort the previous request before starting a new one. This prevents race conditions where an old response overwrites a newer one.
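The abort-before-start step is not specific to hooks. Here is a framework-free "latest request wins" wrapper for any async task that accepts an AbortSignal; this is a generic sketch, not part of the library:

```ts
// Wraps an abortable async task so that each new call aborts the previous
// one. Only the most recent call can complete without its signal firing.
function createLatest<TArgs extends unknown[], TResult>(
  task: (signal: AbortSignal, ...args: TArgs) => Promise<TResult>
) {
  let controller: AbortController | null = null;
  return (...args: TArgs): Promise<TResult> => {
    controller?.abort(); // Cancel the in-flight call, if any
    controller = new AbortController();
    return task(controller.signal, ...args);
  };
}
```

Wiring `generateEmbeddings` from the service layer above through a wrapper like this gives the same race-free behavior without a UI framework.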
### Vision (Image Input) — Experimental
Qwen3.5 ONNX models support vision input via their built-in vision encoder. Images are processed through AutoProcessor and fed to the model alongside text.
```ts
import { streamText } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');

// Qwen3.5 models have supportsVision: true
console.log(model.supportsVision); // true

const result = await streamText({
  model,
  prompt: '',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this image.' },
      { type: 'image', data: base64Data, mimeType: 'image/png' },
    ],
  }],
});
```

#### Vision-Capable ONNX Models
| Model | Size | Context | Notes |
|---|---|---|---|
| Qwen3.5 0.8B | ~500MB | 32K | Best quality sub-1B multimodal |
| Qwen3.5 2B | ~1.5GB | 32K | Higher quality, 4GB+ RAM |
| Qwen3.5 4B | ~2.5GB | 32K | Best quality, 8GB+ RAM, WebGPU required |
**Experimental:** Vision support uses Transformers.js v4 (preview release). The API may change in future Transformers.js versions.
For full multimodal API reference including ContentPart types and utilities, see the Core Generation guide.
## Performance Tips

- Use quantized models - Smaller and faster with minimal quality loss
- Preload models - Load during app init for instant inference
- Use WebGPU when available - 3-5x faster than WASM
- Batch operations - Process multiple inputs together
- Cache model instances - Use the singleton pattern above to avoid redundant setup
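For batch operations, a small chunking helper keeps batch sizes bounded when embedding a large corpus. The helper below is generic and illustrative; the batch size of 32 is an arbitrary example value:

```ts
// Split inputs into fixed-size batches so a large corpus doesn't have to be
// processed in a single oversized call.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('size must be >= 1');
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Usage sketch with the embedMany call shown in the abort-handling example:
// for (const batch of chunk(texts, 32)) {
//   const result = await embedMany({ model, values: batch });
// }
```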