Overview
LiteRT provider for browser LLM inference via Google's first-party `.litertlm` runtime. WebGPU with a CPU WASM fallback, a curated catalog of Gemma 4 E2B/E4B and Qwen3 0.6B, all verified end-to-end.
@localmode/litert
Run Google's LiteRT-LM .litertlm models directly in the browser via a WebGPU backend with a CPU WASM fallback. First-party Google inference engine. The curated catalog ships three models — Gemma 4 E2B, Gemma 4 E4B, and Qwen3 0.6B — all verified to load and generate end-to-end in real Chrome.
Early preview
Wraps @litert-lm/core@^0.12.1, the first published JavaScript release of Google's LiteRT-LM runtime. The JS API is text-in / text-out — the Gemma 4 models are multimodal, but the JS API does not yet expose vision or audio input (a future release may add it). APIs and model availability may change as upstream stabilizes. The WASM binaries total roughly 38MB unpacked — lazy-load this provider so the cost is only paid when LiteRT is actually used.
See it in action
Try LLM Chat for a working demo -- LiteRT is selectable as a fourth backend alongside WebLLM, wllama, and ONNX.
Features
- First-party Google runtime -- `.litertlm` models executed by the official LiteRT-LM engine
- Three verified models -- Gemma 4 E2B and Gemma 4 E4B (both WebGPU-only), and Qwen3 0.6B (WebGPU or CPU), all confirmed end-to-end in real Chrome
- WebGPU backend -- Chrome 113+, Edge 113+, Safari 26+, Firefox 141+ (Windows) / 147+ (macOS Apple Silicon)
- CPU WASM backend -- for portable models (Qwen3 0.6B); Gemma 4 builds are GPU-compiled and require WebGPU
- Streaming -- real-time token generation
- AbortSignal cancellation -- stop generation mid-stream
- Lazy load + load deduplication -- WASM binaries only fetched on first use; concurrent loads share a single promise
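The load-deduplication idea in the last bullet can be illustrated with a small sketch. This is not the provider's internal code: `createDedupedLoader` and the stand-in loader are hypothetical names, and the only assumption is that a load is an async function keyed by a model ID.

```typescript
// Hypothetical helper, not part of @localmode/litert.
// Concurrent callers asking for the same key receive the same promise,
// so the underlying load runs once per key.
function createDedupedLoader<T>(
  load: (key: string) => Promise<T>
): (key: string) => Promise<T> {
  const loads = new Map<string, Promise<T>>();
  return (key: string): Promise<T> => {
    const existing = loads.get(key);
    if (existing !== undefined) return existing;
    const promise = load(key).catch((error: unknown) => {
      // Forget failed loads so a later call can retry.
      loads.delete(key);
      throw error;
    });
    loads.set(key, promise);
    return promise;
  };
}

// Stand-in loader; a real app might instead wrap a dynamic
// `import('@localmode/litert')` so the WASM cost is paid on first use.
let loadCalls = 0;
const loadModel = createDedupedLoader(async (id: string) => {
  loadCalls += 1;
  return `loaded:${id}`;
});

const firstLoad = loadModel('gemma-4-E2B');
const secondLoad = loadModel('gemma-4-E2B'); // reuses the first promise
```

Caching the promise rather than the resolved value is what lets two calls made in the same tick share one load.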
Installation
pnpm install @localmode/litert @localmode/core
npm install @localmode/litert @localmode/core
yarn add @localmode/litert @localmode/core
bun add @localmode/litert @localmode/core
Quick Start
import { generateText } from '@localmode/core';
import { litert } from '@localmode/litert';
const { text } = await generateText({
model: litert.languageModel('gemma-4-E2B'),
prompt: 'Hello!',
});
console.log(text);
Streaming
import { streamText } from '@localmode/core';
import { litert } from '@localmode/litert';
const result = await streamText({
model: litert.languageModel('gemma-4-E2B'),
prompt: 'Write a haiku about programming.',
});
let fullText = '';
for await (const chunk of result.stream) {
fullText += chunk.text;
// Update your UI with each chunk
}
Model Selection
Use the curated catalog, a HuggingFace repo:file shorthand, or pass any full .litertlm URL:
import { litert } from '@localmode/litert';
// Curated catalog entry
const model = litert.languageModel('gemma-4-E2B');
// HuggingFace shorthand (repo:filename)
const model2 = litert.languageModel(
'litert-community/Qwen3-0.6B:Qwen3-0.6B.litertlm'
);
// Full URL
const model3 = litert.languageModel(
'https://huggingface.co/litert-community/Qwen3-0.6B/resolve/main/Qwen3-0.6B.litertlm'
);
Recommended picks
- Gemma 4 E2B -- 2.0GB, one of the two models Google officially supports for the JS API. The default recommendation. WebGPU-only.
- Gemma 4 E4B -- 3.0GB, higher quality than E2B. WebGPU-only.
- Qwen3 0.6B -- 614MB, Apache-2.0, the lightweight sub-1GB option. Runs on WebGPU or CPU.
Gemma 4 requires WebGPU
The Gemma 4 *-it-web.litertlm builds are GPU-compiled (their TFLite sections carry a gpu_artisan backend constraint) — they run on WebGPU only and cannot run on the CPU backend. On a browser without WebGPU, loading a Gemma 4 model fails fast with a clear ModelLoadError. Qwen3 0.6B is a portable build that runs on either backend — use it when CPU support matters.
For the full catalog and gated-model loading, see Models.
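Judging from the two Qwen3 examples in this section, the `repo:filename` shorthand corresponds to a HuggingFace `resolve/main` URL. The sketch below shows that mapping only. `resolveModelUrl` is a hypothetical function, it assumes the file lives on the `main` branch, and it does not handle curated catalog IDs such as `gemma-4-E2B`, which the provider resolves itself.

```typescript
// Hypothetical resolver: illustrates the shorthand-to-URL mapping,
// not the provider's actual implementation.
function resolveModelUrl(spec: string): string {
  // Full URLs pass through unchanged.
  if (spec.startsWith('https://') || spec.startsWith('http://')) {
    return spec;
  }
  const separator = spec.indexOf(':');
  const repo = separator === -1 ? '' : spec.slice(0, separator);
  const file = separator === -1 ? '' : spec.slice(separator + 1);
  if (repo === '' || file === '') {
    throw new Error(`Expected "repo:filename" or a full URL, got "${spec}"`);
  }
  // Assumption: the file is on the repo's `main` branch, as in the full-URL example.
  return `https://huggingface.co/${repo}/resolve/main/${file}`;
}

const qwenUrl = resolveModelUrl('litert-community/Qwen3-0.6B:Qwen3-0.6B.litertlm');
```

With this mapping, the shorthand and full-URL forms shown above point at the same file.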
Configuration
Model Options
const model = litert.languageModel('gemma-4-E2B', {
systemPrompt: 'You are a helpful coding assistant.',
temperature: 0.7,
maxTokens: 1024,
topP: 0.9,
});
Custom Provider
import { createLitert } from '@localmode/litert';
const myLitert = createLitert({
onProgress: (progress) => {
console.log(`Loading: ${Math.round(progress.progress ?? 0)}%`);
},
});
const model = myLitert.languageModel('gemma-4-E2B');
Model Preloading
Preload a model during app initialization so the multi-GB download isn't paid on the first generation call:
import { preloadModel, isModelCached } from '@localmode/litert';
if (!(await isModelCached('gemma-4-E2B'))) {
await preloadModel('gemma-4-E2B', {
onProgress: (progress) => {
updateLoadingBar(progress.progress ?? 0);
},
});
}
Model Management
import { deleteModelCache } from '@localmode/litert';
await deleteModelCache('gemma-4-E2B');
Browser Compatibility
Check whether the current browser can run LiteRT before committing to a multi-GB model download:
import { checkLiteRTBrowserCompat } from '@localmode/litert';
const compat = await checkLiteRTBrowserCompat();
if (compat.canRun) {
console.log('Backend:', compat.backend); // 'GPU' | 'CPU'
} else {
console.log('Warnings:', compat.warnings);
console.log('Recommendations:', compat.recommendations);
}
| Browser | WebGPU backend | WASM (CPU) backend |
|---|---|---|
| Chrome 113+ | Yes | Yes |
| Edge 113+ | Yes | Yes |
| Safari 26+ | Yes | Yes |
| Firefox 141+ (Windows) / 147+ (macOS Apple Silicon) | Yes | Yes |
LiteRT-LM uses WebGPU when available. The CPU (WASM) backend works on every WebAssembly-capable browser but is slower — and note it is only usable for portable models. The Gemma 4 builds are GPU-compiled and run on WebGPU only; Qwen3 0.6B runs on either backend.
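One way to act on the backend constraint is to choose the model from the compatibility result before downloading anything. The following is a minimal sketch: `CompatResult` is a local stand-in that mirrors only the fields used in the example above (the real return type may differ), and `pickModelId` is a hypothetical helper.

```typescript
// Local stand-in for the result of checkLiteRTBrowserCompat(); assumed shape.
interface CompatResult {
  canRun: boolean;
  backend?: 'GPU' | 'CPU';
  warnings: string[];
  recommendations: string[];
}

// Hypothetical helper: returns a model spec suited to the detected backend,
// or null when LiteRT cannot run at all.
function pickModelId(compat: CompatResult): string | null {
  if (!compat.canRun) return null;
  // Gemma 4 builds are GPU-compiled, so only choose them on WebGPU.
  if (compat.backend === 'GPU') return 'gemma-4-E2B';
  // Qwen3 0.6B is the portable build that also runs on the CPU backend.
  return 'litert-community/Qwen3-0.6B:Qwen3-0.6B.litertlm';
}

const gpuChoice = pickModelId({ canRun: true, backend: 'GPU', warnings: [], recommendations: [] });
const cpuChoice = pickModelId({ canRun: true, backend: 'CPU', warnings: [], recommendations: [] });
```

In an app, the argument would be the awaited result of `checkLiteRTBrowserCompat()`, and a `null` return is the cue to switch to another provider.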
Provider Fallback Pattern
Use LiteRT as a primary provider with @localmode/wllama as a fallback:
import { litert } from '@localmode/litert';
import { wllama } from '@localmode/wllama';
let model;
try {
model = litert.languageModel('gemma-4-E2B');
} catch (error) {
console.warn('LiteRT unavailable, falling back to wllama:', error);
model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
}
LocalMode LLM Provider Comparison
| Provider | When to use |
|---|---|
| `@localmode/litert` | First-party Google `.litertlm` runtime for Gemma 4; early preview, text-only |
| `@localmode/webllm` | 32 curated MLC models with mature WebGPU kernels; broadest coverage |
| `@localmode/wllama` | Any of the 160,000+ GGUF models on HuggingFace; runs on pure WASM without WebGPU |
| `@localmode/transformers` | ONNX models via Transformers.js; widest task coverage beyond text generation |
| `@localmode/chrome-ai` | Zero-download Gemini Nano via Chrome's built-in Prompt API |
Known Limitations
- Early preview -- `@litert-lm/core` is pinned at `^0.12.1`, the first published JS release. Expect breaking changes upstream.
- Text-only (for now) -- the Gemma 4 models are multimodal (their `.litertlm` files ship vision and audio encoders), but the LiteRT-LM JS API (`@litert-lm/core@0.12.1`) does not yet expose those modalities. Enabling `visionModalityEnabled` / `audioModalityEnabled` throws `Vision/Audio options should not be null`; the JS API has no way to supply the required executor options (verified by direct testing). Multimodal input may arrive in a future `@litert-lm/core` release.
- Gemma 4 is WebGPU-only -- the `*-it-web.litertlm` Gemma 4 builds are GPU-compiled and cannot run on the CPU backend. Only Qwen3 0.6B runs on CPU. On a non-WebGPU browser, loading a Gemma 4 model fails fast with a clear `ModelLoadError`.
- No `stopSequences` -- the runtime uses token IDs, not user-supplied stop strings; use `maxTokens` or rely on the model's natural EOS.
- Estimated token usage -- `usage` token counts are estimated from text length; the runtime does not expose exact tokenizer counts in this release.
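Because the runtime does not accept stop strings, an application that needs them could trim the accumulated text on its own side. The sketch below is one possible workaround, not a feature of the provider: `truncateAtStop` is a hypothetical helper that cuts the text at the earliest occurrence of any stop string.

```typescript
// Hypothetical client-side workaround for the missing `stopSequences` option.
// Cuts `text` at the earliest occurrence of any non-empty stop string.
function truncateAtStop(
  text: string,
  stopStrings: string[]
): { text: string; stopped: boolean } {
  let cut = -1;
  for (const stop of stopStrings) {
    if (stop === '') continue; // an empty stop string would match everywhere
    const index = text.indexOf(stop);
    if (index !== -1 && (cut === -1 || index < cut)) {
      cut = index;
    }
  }
  return cut === -1
    ? { text, stopped: false }
    : { text: text.slice(0, cut), stopped: true };
}

const trimmed = truncateAtStop('Answer: 42\nUser: next question', ['\nUser:']);
```

In a streaming loop, this could be called on the accumulated text after each chunk; once `stopped` is true, the loop can exit and the generation can be cancelled through the `AbortSignal` support listed under Features.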
Error Handling
import { generateText, ModelLoadError, GenerationError } from '@localmode/core';
try {
const { text } = await generateText({ model, prompt: 'Hello' });
} catch (error) {
if (error instanceof ModelLoadError) {
console.error('Failed to load model:', error.hint);
} else if (error instanceof GenerationError) {
console.error('Generation failed:', error.hint);
}
}
Next Steps
Models
The curated catalog, the web-optimized Gemma 4 builds, and loading gated models via custom URL.
Text Generation
API reference for streamText, generateText, and generation options.
wllama Provider
WASM-based alternative covering 160K+ GGUF models.