Overview

LiteRT provider for browser LLM inference via Google's first-party `.litertlm` runtime. It runs on WebGPU with a CPU WASM fallback and ships a curated catalog of Gemma 4 E2B/E4B and Qwen3 0.6B, all verified end-to-end.

@localmode/litert

Run Google's LiteRT-LM .litertlm models directly in the browser on a WebGPU backend with a CPU WASM fallback, using Google's first-party inference engine. The curated catalog ships three models (Gemma 4 E2B, Gemma 4 E4B, and Qwen3 0.6B), all verified to load and generate end-to-end in real Chrome.

Early preview

Wraps @litert-lm/core@^0.12.1, the first published JavaScript release of Google's LiteRT-LM runtime. The JS API is text-in / text-out — the Gemma 4 models are multimodal, but the JS API does not yet expose vision or audio input (a future release may add it). APIs and model availability may change as upstream stabilizes. The WASM binaries total roughly 38MB unpacked — lazy-load this provider so the cost is only paid when LiteRT is actually used.

See it in action

Try LLM Chat for a working demo -- LiteRT is selectable as a fourth backend alongside WebLLM, wllama, and ONNX.

Features

  • First-party Google runtime -- .litertlm models executed by the official LiteRT-LM engine
  • Three verified models -- Gemma 4 E2B, Gemma 4 E4B (WebGPU-only), and Qwen3 0.6B (WebGPU or CPU), all confirmed end-to-end in real Chrome
  • WebGPU GPU backend -- Chrome 113+, Edge 113+, Safari 26+, Firefox 141+ (Windows) / 147+ (macOS Apple Silicon)
  • CPU WASM backend -- for portable models (Qwen3 0.6B); Gemma 4 builds are GPU-compiled and require WebGPU
  • Streaming -- real-time token generation
  • AbortSignal cancellation -- stop generation mid-stream
  • Lazy load + load deduplication -- WASM binaries only fetched on first use; concurrent loads share a single promise
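
The load deduplication in the last bullet can be pictured as a promise cache keyed by model ID. The sketch below is illustrative only: `createDedupedLoader` and `fakeLoad` are hypothetical names, not part of @localmode/litert, and the provider's actual implementation may differ.

```typescript
// Hypothetical helper, not part of @localmode/litert.
// Concurrent calls for the same key share one in-flight promise.
function createDedupedLoader<T>(
  load: (key: string) => Promise<T>,
): (key: string) => Promise<T> {
  const inflight = new Map<string, Promise<T>>();

  return (key: string): Promise<T> => {
    const existing = inflight.get(key);
    if (existing) return existing;

    // Evict failed loads so a later call can retry instead of
    // receiving the cached rejection forever.
    const pending = load(key).catch((error: unknown) => {
      inflight.delete(key);
      throw error;
    });
    inflight.set(key, pending);
    return pending;
  };
}

// Stand-in for an expensive model load (assumption for illustration).
let loadCalls = 0;
const fakeLoad = async (id: string): Promise<string> => {
  loadCalls += 1;
  return `engine:${id}`;
};

const loadModel = createDedupedLoader(fakeLoad);
const first = loadModel('gemma-4-E2B');
const second = loadModel('gemma-4-E2B'); // reuses the in-flight promise
```

Both calls resolve to the same value, and the underlying load runs once. Evicting rejected entries is a design choice in this sketch so that a transient network failure does not poison the cache.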

Installation

pnpm install @localmode/litert @localmode/core
npm install @localmode/litert @localmode/core
yarn add @localmode/litert @localmode/core
bun add @localmode/litert @localmode/core

Quick Start

import { generateText } from '@localmode/core';
import { litert } from '@localmode/litert';

const { text } = await generateText({
  model: litert.languageModel('gemma-4-E2B'),
  prompt: 'Hello!',
});

console.log(text);

Streaming

import { streamText } from '@localmode/core';
import { litert } from '@localmode/litert';

const result = await streamText({
  model: litert.languageModel('gemma-4-E2B'),
  prompt: 'Write a haiku about programming.',
});

let fullText = '';
for await (const chunk of result.stream) {
  fullText += chunk.text;
  // Update your UI with each chunk
}

Model Selection

Pick a model from the curated catalog, use a HuggingFace repo:file shorthand, or pass any full .litertlm URL:

import { litert } from '@localmode/litert';

// Curated catalog entry
const model = litert.languageModel('gemma-4-E2B');

// HuggingFace shorthand (repo:filename)
const model2 = litert.languageModel(
  'litert-community/Qwen3-0.6B:Qwen3-0.6B.litertlm'
);

// Full URL
const model3 = litert.languageModel(
  'https://huggingface.co/litert-community/Qwen3-0.6B/resolve/main/Qwen3-0.6B.litertlm'
);
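
To make the three input forms concrete, here is a minimal sketch of how a spec string could be classified and expanded. `resolveModelSource` is a hypothetical function written for this page, not the provider's real resolver, and it assumes the shorthand always points at the repo's `main` branch, as the full-URL example does.

```typescript
// Hypothetical resolver mirroring the three input forms shown above.
// The real @localmode/litert resolution logic may differ.
type ModelSource =
  | { kind: 'url'; url: string }
  | { kind: 'catalog'; id: string };

function resolveModelSource(spec: string): ModelSource {
  // Full URLs pass through untouched.
  if (/^https?:\/\//.test(spec)) {
    return { kind: 'url', url: spec };
  }

  // "owner/repo:file" shorthand. Assumes the file lives on `main`.
  const match = /^([^/:\s]+\/[^/:\s]+):(.+)$/.exec(spec);
  if (match) {
    const [, repo, file] = match;
    return {
      kind: 'url',
      url: `https://huggingface.co/${repo}/resolve/main/${file}`,
    };
  }

  // Anything else is treated as a curated catalog ID.
  return { kind: 'catalog', id: spec };
}

const resolved = resolveModelSource(
  'litert-community/Qwen3-0.6B:Qwen3-0.6B.litertlm',
);
```

Under these assumptions, the shorthand expands to the same URL as the full-URL example above.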

Recommended picks

  • Gemma 4 E2B -- 2.0GB, one of the two models Google officially supports for the JS API. The default recommendation. WebGPU-only.
  • Gemma 4 E4B -- 3.0GB, higher quality than E2B. WebGPU-only.
  • Qwen3 0.6B -- 614MB, Apache-2.0, the lightweight sub-1GB option. Runs on WebGPU or CPU.

Gemma 4 requires WebGPU

The Gemma 4 *-it-web.litertlm builds are GPU-compiled (their TFLite sections carry a gpu_artisan backend constraint) — they run on WebGPU only and cannot run on the CPU backend. On a browser without WebGPU, loading a Gemma 4 model fails fast with a clear ModelLoadError. Qwen3 0.6B is a portable build that runs on either backend — use it when CPU support matters.

For the full catalog and gated-model loading, see Models.

Configuration

Model Options

const model = litert.languageModel('gemma-4-E2B', {
  systemPrompt: 'You are a helpful coding assistant.',
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
});

Custom Provider

import { createLitert } from '@localmode/litert';

const myLitert = createLitert({
  onProgress: (progress) => {
    console.log(`Loading: ${Math.round(progress.progress ?? 0)}%`);
  },
});

const model = myLitert.languageModel('gemma-4-E2B');

Model Preloading

Preload a model during app initialization so the multi-GB download isn't paid on the first generation call:

import { preloadModel, isModelCached } from '@localmode/litert';

if (!(await isModelCached('gemma-4-E2B'))) {
  await preloadModel('gemma-4-E2B', {
    onProgress: (progress) => {
      updateLoadingBar(progress.progress ?? 0);
    },
  });
}
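
Before starting a multi-GB preload, it can help to check that the browser has room for it. The sketch below compares a model size against the result of the standard `navigator.storage.estimate()` call. `fitsInQuota` and the 10% headroom are hypothetical choices, and the sketch assumes the model cache counts against the origin's storage quota, which this page does not confirm.

```typescript
// Shape returned by navigator.storage.estimate() (both fields optional).
interface StorageEstimateLike {
  quota?: number;
  usage?: number;
}

// Hypothetical helper: true when `modelBytes` plus a safety margin
// fits in the remaining origin quota.
function fitsInQuota(
  modelBytes: number,
  estimate: StorageEstimateLike,
  headroom = 1.1, // 10% margin, an arbitrary choice
): boolean {
  // Unknown quota: don't block the download on missing data.
  if (estimate.quota === undefined) return true;
  const free = estimate.quota - (estimate.usage ?? 0);
  return modelBytes * headroom <= free;
}

const GB = 1e9;
// In a browser: const estimate = await navigator.storage.estimate();
const roomy = fitsInQuota(2.0 * GB, { quota: 10 * GB, usage: 1 * GB });
const tight = fitsInQuota(2.0 * GB, { quota: 2.5 * GB, usage: 1 * GB });
```

With the 2.0GB Gemma 4 E2B size, `roomy` is true and `tight` is false. When the check fails, falling back to the 614MB Qwen3 0.6B model is one option.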

Model Management

import { deleteModelCache } from '@localmode/litert';

await deleteModelCache('gemma-4-E2B');

Browser Compatibility

Check whether the current browser can run LiteRT before committing to a multi-GB model download:

import { checkLiteRTBrowserCompat } from '@localmode/litert';

const compat = await checkLiteRTBrowserCompat();

if (compat.canRun) {
  console.log('Backend:', compat.backend); // 'GPU' | 'CPU'
} else {
  console.log('Warnings:', compat.warnings);
  console.log('Recommendations:', compat.recommendations);
}

| Browser | WebGPU backend | WASM (CPU) backend |
| --- | --- | --- |
| Chrome 113+ | Yes | Yes |
| Edge 113+ | Yes | Yes |
| Safari 26+ | Yes | Yes |
| Firefox 141+ (Win) / 147+ (macOS AS) | Yes | Yes |

LiteRT-LM uses WebGPU when available. The CPU (WASM) backend works on every WebAssembly-capable browser but is slower — and note it is only usable for portable models. The Gemma 4 builds are GPU-compiled and run on WebGPU only; Qwen3 0.6B runs on either backend.
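
The backend constraint can be turned into a small model-selection rule. This is a sketch under assumptions: `CompatLike` only mirrors the `canRun` and `backend` fields used in the compatibility snippet above, and `pickModelFor` is a hypothetical helper, not part of @localmode/litert.

```typescript
// Assumed shape, inferred from the fields used in the compat snippet.
interface CompatLike {
  canRun: boolean;
  backend?: 'GPU' | 'CPU';
}

const GEMMA_E2B = 'gemma-4-E2B';
const QWEN3_SHORTHAND = 'litert-community/Qwen3-0.6B:Qwen3-0.6B.litertlm';

// Returns a model spec to load, or null when LiteRT cannot run at all.
function pickModelFor(compat: CompatLike): string | null {
  if (!compat.canRun) return null;
  // Gemma 4 builds are GPU-compiled; only the portable Qwen3 build
  // runs on the CPU backend.
  return compat.backend === 'GPU' ? GEMMA_E2B : QWEN3_SHORTHAND;
}

const onGpu = pickModelFor({ canRun: true, backend: 'GPU' });
const onCpu = pickModelFor({ canRun: true, backend: 'CPU' });
const unsupported = pickModelFor({ canRun: false });
```

In an app, the argument would be the result of `checkLiteRTBrowserCompat()`, and a `null` result is the cue to use a different provider.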

Provider Fallback Pattern

Use LiteRT as a primary provider with @localmode/wllama as a fallback:

import { litert } from '@localmode/litert';
import { wllama } from '@localmode/wllama';

let model;
try {
  model = litert.languageModel('gemma-4-E2B');
} catch (error) {
  console.warn('LiteRT unavailable, falling back to wllama:', error);
  model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
}
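
Because the provider loads lazily, a load failure may only surface when the model is first used rather than when `languageModel()` is called. That is an assumption worth verifying against your installed version. If it holds, wrapping the generation call itself is the safer pattern. The sketch below uses a hypothetical `withFallback` helper and stand-in functions in place of real `generateText` calls.

```typescript
// Hypothetical helper: runs `primary`, and on any error runs `fallback`.
async function withFallback<T>(
  primary: () => Promise<T>,
  fallback: (error: unknown) => Promise<T>,
): Promise<T> {
  try {
    return await primary();
  } catch (error) {
    return fallback(error);
  }
}

// Stand-ins for generateText calls against each provider
// (illustrative only; no real models are loaded here).
const viaLitert = async (): Promise<string> => {
  throw new Error('WebGPU unavailable');
};
const viaWllama = async (): Promise<string> => 'hello from wllama';

const fallbackResult = withFallback(viaLitert, () => viaWllama());
```

In real code, `primary` would call `generateText` with the LiteRT model and `fallback` would do the same with the wllama model. Checking `error instanceof ModelLoadError` inside the fallback keeps unrelated failures from being masked.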

LocalMode LLM Provider Comparison

| Provider | When to use |
| --- | --- |
| @localmode/litert | First-party Google .litertlm runtime for Gemma 4; early preview, text-only |
| @localmode/webllm | 32 curated MLC models with mature WebGPU kernels; broadest coverage |
| @localmode/wllama | Any of the 160,000+ GGUF models on HuggingFace; runs on pure WASM without WebGPU |
| @localmode/transformers | ONNX models via Transformers.js; widest task coverage beyond text generation |
| @localmode/chrome-ai | Zero-download Gemini Nano via Chrome's built-in Prompt API |

Known Limitations

  • Early preview -- @litert-lm/core is pinned at ^0.12.1, the first published JS release. Expect breaking changes upstream.
  • Text-only (for now) -- the Gemma 4 models are multimodal (their .litertlm files ship vision and audio encoders), but the LiteRT-LM JS API (@litert-lm/core@0.12.1) does not yet expose those modalities. Enabling visionModalityEnabled / audioModalityEnabled throws "Vision/Audio options should not be null" because the JS API has no way to supply the required executor options (verified by direct testing). Multimodal input may arrive in a future @litert-lm/core release.
  • Gemma 4 is WebGPU-only -- the *-it-web.litertlm Gemma 4 builds are GPU-compiled and cannot run on the CPU backend. Only Qwen3 0.6B runs on CPU. On a non-WebGPU browser, loading a Gemma 4 model fails fast with a clear ModelLoadError.
  • No stopSequences -- the runtime uses token IDs, not user-supplied stop strings; use maxTokens or rely on the model's natural EOS.
  • Estimated token usage -- usage token counts are estimated from text length; the runtime does not expose exact tokenizer counts in this release.
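
Since the runtime does not accept stop strings, one workaround is to scan the accumulated stream on the client and cut the text at the first match. The sketch below is an assumption-laden illustration: `applyStopStrings` is a hypothetical helper, and the chunk array stands in for the chunks produced by `streamText`.

```typescript
// Hypothetical helper: truncates `fullText` at the earliest stop string.
function applyStopStrings(
  fullText: string,
  stops: readonly string[],
): { text: string; stopped: boolean } {
  let cut = -1;
  for (const stop of stops) {
    if (stop.length === 0) continue; // ignore empty stop strings
    const idx = fullText.indexOf(stop);
    if (idx !== -1 && (cut === -1 || idx < cut)) cut = idx;
  }
  return cut === -1
    ? { text: fullText, stopped: false }
    : { text: fullText.slice(0, cut), stopped: true };
}

// Simulated stream; note the stop string spans two chunks.
const simulatedChunks = ['Answer: 42', '\nUs', 'er: next question'];
let accumulated = '';
let shown = '';
for (const chunk of simulatedChunks) {
  accumulated += chunk;
  const result = applyStopStrings(accumulated, ['\nUser:']);
  shown = result.text; // re-render the UI from this value
  if (result.stopped) break; // in real code, also abort the generation
}
```

Re-scanning the accumulated text on each chunk handles stop strings that straddle chunk boundaries, at the cost of briefly displaying a partial match. In a real stream, aborting through an AbortSignal after the break avoids paying for tokens that will be discarded.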

Error Handling

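import { generateText, ModelLoadError, GenerationError } from '@localmode/core';
import { litert } from '@localmode/litert';

const model = litert.languageModel('gemma-4-E2B');

try {
  const { text } = await generateText({ model, prompt: 'Hello' });
  console.log(text);
} catch (error) {
  if (error instanceof ModelLoadError) {
    // Load problem, e.g. a Gemma 4 model on a browser without WebGPU
    console.error('Failed to load model:', error.hint);
  } else if (error instanceof GenerationError) {
    console.error('Generation failed:', error.hint);
  } else {
    throw error; // Not a LocalMode error: don't swallow it
  }
}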

Next Steps

Showcase Apps

| App | Description | Links |
| --- | --- | --- |
| LLM Chat | Chat with LiteRT models alongside WebLLM, wllama, and ONNX backends | Demo · Source |
