What is the smallest Llama model available in LocalMode?

TinyLlama-1.1B-Chat via Transformers.js at 350MB is the smallest. Via WebLLM it is 400MB. TinyLlama loads quickly and handles basic text completion, simple Q&A, and template-based generation.

Does Llama work on mobile browsers?

The smaller variants like TinyLlama (350-670MB) and Llama-3.2-1B (380-750MB) can run on higher-end mobile devices. The wllama WASM provider works in all browsers. Larger variants like the 8B models (4.5-4.92GB) are desktop-only in practice.

How many providers support Llama models in LocalMode?

Llama is available through three providers: WebLLM (WebGPU for maximum speed), wllama (WASM for universal browser support including Firefox), and Transformers.js (ONNX-optimized). This ensures every user can run a Llama model regardless of browser.

What is the maximum context window for Llama models in the browser?

The wllama GGUF variants of Llama 3.2 (1B and 3B) and Llama 3.1 8B support up to 131,072 tokens, enough to process long documents without chunking. The WebLLM Llama 3.x variants are limited to 4,096 tokens (TinyLlama to 2,048) because of browser VRAM constraints, and the Transformers.js ONNX variants of Llama 3.2 expose 8,192 tokens.

Can I run Llama models offline?

Yes. All Llama variants work offline after the initial download, which is cached in IndexedDB via LocalMode's model caching system. No API key or network connection is needed for inference.

Llama Models in the Browser

Meta's Llama family from 1B to 8B parameters, the most widely adopted open-weight LLM series for browser inference.

Overview

The Llama family is available in LocalMode through three providers: WebLLM (WebGPU), wllama (WASM), and Transformers.js. Model sizes range from 350MB to 4.92GB. These models are primarily used for text generation and work with any application built on the LocalMode SDK.

Running Llama models locally in the browser eliminates API costs, removes network round-trips, and keeps all user data on-device. After the initial model download, inference runs entirely on the user's device and works offline. Each model variant targets a different trade-off between size, speed, and quality, so choose based on your users' device capabilities and your application's requirements.

Architecture and History

Meta's Llama is the model family that catalyzed the open-weight revolution. Llama 3.2 brought efficient small models (1B and 3B) purpose-built for edge deployment, while Llama 3.1 8B remains one of the highest-quality open models available. For browser inference, Llama offers an excellent quality-to-size ratio - the 3B variant at 1.76GB delivers surprisingly coherent responses for general chat, summarization, and light reasoning tasks.

The Llama architecture is unusually well supported across inference engines. WebLLM compiles Llama to WebGPU shaders for maximum throughput (the 1B model achieves 50-90 tokens/second on modern GPUs). wllama runs Llama via llama.cpp compiled to WASM, enabling inference in any browser, including Firefox configurations where WebGPU is unavailable. Transformers.js v4 offers ONNX-optimized Llama for environments where neither WebGPU nor heavy WASM is ideal.

A key advantage of Llama models in LocalMode is the large context window of the GGUF builds: the wllama Llama 3.2 1B and 3B and Llama 3.1 8B variants support up to 131,072 tokens, making them suitable for processing long documents without chunking. The WebLLM Llama 3.x variants use 4,096-token contexts (TinyLlama uses 2,048), which is sufficient for most chat applications.
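To make the context-window difference concrete, the following sketch estimates whether a prompt fits a given variant. It is an illustration only: `estimateTokens` and `fitsInContext` are hypothetical helpers, not part of LocalMode, and the four-characters-per-token ratio is a rough assumption for English text rather than the Llama tokenizer's real behavior.

```typescript
// Context windows copied from the variant table in this guide.
const CONTEXT_TOKENS = {
  'Llama-3.2-1B-Instruct-q4f16_1-MLC': 4096, // WebLLM build
  'Llama-3.2-1B-Instruct-Q4_K_M': 131072,    // wllama build (128 * 1024)
} as const;

type VariantId = keyof typeof CONTEXT_TOKENS;

// Assumption: roughly 4 characters per token. A real tokenizer will differ.
function estimateTokens(text: string, charsPerToken = 4): number {
  return Math.ceil(text.length / charsPerToken);
}

// True when the prompt plus the reserved output budget fits in the window.
function fitsInContext(modelId: VariantId, prompt: string, maxTokens: number): boolean {
  return estimateTokens(prompt) + maxTokens <= CONTEXT_TOKENS[modelId];
}
```

Under these assumptions, a 40,000-character document (about 10,000 estimated tokens) would need chunking for the WebLLM build but fits the wllama build as a single prompt.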

Variant Comparison

The following table lists every Llama variant available through LocalMode, across all supported providers.

Model ID	Provider	Size	Speed	Quality	Context	Device
TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC	WebLLM (WebGPU)	400MB	Fast	Basic	2,048 tokens	WEBGPU
Llama-3.2-1B-Instruct-q4f16_1-MLC	WebLLM (WebGPU)	712MB	Medium	Good	4,096 tokens	WEBGPU
Llama-3.2-3B-Instruct-q4f16_1-MLC	WebLLM (WebGPU)	1.76GB	Slow	High	4,096 tokens	WEBGPU
Llama-3.1-8B-Instruct-q4f16_1-MLC	WebLLM (WebGPU)	4.5GB	Slow	High	4,096 tokens	WEBGPU
TinyLlama-1.1B-Chat-Q4_K_M	wllama (WASM)	670MB	Medium	Good	2,048 tokens	WASM
Llama-3.2-1B-Instruct-Q4_K_M	wllama (WASM)	750MB	Medium	Good	131,072 tokens	WASM
Llama-3.2-3B-Instruct-Q4_K_M	wllama (WASM)	1.93GB	Medium	High	131,072 tokens	WASM
Llama-3.1-8B-Instruct-Q4_K_M	wllama (WASM)	4.92GB	Slow	High	131,072 tokens	WASM
onnx-community/TinyLlama-1.1B-Chat-v1.0-ONNX	Transformers.js	350MB	Medium	Basic	2,048 tokens	WEBGPU
onnx-community/Llama-3.2-1B-Instruct-ONNX	Transformers.js	380MB	Medium	Good	8,192 tokens	WEBGPU
onnx-community/Llama-3.2-3B-Instruct-ONNX	Transformers.js	900MB	Medium	High	8,192 tokens	WEBGPU

Size Distribution

Size Range	Variants
200MB–500MB	3
500MB–1.5GB	4
1.5GB–3GB	2
Over 3GB	2

How to choose a variant: Start with the smallest model that meets your quality requirements. For prototyping and development, use the fastest variant (smallest size, "Fast" speed tier). For production, test your specific use case against 2–3 variants and measure the quality difference against user expectations. In many applications, users cannot distinguish between "Good" and "High" quality tiers - the smaller model saves download time and memory.
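The "smallest model that meets your quality requirements" rule can be sketched as a lookup over the variant table. The `smallestVariant` helper and the `Variant` shape below are hypothetical and not part of the LocalMode SDK. Sizes are copied from the table and converted to MB (1GB treated as 1000MB for simplicity).

```typescript
type Quality = 'Basic' | 'Good' | 'High';
type Provider = 'webllm' | 'wllama' | 'transformers';

interface Variant {
  id: string;
  provider: Provider;
  sizeMB: number;
  quality: Quality;
}

// Rows from the variant table in this guide.
const VARIANTS: Variant[] = [
  { id: 'TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC', provider: 'webllm', sizeMB: 400, quality: 'Basic' },
  { id: 'Llama-3.2-1B-Instruct-q4f16_1-MLC', provider: 'webllm', sizeMB: 712, quality: 'Good' },
  { id: 'Llama-3.2-3B-Instruct-q4f16_1-MLC', provider: 'webllm', sizeMB: 1760, quality: 'High' },
  { id: 'Llama-3.1-8B-Instruct-q4f16_1-MLC', provider: 'webllm', sizeMB: 4500, quality: 'High' },
  { id: 'TinyLlama-1.1B-Chat-Q4_K_M', provider: 'wllama', sizeMB: 670, quality: 'Good' },
  { id: 'Llama-3.2-1B-Instruct-Q4_K_M', provider: 'wllama', sizeMB: 750, quality: 'Good' },
  { id: 'Llama-3.2-3B-Instruct-Q4_K_M', provider: 'wllama', sizeMB: 1930, quality: 'High' },
  { id: 'Llama-3.1-8B-Instruct-Q4_K_M', provider: 'wllama', sizeMB: 4920, quality: 'High' },
  { id: 'onnx-community/TinyLlama-1.1B-Chat-v1.0-ONNX', provider: 'transformers', sizeMB: 350, quality: 'Basic' },
  { id: 'onnx-community/Llama-3.2-1B-Instruct-ONNX', provider: 'transformers', sizeMB: 380, quality: 'Good' },
  { id: 'onnx-community/Llama-3.2-3B-Instruct-ONNX', provider: 'transformers', sizeMB: 900, quality: 'High' },
];

const QUALITY_RANK: Record<Quality, number> = { Basic: 0, Good: 1, High: 2 };

// Returns the smallest variant at or above the requested quality tier,
// optionally limited by download size and by provider.
function smallestVariant(
  minQuality: Quality,
  maxSizeMB: number = Infinity,
  providers?: Provider[],
): Variant | undefined {
  return VARIANTS
    .filter((v) => QUALITY_RANK[v.quality] >= QUALITY_RANK[minQuality])
    .filter((v) => v.sizeMB <= maxSizeMB)
    .filter((v) => providers === undefined || providers.some((p) => p === v.provider))
    .sort((a, b) => a.sizeMB - b.sizeMB)[0];
}
```

A lookup like this only narrows the candidates. The quality tiers are coarse labels, so the final choice should still come from testing the shortlisted variants on your own prompts.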

Provider-Specific Code Examples

All Llama variants use the same LanguageModel interface from @localmode/core. Switching between providers requires changing only the import and model ID - no application logic changes.
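Because only the import and model ID differ, the provider can be chosen at runtime. The sketch below feature-detects WebGPU through the standard `navigator.gpu.requestAdapter()` call and otherwise falls back to the WASM build. `chooseProvider` and its return shape are hypothetical names for illustration. The navigator-like object is passed in as a parameter so the function can be exercised outside a browser.

```typescript
// Minimal shape of the part of `navigator` this sketch relies on.
interface GpuCapable {
  gpu?: { requestAdapter(): Promise<unknown> };
}

interface ProviderChoice {
  provider: 'webllm' | 'wllama';
  modelId: string;
}

// Picks the WebGPU build when an adapter is available, otherwise the WASM build.
async function chooseProvider(nav: GpuCapable): Promise<ProviderChoice> {
  try {
    const adapter = nav.gpu ? await nav.gpu.requestAdapter() : null;
    if (adapter) {
      return { provider: 'webllm', modelId: 'Llama-3.2-1B-Instruct-q4f16_1-MLC' };
    }
  } catch {
    // Defensive: treat any error during detection as "no usable WebGPU".
  }
  return { provider: 'wllama', modelId: 'Llama-3.2-1B-Instruct-Q4_K_M' };
}
```

In a browser this would be called as `await chooseProvider(navigator)`, and the result would decide which of the provider snippets in this section to run.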

WebLLM (WebGPU)

WebLLM compiles models to WebGPU compute shaders for maximum inference speed. Requires Chrome 113+, Edge 113+, or Safari 26+.

import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

// Smallest WebLLM variant from the table above
const model = webllm.languageModel('TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC');

const result = await streamText({
  model,
  prompt: 'Explain how Llama models work.',
  maxTokens: 300,
  temperature: 0.7,
});

// Collect streamed chunks as they arrive (append to the DOM instead for live output)
let output = '';
for await (const chunk of result.stream) {
  output += chunk.text;
}
console.log(output);

wllama (WASM)

wllama runs GGUF models via llama.cpp compiled to WebAssembly. Works in every browser including Firefox.

import { streamText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

// Smallest wllama (GGUF) variant from the table above
const model = wllama.languageModel('TinyLlama-1.1B-Chat-Q4_K_M');

const result = await streamText({
  model,
  prompt: 'Summarize the benefits of local AI inference.',
  maxTokens: 300,
});

// Collect streamed chunks as they arrive (append to the DOM instead for live output)
let output = '';
for await (const chunk of result.stream) {
  output += chunk.text;
}
console.log(output);

Transformers.js

Transformers.js runs ONNX-optimized models via ONNX Runtime Web. It uses WebGPU acceleration where available and falls back to WASM otherwise.

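import { streamText } from '@localmode/core';
import { transformers } from '@localmode/transformers';

// Smallest Transformers.js (ONNX) variant from the table above
const model = transformers.languageModel('onnx-community/TinyLlama-1.1B-Chat-v1.0-ONNX');

const result = await streamText({
  model,
  prompt: 'Hello, world!',
  maxTokens: 200,
});

// Consume the stream, as in the other provider examples
let output = '';
for await (const chunk of result.stream) {
  output += chunk.text;
}
console.log(output);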

Fallback Pattern

For maximum browser compatibility, wrap model loading in a try/catch: attempt the preferred model first, and fall back to a smaller variant if it fails to load.

import { transformers } from '@localmode/transformers';

// Try the preferred model, fall back to a smaller one on failure
let model;
try {
  model = transformers.languageModel('onnx-community/Llama-3.2-3B-Instruct-ONNX');
} catch (error) {
  console.warn('Primary model failed, using fallback:', error);
  model = transformers.languageModel('onnx-community/TinyLlama-1.1B-Chat-v1.0-ONNX');
}
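The same idea extends to an ordered list of candidates. The sketch below is a generic helper, not a LocalMode API: `firstAvailable` is a hypothetical name, and the loader is injected because this guide does not state whether `languageModel()` fails at construction or only when the model is first used. It assumes load failures surface as a rejected promise from whatever loader you pass in.

```typescript
// Tries each candidate in order and resolves with the first one that loads.
// Rejects with a combined message when every candidate fails.
async function firstAvailable<T>(
  candidates: string[],
  load: (modelId: string) => Promise<T>,
): Promise<T> {
  const errors: string[] = [];
  for (const modelId of candidates) {
    try {
      return await load(modelId);
    } catch (error) {
      // Record the failure and move on to the next, usually smaller, model.
      errors.push(`${modelId}: ${String(error)}`);
    }
  }
  throw new Error(`No model could be loaded:\n${errors.join('\n')}`);
}
```

A call might look like `await firstAvailable(['onnx-community/Llama-3.2-3B-Instruct-ONNX', 'onnx-community/TinyLlama-1.1B-Chat-v1.0-ONNX'], async (id) => transformers.languageModel(id))`. If loading is deferred until first use, the loader would also need to run a short warm-up generation so that failures are caught here.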

When to Use Llama

Llama models are a strong choice when:

You need text generation - Llama is optimized for generation tasks with models across multiple size tiers.
Browser compatibility matters - Available through 3 providers (webllm, wllama, transformers), ensuring coverage across Chrome, Firefox, Safari, and Edge.
Size flexibility is important - The 350MB–4.92GB range means you can target everything from mobile devices to high-end desktops with the same model family.
Offline functionality is required - All variants work offline after the initial download, cached in IndexedDB via LocalMode's model caching system.
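Since offline use depends on the model being cached in browser storage, it can help to check the available quota before starting a large download. The sketch below uses the shape of the standard `navigator.storage.estimate()` result. `hasRoomFor` is a hypothetical helper, not part of LocalMode, and the storage object is passed in so the function can run outside a browser. Browsers report estimates rather than exact figures, so treat the result as a hint.

```typescript
// Minimal shape of `navigator.storage` used by this sketch.
interface StorageLike {
  estimate(): Promise<{ quota?: number; usage?: number }>;
}

// Resolves to true when the remaining quota can hold the download.
async function hasRoomFor(downloadBytes: number, storage: StorageLike): Promise<boolean> {
  const { quota = 0, usage = 0 } = await storage.estimate();
  return quota - usage >= downloadBytes;
}
```

In a browser this would be called as `await hasRoomFor(1.76e9, navigator.storage)` before fetching the 1.76GB variant, with a smaller variant chosen when it resolves to false.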

Methodology

Model sizes, context lengths, and quantization formats were verified directly against LocalMode's provider catalogs (packages/webllm/src/models.ts, packages/wllama/src/models.ts, packages/transformers/src/models.ts), which are the canonical source of truth for all LocalMode-supported variants. External facts - native context lengths, parameter counts, and release dates - were cross-checked against official Meta Llama model cards on HuggingFace and Meta's announcement blog. Note that WebLLM and ONNX variants may expose a smaller context window than the model's native maximum (e.g., 4,096 or 8,192 tokens vs the native 128k) due to browser memory constraints; the wllama GGUF variants surface the full 131,072-token context (equivalent to 128k). Always benchmark on your target devices before production deployment.
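For the on-device benchmarking recommended above, a minimal timing sketch is shown here. It works on any async iterable of chunks with a `text` field, matching how `result.stream` is consumed in this guide. `measureStream` is a hypothetical helper, and it counts chunks as a stand-in for tokens, which is an assumption: a chunk may not correspond to exactly one token.

```typescript
interface Chunk {
  text: string;
}

interface StreamStats {
  chunks: number;
  firstChunkMs: number | null; // time until the first chunk arrived
  totalMs: number;
  chunksPerSecond: number;
}

// Consumes the stream and reports latency and throughput.
// `now` is injectable so the measurement can be tested with a fake clock.
async function measureStream(
  stream: AsyncIterable<Chunk>,
  now: () => number = Date.now,
): Promise<StreamStats> {
  const start = now();
  let chunks = 0;
  let firstChunkMs: number | null = null;
  for await (const _chunk of stream) {
    if (firstChunkMs === null) firstChunkMs = now() - start;
    chunks += 1;
  }
  const totalMs = now() - start;
  const chunksPerSecond = totalMs > 0 ? (chunks * 1000) / totalMs : 0;
  return { chunks, firstChunkMs, totalMs, chunksPerSecond };
}
```

Running `await measureStream(result.stream)` for each shortlisted variant on a representative device gives comparable numbers. Because the start time is taken after `streamText` has resolved, model load time is not included and should be measured separately.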

Sources

meta-llama/Llama-3.2-1B-Instruct model card - context length (128k/131,072 tokens), parameter count (1.23B), release date (Sep 25, 2024), license
meta-llama/Llama-3.2-3B-Instruct model card - context length (128k/131,072 tokens), parameter count (3.21B), release date (Sep 25, 2024)
meta-llama/Llama-3.1-8B-Instruct model card - context length (128k/131,072 tokens), release date (Jul 23, 2024)
TinyLlama/TinyLlama-1.1B-Chat-v1.0 model card - 1.1B parameters
TinyLlama GitHub repository - sequence length 2,048 tokens
Meta Llama 3.2 announcement blog - release details, 128k context
LocalMode WebLLM model catalog - packages/webllm/src/models.ts - sizes and context lengths for MLC variants
LocalMode wllama model catalog - packages/wllama/src/models.ts - sizes and context lengths for GGUF variants
LocalMode transformers model catalog - packages/transformers/src/models.ts - sizes and context lengths for ONNX variants
Transformers.js documentation
WebLLM project
wllama (llama.cpp WASM)

Frequently Asked Questions