Text Generation (LLM Chat) in the Browser

Generate text, answer questions, and have conversations using local LLMs running entirely in the browser.

What Is Text Generation (LLM Chat)?

Text generation uses large language models (LLMs) to produce human-like text from prompts. In the browser, these models run via WebGPU shader compilation (WebLLM), llama.cpp WASM (wllama), ONNX Runtime (Transformers.js v4), or the LiteRT-LM engine from Google (litert). Model downloads range from 70MB to 5GB, and output quality generally improves with size. The same LanguageModel interface works across every provider, so switching backends requires changing only the provider import and the model ID.

This capability is exposed through the streamText() function in @localmode/core. All processing runs entirely in the browser - no server, no API key, no data leaves the device. After the initial model download, text generation works completely offline.

Real-World Applications

  • AI chatbots and customer support widgets
  • Content drafting and writing assistants
  • Code generation and debugging tools
  • Tutoring systems and educational apps
  • Data analysis with natural language queries
  • Agentic workflows with tool use

These use cases all benefit from local, on-device processing: user data stays private, there are no per-request API costs, and the application works without internet after initial setup.

Getting Started

Install the required packages:

npm install @localmode/core @localmode/webllm

Import the core function and provider:

import { streamText, generateText, generateObject } from '@localmode/core';
import { webllm } from '@localmode/webllm';

The recommended starting model is Llama-3.2-1B-Instruct-q4f16_1-MLC - it provides the best balance of quality, speed, and download size for most applications.

Code Example

import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');

// Stream text generation
const result = await streamText({
  model,
  prompt: 'Explain how vector databases work in 3 sentences.',
  maxTokens: 200,
  temperature: 0.7,
});

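// Accumulate chunks as they arrive; in a real page, append each one to the DOM instead.
// (process.stdout is a Node.js API and does not exist in the browser.)
let output = '';
for await (const chunk of result.stream) {
  output += chunk.text;
}
console.log(output);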

// Or use wllama for universal browser support (no WebGPU needed)
import { wllama } from '@localmode/wllama';
const wasmModel = wllama.languageModel('Qwen2.5-0.5B-Instruct-Q4_K_M');

This example demonstrates the core workflow: create a model instance from the provider, call the streamText() function with your prompt, and iterate over the streamed chunks as they arrive. The same pattern works across all 4 available providers: WebLLM (WebGPU), wllama (WASM), Transformers.js (ONNX), and LiteRT.
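The streaming loop can be wrapped in a small helper that forwards each chunk to the UI and also returns the full text. The sketch below is illustrative only: collectText and fakeStream are hypothetical names, and the one assumption about the library is what the example above already shows, namely that result.stream is an async iterable of chunks with a text field.

```typescript
// Hypothetical chunk shape, inferred from `chunk.text` in the example above.
interface TextChunk {
  text: string;
}

// Consume a stream of chunks, forwarding each piece to `onText`
// (for example, to append it to a DOM node) and returning the full text.
async function collectText(
  stream: AsyncIterable<TextChunk>,
  onText: (piece: string) => void = () => {},
): Promise<string> {
  let full = '';
  for await (const chunk of stream) {
    full += chunk.text;
    onText(chunk.text);
  }
  return full;
}

// Stand-in for `result.stream`, so the sketch runs without a model download.
async function* fakeStream(): AsyncGenerator<TextChunk> {
  yield { text: 'Vector databases ' };
  yield { text: 'index embeddings.' };
}
```

In a page, this would presumably be called as collectText(result.stream, (piece) => { el.textContent += piece; }), where el is whatever element displays the answer.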

Available Models

The following models support text generation through LocalMode. Choose based on your target device, acceptable download size, and quality requirements.

Model | Provider | Size | Speed | Quality
Qwen2.5-0.5B-Instruct-q4f16_1-MLC | WebLLM (WebGPU) | 278MB | Fast | Good
Qwen3-0.6B-q4f16_1-MLC | WebLLM (WebGPU) | 350MB | Fast | Good
Qwen2.5-1.5B-Instruct-q4f16_1-MLC | WebLLM (WebGPU) | 868MB | Medium | Good
Qwen2.5-Coder-1.5B-Instruct-q4f16_1-MLC | WebLLM (WebGPU) | 868MB | Medium | Good
Qwen3-1.7B-q4f16_1-MLC | WebLLM (WebGPU) | 1.1GB | Medium | Good
Qwen2.5-3B-Instruct-q4f16_1-MLC | WebLLM (WebGPU) | 1.7GB | Medium | High
Qwen2.5-Coder-3B-Instruct-q4f16_1-MLC | WebLLM (WebGPU) | 1.7GB | Medium | High
Qwen3-4B-q4f16_1-MLC | WebLLM (WebGPU) | 2.2GB | Slow | High
Qwen2.5-7B-Instruct-q4f16_1-MLC | WebLLM (WebGPU) | 4.0GB | Slow | High
Qwen2.5-Coder-7B-Instruct-q4f16_1-MLC | WebLLM (WebGPU) | 4.0GB | Slow | High
Qwen3-8B-q4f16_1-MLC | WebLLM (WebGPU) | 4.5GB | Slow | High
Qwen2.5-0.5B-Instruct-Q4_K_M | wllama (WASM) | 386MB | Fast | Good
Qwen2.5-1.5B-Instruct-Q4_K_M | wllama (WASM) | 986MB | Medium | Good
Qwen2.5-Coder-1.5B-Instruct-Q4_K_M | wllama (WASM) | 1.0GB | Medium | Good
Qwen2.5-3B-Instruct-Q4_K_M | wllama (WASM) | 1.94GB | Medium | High
Qwen2.5-Coder-7B-Instruct-Q4_K_M | wllama (WASM) | 4.5GB | Slow | High
onnx-community/Qwen3-0.6B-ONNX | Transformers.js | 570MB | Medium | Good
onnx-community/Qwen3.5-0.8B-ONNX | Transformers.js | 500MB | Fast | Good
onnx-community/Qwen3.5-2B-ONNX | Transformers.js | 1.5GB | Medium | High
onnx-community/Qwen3.5-4B-ONNX | Transformers.js | 2.5GB | Slow | High
onnx-community/Qwen3-4B-ONNX | Transformers.js | 1.2GB | Medium | High
onnx-community/Qwen2.5-Coder-1.5B-Instruct | Transformers.js | 450MB | Medium | Good
TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC | WebLLM (WebGPU) | 400MB | Fast | Basic
Llama-3.2-1B-Instruct-q4f16_1-MLC | WebLLM (WebGPU) | 712MB | Medium | Good
Llama-3.2-3B-Instruct-q4f16_1-MLC | WebLLM (WebGPU) | 1.76GB | Slow | High
Llama-3.1-8B-Instruct-q4f16_1-MLC | WebLLM (WebGPU) | 4.5GB | Slow | High
TinyLlama-1.1B-Chat-Q4_K_M | wllama (WASM) | 670MB | Medium | Good
Llama-3.2-1B-Instruct-Q4_K_M | wllama (WASM) | 750MB | Medium | Good
Llama-3.2-3B-Instruct-Q4_K_M | wllama (WASM) | 1.93GB | Medium | High
Llama-3.1-8B-Instruct-Q4_K_M | wllama (WASM) | 4.92GB | Slow | High
onnx-community/TinyLlama-1.1B-Chat-v1.0-ONNX | Transformers.js | 350MB | Medium | Basic
onnx-community/Llama-3.2-1B-Instruct-ONNX | Transformers.js | 380MB | Medium | Good
onnx-community/Llama-3.2-3B-Instruct-ONNX | Transformers.js | 900MB | Medium | High
Phi-3.5-mini-instruct-q4f16_1-MLC | WebLLM (WebGPU) | 2.1GB | Slow | High
Phi-3-mini-4k-instruct-q4f16_1-MLC | WebLLM (WebGPU) | 2.2GB | Slow | High
Phi-3.5-vision-instruct-q4f16_1-MLC | WebLLM (WebGPU) | 2.4GB | Slow | High
Phi-3.5-mini-instruct-Q4_K_M | wllama (WASM) | 1.24GB | Medium | High
Phi-4-mini-instruct-Q4_K_M | wllama (WASM) | 2.3GB | Medium | High
onnx-community/Phi-4-mini-instruct-web-q4f16 | Transformers.js | 2.3GB | Slow | High
microsoft/Phi-3-mini-4k-instruct-onnx-web | Transformers.js | 1.2GB | Medium | High
SmolLM2-135M-Instruct-q0f16-MLC | WebLLM (WebGPU) | 78MB | Fast | Basic
SmolLM2-360M-Instruct-q4f16_1-MLC | WebLLM (WebGPU) | 210MB | Fast | Basic
SmolLM2-1.7B-Instruct-q4f16_1-MLC | WebLLM (WebGPU) | 1.0GB | Medium | Good
SmolLM2-135M-Instruct-Q4_K_M | wllama (WASM) | 70MB | Fast | Basic
SmolLM2-360M-Instruct-Q4_K_M | wllama (WASM) | 234MB | Fast | Basic
SmolLM2-1.7B-Instruct-Q4_K_M | wllama (WASM) | 1.06GB | Medium | Good
gemma-2-2b-it-q4f16_1-MLC | WebLLM (WebGPU) | 1.44GB | Medium | Good
gemma-2-9b-it-q4f16_1-MLC | WebLLM (WebGPU) | 5.0GB | Slow | High
Gemma-2-2B-IT-Q4_K_M | wllama (WASM) | 1.3GB | Medium | Good
Ministral-3-3B-Instruct-2512-BF16-q4f16_1-MLC | WebLLM (WebGPU) | 1.8GB | Medium | Good
Ministral-3-3B-Reasoning-2512-q4f16_1-MLC | WebLLM (WebGPU) | 1.8GB | Medium | Good
Mistral-7B-Instruct-v0.3-q4f16_1-MLC | WebLLM (WebGPU) | 4.0GB | Slow | High
Mistral-7B-Instruct-v0.3-Q4_K_M | wllama (WASM) | 4.37GB | Slow | High
DeepSeek-R1-Distill-Qwen-7B-q4f16_1-MLC | WebLLM (WebGPU) | 4.2GB | Slow | High
DeepSeek-R1-Distill-Llama-8B-q4f16_1-MLC | WebLLM (WebGPU) | 4.4GB | Slow | High
onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX | Transformers.js | 500MB | Medium | Good
Hermes-3-Llama-3.2-3B-q4f16_1-MLC | WebLLM (WebGPU) | 1.76GB | Medium | High
Hermes-3-Llama-3.1-8B-q4f16_1-MLC | WebLLM (WebGPU) | 4.9GB | Slow | High
onnx-community/granite-4.0-350m-ONNX-web | Transformers.js | 120MB | Fast | Basic
onnx-community/granite-4.0-1b-ONNX-web | Transformers.js | 350MB | Fast | Basic
onnx-community/gemma-4-E2B-it-ONNX | Transformers.js | 1.5GB | Medium | High
onnx-community/gemma-4-E4B-it-ONNX | Transformers.js | 3GB | Slow | High

Choosing a model: For most applications, start with the recommended model (Llama-3.2-1B-Instruct-q4f16_1-MLC). If download size is the primary constraint (e.g., mobile PWA, browser extension), pick the smallest model that meets your quality bar. If quality is the priority (e.g., enterprise search, content analysis), use the largest model your target devices can handle.
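The two selection rules in the paragraph above can be sketched as simple filters over the catalog. This is a rough illustration, not part of LocalMode: the CatalogEntry type and both function names are hypothetical, the entries are a handful of WebLLM rows copied from the table, and sizes are converted to MB assuming 1GB = 1000MB.

```typescript
type Quality = 'Basic' | 'Good' | 'High';

// Hypothetical record shape for one row of the model table.
interface CatalogEntry {
  id: string;
  sizeMB: number;
  quality: Quality;
}

// A few WebLLM rows from the table (sizes in MB, assuming 1GB = 1000MB).
const catalog: CatalogEntry[] = [
  { id: 'SmolLM2-135M-Instruct-q0f16-MLC', sizeMB: 78, quality: 'Basic' },
  { id: 'Qwen2.5-0.5B-Instruct-q4f16_1-MLC', sizeMB: 278, quality: 'Good' },
  { id: 'Llama-3.2-1B-Instruct-q4f16_1-MLC', sizeMB: 712, quality: 'Good' },
  { id: 'Qwen2.5-3B-Instruct-q4f16_1-MLC', sizeMB: 1700, quality: 'High' },
  { id: 'Qwen3-8B-q4f16_1-MLC', sizeMB: 4500, quality: 'High' },
];

const rank: Record<Quality, number> = { Basic: 0, Good: 1, High: 2 };

// Size-first: the smallest model that meets the quality bar within the budget.
function smallestMeeting(minQuality: Quality, maxSizeMB: number): CatalogEntry | undefined {
  return catalog
    .filter((m) => rank[m.quality] >= rank[minQuality] && m.sizeMB <= maxSizeMB)
    .sort((a, b) => a.sizeMB - b.sizeMB)[0];
}

// Quality-first: the largest model the target device can be expected to handle.
function largestFitting(maxSizeMB: number): CatalogEntry | undefined {
  return catalog
    .filter((m) => m.sizeMB <= maxSizeMB)
    .sort((a, b) => b.sizeMB - a.sizeMB)[0];
}
```

Both helpers return undefined when nothing fits, which is a natural point to fall back to a smaller quality bar or to tell the user the device is unsupported.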

Cloud vs Local: Cost and Privacy Comparison

Running text generation locally eliminates per-request API costs and keeps all data on-device. Here is how the economics compare:

Service | Cost / Notes
OpenAI GPT-4o | $2.50-10 per million tokens
Claude 3.5 Sonnet | $3-15 per million tokens
Local browser LLMs | $0 per token

The quality gap has narrowed significantly: Qwen3-4B scores 83.7% on MMLU-Redux (thinking mode), within 10% of GPT-4o. For many applications (chat, summarization, classification via LLM), local models are sufficient.

The break-even point for most applications is low: if you process more than a few hundred requests per day, local inference costs less than any cloud API within the first week. For privacy-sensitive applications (medical records, legal documents, financial data), the cost comparison is secondary - the ability to process data without it ever leaving the device is the primary value.
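To put a number on the comparison for your own workload, the cloud side reduces to simple arithmetic. The sketch below is a rough estimate under stated assumptions: the function name is hypothetical, the workload (500 requests per day at 1,000 tokens each) is made up for illustration, and a single blended per-token price is used even though real pricing usually differs for input and output tokens.

```typescript
// Estimated cloud spend in USD over `days` days at a flat per-token price.
function monthlyCloudCostUSD(
  requestsPerDay: number,
  tokensPerRequest: number,
  usdPerMillionTokens: number,
  days = 30,
): number {
  const totalTokens = requestsPerDay * tokensPerRequest * days;
  return (totalTokens / 1_000_000) * usdPerMillionTokens;
}

// Assumed workload: 500 requests/day, 1,000 tokens each = 15M tokens per 30 days.
const gpt4oLow = monthlyCloudCostUSD(500, 1000, 2.5); // low end of the GPT-4o range: 37.5
const gpt4oHigh = monthlyCloudCostUSD(500, 1000, 10); // high end of the GPT-4o range: 150
```

Local inference has no per-token charge, so these figures are the recurring spend that moving on-device would avoid for this assumed workload.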

Available Providers

  • WebLLM (WebGPU) - GPU-accelerated via WebGPU compute shaders. Fastest inference. Requires Chrome 113+, Edge 113+, or Safari 26+.
  • wllama (WASM) - WASM-based inference via llama.cpp. Works in every modern browser including Firefox. Universal compatibility.
  • Transformers.js - ONNX-optimized models via ONNX Runtime Web. Supports both WebGPU and WASM backends. Broadest model catalog for non-LLM tasks.
  • LiteRT - Google's on-device LiteRT-LM engine for .litertlm models. Text-only. Gemma 4 models require WebGPU; Qwen3 0.6B runs on WebGPU or CPU.
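Since WebLLM needs WebGPU and wllama does not, an application can probe for WebGPU at startup and choose a provider accordingly. The following is a minimal sketch, assuming only the standard navigator.gpu.requestAdapter() entry point; pickBackend, NavigatorLike, and Backend are hypothetical names, and the navigator object is passed in so the logic can be exercised outside a browser.

```typescript
// Minimal structural type for the part of `navigator` this sketch needs.
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<object | null> };
}

type Backend = 'webllm' | 'wllama';

// Prefer the WebGPU provider when an adapter is actually available;
// otherwise fall back to the WASM provider, which works without WebGPU.
async function pickBackend(nav: NavigatorLike): Promise<Backend> {
  if (!nav.gpu) return 'wllama';
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? 'webllm' : 'wllama';
  } catch {
    // Treat any failure while probing the GPU as "WebGPU not usable".
    return 'wllama';
  }
}
```

In a page, the result of pickBackend(navigator as NavigatorLike) could then decide between, say, webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC') and wllama.languageModel('Qwen2.5-0.5B-Instruct-Q4_K_M') from the earlier example.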

AbortSignal Support

All streamText() calls support cancellation through the standard AbortSignal API:

const controller = new AbortController();

const promise = streamText({
  model,
  prompt: 'input text',
  abortSignal: controller.signal,
});

// Cancel if needed (e.g., user navigates away)
controller.abort();

This is essential for responsive UIs - cancel in-flight operations when the user navigates away, submits a new query, or closes a dialog. The underlying model inference stops immediately, freeing memory and compute resources.
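The "user submits a new query" case is a latest-request-wins pattern: starting a new generation aborts the previous one. A minimal sketch follows, using only the standard AbortController API; latestOnly and Task are hypothetical names, and the task is injected so the helper does not depend on any particular provider.

```typescript
// Any function that accepts an AbortSignal, e.g. a wrapper around streamText().
type Task<T> = (signal: AbortSignal) => Promise<T>;

// Returns a runner that aborts the previous in-flight task whenever a new one starts.
function latestOnly<T>(): (task: Task<T>) => Promise<T> {
  let current: AbortController | null = null;
  return async (task: Task<T>): Promise<T> => {
    current?.abort(); // cancel whatever was still running
    const controller = new AbortController();
    current = controller;
    return task(controller.signal);
  };
}
```

With the API shown above, a call would presumably look like run((signal) => streamText({ model, prompt, abortSignal: signal })), where run is the function returned by latestOnly(). The superseded call rejects or ends early depending on how the task reacts to the abort, so callers should be prepared to ignore that outcome.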

React Integration

If you are building a React application, @localmode/react provides hooks that manage loading states, error handling, and cancellation automatically:

npm install @localmode/react

import { useGenerateText } from '@localmode/react';

The hook returns { data, error, isLoading, execute, cancel, reset } - providing everything a UI component needs to display progress, handle errors, offer cancellation, and reset state.

Methodology

Model IDs, sizes, and descriptions were verified directly against the LocalMode source catalogs: packages/webllm/src/models.ts (32 WebLLM models), packages/wllama/src/models.ts (18 GGUF models), packages/transformers/src/models.ts (16 ONNX LLMs), and packages/litert/src/models.ts (3 LiteRT models). Hook return shapes were verified against packages/react/src/hooks/use-generate-text.ts and packages/react/src/core/use-operation.ts. Cloud pricing was fetched from official provider pricing pages at time of writing and is subject to change. The Qwen3-4B MMLU-Redux score of 83.7 is drawn from the Qwen3 technical report (arXiv 2505.09388); benchmark with your own data for production decisions.

Sources