Text Generation
Generate and stream text with language models. Supports multimodal vision input (images).
Generate text using local language models with streaming support. All four LLM providers use the same streamText() and generateText() functions from @localmode/core.
See it in action
Try LLM Chat and Research Agent for working demos of these APIs.
Need structured JSON output instead of free text? See the Structured Output guide for generateObject() and streamObject().
Providers
LocalMode ships four LLM providers. All implement the same LanguageModel interface, so you can swap them without changing application code.
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';
const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
const result = await streamText({ model, prompt: 'Hello!' });
WebLLM: fastest inference via WebGPU. 32 curated MLC-compiled models including Phi 3.5 Vision. Requires a WebGPU-capable browser.
import { streamText } from '@localmode/core';
import { wllama } from '@localmode/wllama';
const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
const result = await streamText({ model, prompt: 'Hello!' });
wllama: runs llama.cpp via WebAssembly. 18 curated default models plus 160K+ GGUF models from HuggingFace. Works in all modern browsers with no GPU required.
import { streamText } from '@localmode/core';
import { transformers } from '@localmode/transformers';
const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');
const result = await streamText({ model, prompt: 'Hello!' });
Transformers: 16 curated ONNX models (including 5 vision-capable Qwen3.5 and Gemma 4 variants) via Transformers.js v4. WebGPU acceleration with automatic WASM fallback.
import { streamText } from '@localmode/core';
import { litert } from '@localmode/litert';
const model = litert.languageModel('qwen3-0.6B');
const result = await streamText({ model, prompt: 'Hello!' });
LiteRT: Google's LiteRT-LM engine via @litert-lm/core, the first-party browser bindings for the inference engine Google uses across its own on-device AI products (Chrome's built-in AI, Chromebook Plus, Pixel Watch Smart Replies). Runs .litertlm models on a WebGPU backend with CPU WASM fallback for portable models. 3 verified models (gemma-4-E2B, gemma-4-E4B, qwen3-0.6B). Text-only.
Provider Comparison
| | WebLLM | wllama | Transformers | LiteRT |
|---|---|---|---|---|
| Runtime | MLC WebGPU | llama.cpp WASM | Transformers.js v4 ONNX | LiteRT-LM (Google) |
| GPU Required | Yes (WebGPU) | No | No (auto-fallback to WASM) | WebGPU for Gemma 4; Qwen3 0.6B runs on CPU WASM too |
| Browser Support | Chrome/Edge 113+, Safari 26+, Firefox 141+ | All modern browsers | Chrome/Edge 113+, Safari 26+ (WASM everywhere) | WebGPU-capable browsers (Chrome/Edge 113+, Safari 26+, Firefox 141+) |
| Model Catalog | 32 curated MLC models | 18 curated + 160K+ GGUF | 16 curated ONNX models | 3 verified models |
| Model Format | MLC (pre-compiled) | GGUF (standard) | ONNX | .litertlm (Google) |
| Multimodal | Phi 3.5 Vision | Holo2 4B/8B (vision) | 5 vision variants (Qwen3.5 + Gemma 4) | Text-only |
| Best For | Maximum speed on GPU-capable devices | Universal compatibility, huge model selection | Broad ONNX ecosystem, WebGPU + WASM flexibility | Google's officially-supported on-device pipeline (Gemma 4) |
| Status | Stable | Stable | Stable | Early preview (@litert-lm/core@^0.12.1) |
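Read as a decision rule, the table suggests checking for WebGPU first and choosing a WASM runtime when it is missing. The sketch below is one hedged way to encode that. `hasWebGPU` and `suggestProvider` are hypothetical helpers, not part of any @localmode package, and the preference order is an assumption drawn from the "Best For" row.

```typescript
// Hypothetical helpers for illustration; not part of any @localmode package.
type ProviderName = 'webllm' | 'wllama' | 'transformers' | 'litert';

// Feature-detect WebGPU via navigator.gpu without assuming DOM typings.
function hasWebGPU(): boolean {
  const nav = (globalThis as unknown as { navigator?: object | null }).navigator;
  return nav !== undefined && nav !== null && 'gpu' in nav;
}

// Assumed preference: WebLLM when WebGPU is present, otherwise wllama,
// which the table lists as needing no GPU.
function suggestProvider(webgpu: boolean): ProviderName {
  return webgpu ? 'webllm' : 'wllama';
}
```

The presence of `navigator.gpu` does not guarantee a usable adapter, so treat the result as a hint and keep a runtime fallback in place.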
Automatic Fallback
Use a try/catch chain to try providers in order — fastest first, most compatible last:
import { webllm } from '@localmode/webllm';
import { wllama } from '@localmode/wllama';
import { litert } from '@localmode/litert';
let model;
try {
  // Google's optimized engine for supported devices
  model = litert.languageModel('qwen3-0.6B');
} catch {
  try {
    // MLC WebGPU: fastest general-purpose path
    model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
  } catch {
    console.warn('WebGPU unavailable, falling back to wllama (WASM)');
    model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
  }
}
For model catalogs, provider-specific configuration, and detailed setup, see the WebLLM, wllama, Transformers Text Generation, and LiteRT guides.
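Nested try/catch blocks get hard to read once there are more than two candidates. A minimal sketch of the same idea as a flat list follows. `firstAvailable` is a hypothetical helper, not a @localmode API, and it assumes, as the chain above implies, that `languageModel()` throws synchronously when a backend is unavailable. If a provider only fails later, during model load or generation, the check would need to wrap that step instead.

```typescript
// Hypothetical helper: tries each factory in order and returns the first
// value that is created without throwing.
function firstAvailable<T>(factories: Array<() => T>): T {
  const errors: unknown[] = [];
  for (const create of factories) {
    try {
      return create();
    } catch (error) {
      errors.push(error); // remember why this candidate was skipped
    }
  }
  throw new Error(`No provider available (${errors.length} candidates failed)`);
}

// Usage with the providers above (not executed here):
// const model = firstAvailable([
//   () => litert.languageModel('qwen3-0.6B'),
//   () => webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
//   () => wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M'),
// ]);
```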
streamText()
Stream text generation for real-time responses:
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';
const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
const result = await streamText({
  model,
  prompt: 'Explain quantum computing in simple terms.',
});
for await (const chunk of result.stream) {
  process.stdout.write(chunk.text);
}
With System Prompt
const result = await streamText({
  model,
  system: 'You are a helpful coding assistant. Be concise.',
  prompt: 'Write a function to reverse a string in TypeScript.',
});
Options
interface StreamTextOptions {
  model: LanguageModel;
  prompt: string;
  system?: string;
  maxTokens?: number;
  temperature?: number;
  topP?: number;
  stopSequences?: string[];
  abortSignal?: AbortSignal;
}
Stream Properties
const result = await streamText({ model, prompt: 'Hello' });
// Iterate over text chunks
for await (const chunk of result.stream) {
  console.log(chunk.text); // The generated text piece
  console.log(chunk.done); // Whether this is the final chunk
}
// Get full text after streaming
const fullText = await result.text;
// Get usage statistics
const usage = await result.usage;
console.log('Tokens:', usage.totalTokens);
generateText()
Generate complete text without streaming:
import { generateText } from '@localmode/core';
const { text, usage } = await generateText({
  model,
  prompt: 'Write a haiku about programming.',
});
console.log(text);
console.log('Tokens used:', usage.totalTokens);
Options
interface GenerateTextOptions {
  model: LanguageModel;
  prompt: string;
  system?: string;
  maxTokens?: number;
  temperature?: number;
  topP?: number;
  stopSequences?: string[];
  abortSignal?: AbortSignal;
}
Return Value
interface GenerateTextResult {
  text: string;
  finishReason: FinishReason;
  usage: {
    inputTokens: number;
    outputTokens: number;
    totalTokens: number;
    durationMs: number;
  };
  response: {
    modelId: string;
    timestamp: Date;
  };
}
Cancellation
Cancel generation mid-stream:
const controller = new AbortController();
// Cancel after 5 seconds
setTimeout(() => controller.abort(), 5000);
try {
  const result = await streamText({
    model,
    prompt: 'Write a long essay...',
    abortSignal: controller.signal,
  });
  for await (const chunk of result.stream) {
    process.stdout.write(chunk.text);
  }
} catch (error) {
  // `error` is `unknown` in strict TypeScript, so narrow before reading .name
  if (error instanceof Error && error.name === 'AbortError') {
    console.log('\nGeneration cancelled');
  } else {
    throw error; // do not swallow unrelated failures
  }
}
Temperature & Sampling
Control randomness in generation:
// More deterministic (good for factual responses)
const factual = await streamText({
  model,
  prompt: 'What is 2 + 2?',
  temperature: 0.1,
});
// More creative (good for stories, brainstorming)
const creative = await streamText({
  model,
  prompt: 'Write a creative story about a robot.',
  temperature: 0.9,
});
// Nucleus sampling
const nucleus = await streamText({
  model,
  prompt: 'Continue this sentence: The future of AI is...',
  topP: 0.9, // Consider tokens making up 90% of probability
});
| Parameter | Description | Range | Default |
|---|---|---|---|
| temperature | Randomness | 0.0 - 2.0 | 1.0 |
| topP | Nucleus sampling | 0.0 - 1.0 | 1.0 |
| maxTokens | Max generation length | 1 - model max | Model default |
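To make the two sampling parameters concrete, here is a toy sketch of what temperature scaling and nucleus (top-p) filtering do to a next-token distribution. It illustrates the general technique only. The function names are hypothetical, and nothing here claims to match how a particular provider implements sampling internally.

```typescript
// Convert raw scores (logits) into probabilities, scaled by temperature.
// Lower temperature sharpens the distribution; higher temperature flattens it.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  const t = Math.max(temperature, 1e-6); // avoid division by zero at temperature 0
  const scaled = logits.map((x) => x / t);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Keep the smallest set of highest-probability tokens whose cumulative
// probability reaches topP, then renormalize. Returns [tokenIndex, probability].
function topPFilter(probs: number[], topP: number): Array<[number, number]> {
  const ranked = probs
    .map((p, i): [number, number] => [i, p])
    .sort((a, b) => b[1] - a[1]);
  const kept: Array<[number, number]> = [];
  let cumulative = 0;
  for (const entry of ranked) {
    kept.push(entry);
    cumulative += entry[1];
    if (cumulative >= topP) break;
  }
  const total = kept.reduce((s, [, p]) => s + p, 0);
  return kept.map(([i, p]): [number, number] => [i, p / total]);
}
```

With logits `[2, 1, 0]`, a temperature of 0.1 puts almost all probability on the first token, while a temperature of 1.0 leaves the other two with a meaningful share. Applying `topP: 0.9` to the latter drops the least likely token.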
Stop Sequences
Stop generation at specific patterns:
const result = await streamText({
  model,
  prompt: 'List three fruits:\n1.',
  stopSequences: ['\n4.', '\n\n'], // Stop before 4th item or double newline
});
Chat-Style Prompts
Build chat applications:
function buildPrompt(messages: Array<{ role: string; content: string }>) {
  return messages
    .map((m) => `${m.role}: ${m.content}`)
    .join('\n') + '\nassistant:';
}
const messages = [
  { role: 'user', content: 'Hello!' },
  { role: 'assistant', content: 'Hi! How can I help you today?' },
  { role: 'user', content: 'What is TypeScript?' },
];
const result = await streamText({
  model,
  system: 'You are a helpful programming assistant.',
  prompt: buildPrompt(messages),
  stopSequences: ['user:', '\n\n'],
});
RAG Integration
Combine with retrieval:
import { semanticSearch, streamText } from '@localmode/core';
// Assumes db, embeddingModel, and llm were created elsewhere in your app
async function ragQuery(question: string) {
  // Retrieve context
  const results = await semanticSearch({ db, model: embeddingModel, query: question, k: 3 });
  const context = results.map((r) => r.metadata.text).join('\n\n');
  // Generate answer
  const result = await streamText({
    model: llm,
    system: 'Answer based only on the provided context.',
    prompt: `Context:\n${context}\n\nQuestion: ${question}\n\nAnswer:`,
  });
  return result;
}
Implementing Custom Models
Create your own language model:
import type { LanguageModel, DoGenerateOptions, DoStreamOptions, StreamChunk } from '@localmode/core';
class MyLanguageModel implements LanguageModel {
  readonly modelId = 'custom:my-model';
  readonly provider = 'custom';
  readonly contextLength = 4096;
  async doGenerate(options: DoGenerateOptions) {
    // Your generation logic
    return {
      text: 'Generated text...',
      finishReason: 'stop' as const,
      usage: { inputTokens: 10, outputTokens: 20, totalTokens: 30, durationMs: 100 },
      response: { modelId: this.modelId, timestamp: new Date() },
    };
  }
  async *doStream(options: DoStreamOptions): AsyncIterable<StreamChunk> {
    yield { text: 'Hello', done: false };
    yield { text: ' world!', done: true, finishReason: 'stop' };
  }
}
Vision & Audio (Multimodal Input)
Send images and audio alongside text to multimodal language models. Content parts use a discriminated union — TextPart | ImagePart | AudioPart — for type-safe multimodal messages.
Content Part Types
import type { ContentPart, TextPart, ImagePart, AudioPart } from '@localmode/core';
// Text part
const text: TextPart = { type: 'text', text: 'What is in this image?' };
// Image part (base64-encoded, no data: prefix)
const image: ImagePart = {
  type: 'image',
  data: 'iVBORw0KGgo...', // base64 (this prefix is the PNG signature)
  mimeType: 'image/png',
};
// Audio part (base64-encoded, no data: prefix)
const audio: AudioPart = {
  type: 'audio',
  data: 'UklGRiQAAAA...', // base64
  mimeType: 'audio/wav',
};
Vision input is supported by Phi 3.5 Vision (webllm) and Qwen3.5 (transformers).
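Browser APIs such as `FileReader.readAsDataURL()` return a data URL, while the content parts above expect bare base64 with a separate MIME type. The following sketch shows one way to split the two. `parseDataUrl` and `Base64Payload` are hypothetical names, not exports of @localmode/core, and the helper assumes a base64-encoded data URL without extra parameters.

```typescript
// Hypothetical helper types and function; not part of @localmode/core.
interface Base64Payload {
  data: string;     // base64 without the "data:...;base64," prefix
  mimeType: string; // for example "image/png"
}

// Split "data:<mime>;base64,<payload>" into its MIME type and payload.
function parseDataUrl(dataUrl: string): Base64Payload {
  const prefix = 'data:';
  const marker = ';base64,';
  const markerIndex = dataUrl.indexOf(marker);
  if (!dataUrl.startsWith(prefix) || markerIndex === -1) {
    throw new Error('Expected a base64 data URL such as "data:image/png;base64,..."');
  }
  return {
    mimeType: dataUrl.slice(prefix.length, markerIndex),
    data: dataUrl.slice(markerIndex + marker.length),
  };
}
```

The result can be spread into an image part, for example `{ type: 'image', ...parseDataUrl(url) }`.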
Sending Images with Messages
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';
const model = webllm.languageModel('Phi-3.5-vision-instruct-q4f16_1-MLC');
const result = await streamText({
  model,
  prompt: '',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this image in detail.' },
      { type: 'image', data: base64ImageData, mimeType: 'image/jpeg' },
    ],
  }],
});
Checking Vision Support
if (model.supportsVision) {
  // Show image upload UI
}
Content Utilities
import { normalizeContent, getTextContent } from '@localmode/core';
// Convert string → ContentPart[]
normalizeContent('Hello');
// [{ type: 'text', text: 'Hello' }]
// Extract text from mixed content
getTextContent([
{ type: 'text', text: 'Describe' },
{ type: 'image', data: '...', mimeType: 'image/png' },
]);
// 'Describe'
Supported Vision Models
| Provider | Model | Size | Notes |
|---|---|---|---|
| WebLLM | Phi 3.5 Vision | 2.4GB | WebGPU required |
| Transformers (ONNX) | Qwen3.5 0.8B | ~500MB | Experimental, WebGPU recommended |
| Transformers (ONNX) | Qwen3.5 2B | ~1.5GB | Experimental, WebGPU recommended |
| Transformers (ONNX) | Qwen3.5 4B | ~2.5GB | Experimental, WebGPU required |
Best Practices
Generation Tips
- Stream for UX: always use streamText() for user-facing apps
- Set max tokens: prevent runaway generation
- Use system prompts: guide model behavior consistently
- Handle errors: wrap generation in try-catch
- Provide cancellation: let users abort long generations
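The last two tips can share a small utility. The sketch below creates an abort signal that fires after a timeout and exposes a way to clear the timer once generation finishes, so it cannot fire late. `createTimeoutAbort` and `isAbortError` are hypothetical helpers, and the check assumes cancellations surface as an error named `AbortError`, as in the Cancellation example.

```typescript
// Hypothetical helper: an AbortSignal that fires after `ms`, plus a way to
// clear the timer once generation has finished.
function createTimeoutAbort(ms: number): { signal: AbortSignal; clear: () => void } {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  return {
    signal: controller.signal,
    clear: () => clearTimeout(timer),
  };
}

// Narrow an unknown caught value to an abort error before handling it.
function isAbortError(error: unknown): boolean {
  return error instanceof Error && error.name === 'AbortError';
}
```

Pass `signal` as `abortSignal`, call `clear()` in a `finally` block, and rethrow anything for which `isAbortError` returns false.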
Next Steps
Semantic Cache
Cache LLM responses using embedding similarity for 100-600x speedup.
Language Model Middleware
Add caching, logging, and guardrails to language models.
WebLLM
WebGPU-accelerated LLM inference — 32 curated models.
wllama
GGUF models via llama.cpp WASM — universal browser support.
Transformers Text Generation
16 ONNX LLMs (5 vision) plus Kokoro TTS and generative OCR via Transformers.js v4.
RAG
Build retrieval-augmented generation pipelines.
Showcase Apps
| App | Description | Links |
|---|---|---|
| LLM Chat | Stream text generation with multiple LLM backends and vision (image input) | Demo · Source |
| PDF Search | Generate answers from PDF context with streaming | Demo · Source |
| LangChain RAG | Generate answers in a retrieval-augmented pipeline | Demo · Source |
| Data Extractor | Extract structured data with generateObject() | Demo · Source |
| Research Agent | Multi-step reasoning with tool-augmented generation | Demo · Source |