# Text Generation

Generate and stream text with language models. Supports multimodal vision input (images).

Generate text using local language models with streaming support. All three LLM providers use the same `streamText()` and `generateText()` functions from `@localmode/core`.
**See it in action:** Try LLM Chat and Research Agent for working demos of these APIs.

Need structured JSON output instead of free text? See the Structured Output guide for `generateObject()` and `streamObject()`.
## Providers

LocalMode ships three LLM providers. All implement the same `LanguageModel` interface, so you can swap them without changing application code.

### WebLLM

Fastest inference via WebGPU. 23 curated MLC-compiled models. Requires a WebGPU-capable browser.

```typescript
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
const result = await streamText({ model, prompt: 'Hello!' });
```

### wllama

Runs llama.cpp via WebAssembly. 135K+ GGUF models from HuggingFace. Works in all modern browsers — no GPU required.

```typescript
import { streamText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
const result = await streamText({ model, prompt: 'Hello!' });
```

### Transformers

ONNX models via Transformers.js v4 (experimental). WebGPU acceleration with automatic WASM fallback.

```typescript
import { streamText } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');
const result = await streamText({ model, prompt: 'Hello!' });
```
## Provider Comparison

| | WebLLM | wllama | Transformers |
|---|---|---|---|
| Runtime | MLC WebGPU | llama.cpp WASM | Transformers.js v4 ONNX |
| Speed | 60–100 tok/s | 5–15 tok/s | 40–60 tok/s |
| GPU Required | Yes (WebGPU) | No | No (auto-fallback to WASM) |
| Browser Support | Chrome/Edge 113+, Safari 18+ | All modern browsers | Chrome/Edge 113+, Safari 18+ (WASM everywhere) |
| Model Catalog | 23 curated MLC models | 16 curated + 135K+ GGUF models | 14 curated ONNX models |
| Model Format | MLC (pre-compiled) | GGUF (standard) | ONNX |
| Best For | Maximum speed on GPU-capable devices | Universal compatibility, huge model selection | Broad ONNX ecosystem, WebGPU + WASM flexibility |
| Status | Stable | Stable | Experimental |
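If you want to branch your UI on capability before loading any model, you can probe for WebGPU directly. The sketch below is illustrative only: `chooseProvider` is a hypothetical helper, not part of `@localmode/core`; the probe in the trailing comment uses the standard `navigator.gpu.requestAdapter()` check.

```typescript
// Hypothetical helper (not part of @localmode): pick the fastest provider
// the current browser can run, following the comparison table above.
type ProviderName = 'webllm' | 'wllama';

function chooseProvider(hasWebGPU: boolean): ProviderName {
  // WebLLM needs WebGPU; wllama runs everywhere via WASM.
  return hasWebGPU ? 'webllm' : 'wllama';
}

// In the browser, the flag would come from the standard WebGPU probe:
// const hasWebGPU = 'gpu' in navigator && (await navigator.gpu.requestAdapter()) !== null;
```

For most apps the `createProviderWithFallback()` helper described next is simpler, since it handles this ordering for you.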
## Automatic Fallback

Use `createProviderWithFallback()` to try providers in order — fastest first, most compatible last:

```typescript
import { createProviderWithFallback } from '@localmode/core';
import { webllm } from '@localmode/webllm';
import { transformers } from '@localmode/transformers';
import { wllama } from '@localmode/wllama';

const model = await createProviderWithFallback({
  providers: [
    () => webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
    () => transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX'),
    () => wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M'),
  ],
  onFallback: (error, idx) => console.warn(`Provider ${idx} failed:`, error),
});
```

For model catalogs, provider-specific configuration, and detailed setup, see the WebLLM, wllama, and Transformers Text Generation guides.
## streamText()

Stream text generation for real-time responses:

```typescript
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');

const result = await streamText({
  model,
  prompt: 'Explain quantum computing in simple terms.',
});

for await (const chunk of result.stream) {
  process.stdout.write(chunk.text);
}
```

### With System Prompt
```typescript
const result = await streamText({
  model,
  system: 'You are a helpful coding assistant. Be concise.',
  prompt: 'Write a function to reverse a string in TypeScript.',
});
```

### Options
```typescript
interface StreamTextOptions {
  model: LanguageModel;
  prompt: string;
  system?: string;
  maxTokens?: number;
  temperature?: number;
  topP?: number;
  stopSequences?: string[];
  abortSignal?: AbortSignal;
}
```

### Stream Properties
```typescript
const result = await streamText({ model, prompt: 'Hello' });

// Iterate over text chunks
for await (const chunk of result.stream) {
  console.log(chunk.text); // The generated text piece
  console.log(chunk.done); // Whether this is the final chunk
}

// Get full text after streaming
const fullText = await result.text;

// Get usage statistics
const usage = await result.usage;
console.log('Tokens:', usage.totalTokens);
```

## generateText()
Generate complete text without streaming:

```typescript
import { generateText } from '@localmode/core';

const { text, usage } = await generateText({
  model,
  prompt: 'Write a haiku about programming.',
});

console.log(text);
console.log('Tokens used:', usage.totalTokens);
```

### Options
```typescript
interface GenerateTextOptions {
  model: LanguageModel;
  prompt: string;
  system?: string;
  maxTokens?: number;
  temperature?: number;
  topP?: number;
  stopSequences?: string[];
  abortSignal?: AbortSignal;
}
```

### Return Value
```typescript
interface GenerateTextResult {
  text: string;
  finishReason: FinishReason;
  usage: {
    inputTokens: number;
    outputTokens: number;
    totalTokens: number;
    durationMs: number;
  };
  response: {
    modelId: string;
    timestamp: Date;
  };
}
```

## Cancellation
Cancel generation mid-stream:

```typescript
const controller = new AbortController();

// Cancel after 5 seconds
setTimeout(() => controller.abort(), 5000);

try {
  const result = await streamText({
    model,
    prompt: 'Write a long essay...',
    abortSignal: controller.signal,
  });
  for await (const chunk of result.stream) {
    process.stdout.write(chunk.text);
  }
} catch (error) {
  if (error instanceof Error && error.name === 'AbortError') {
    console.log('\nGeneration cancelled');
  }
}
```

## Temperature & Sampling
Control randomness in generation:

```typescript
// More deterministic (good for factual responses)
const factual = await streamText({
  model,
  prompt: 'What is 2 + 2?',
  temperature: 0.1,
});

// More creative (good for stories, brainstorming)
const creative = await streamText({
  model,
  prompt: 'Write a creative story about a robot.',
  temperature: 0.9,
});

// Nucleus sampling
const nucleus = await streamText({
  model,
  prompt: 'Continue this sentence: The future of AI is...',
  topP: 0.9, // Consider tokens making up 90% of probability mass
});
```

| Parameter | Description | Range | Default |
|---|---|---|---|
| `temperature` | Randomness | 0.0–2.0 | 1.0 |
| `topP` | Nucleus sampling | 0.0–1.0 | 1.0 |
| `maxTokens` | Max generation length | 1 – model max | Model default |
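To make the `topP` row concrete, here is what nucleus sampling does conceptually (this is an illustration of the technique, not LocalMode internals or API): keep the most probable tokens until their cumulative probability reaches `p`, and sample only from that set.

```typescript
// Conceptual illustration of nucleus (top-p) sampling, not a LocalMode API.
// Returns the indices of the smallest set of tokens whose cumulative
// probability reaches p, ordered from most to least probable.
function topPFilter(probs: number[], p: number): number[] {
  const ranked = probs
    .map((prob, index) => ({ prob, index }))
    .sort((a, b) => b.prob - a.prob);

  const kept: number[] = [];
  let cumulative = 0;
  for (const { prob, index } of ranked) {
    kept.push(index);
    cumulative += prob;
    if (cumulative >= p) break; // enough probability mass covered
  }
  return kept;
}

// With probs [0.5, 0.3, 0.15, 0.05] and p = 0.9, tokens 0, 1, and 2 survive
// (0.5 + 0.3 + 0.15 = 0.95 >= 0.9); the 5% tail token is cut off.
```

Lower `topP` values trim more of the low-probability tail, which reduces rambling at the cost of variety.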
## Stop Sequences

Stop generation at specific patterns:

```typescript
const result = await streamText({
  model,
  prompt: 'List three fruits:\n1.',
  stopSequences: ['\n4.', '\n\n'], // Stop before 4th item or double newline
});
```

## Chat-Style Prompts
Build chat applications:

```typescript
function buildPrompt(messages: Array<{ role: string; content: string }>) {
  return messages
    .map((m) => `${m.role}: ${m.content}`)
    .join('\n') + '\nassistant:';
}

const messages = [
  { role: 'user', content: 'Hello!' },
  { role: 'assistant', content: 'Hi! How can I help you today?' },
  { role: 'user', content: 'What is TypeScript?' },
];

const result = await streamText({
  model,
  system: 'You are a helpful programming assistant.',
  prompt: buildPrompt(messages),
  stopSequences: ['user:', '\n\n'],
});
```

## RAG Integration
Combine with retrieval:

```typescript
import { semanticSearch, streamText } from '@localmode/core';

// Assumes `db` (a vector store), `embeddingModel`, and `llm` are already initialized
async function ragQuery(question: string) {
  // Retrieve context
  const results = await semanticSearch({ db, model: embeddingModel, query: question, k: 3 });
  const context = results.map((r) => r.metadata.text).join('\n\n');

  // Generate answer
  const result = await streamText({
    model: llm,
    system: 'Answer based only on the provided context.',
    prompt: `Context:\n${context}\n\nQuestion: ${question}\n\nAnswer:`,
  });
  return result;
}
```

## Implementing Custom Models
Create your own language model:

```typescript
import type { LanguageModel, DoGenerateOptions, DoStreamOptions, StreamChunk } from '@localmode/core';

class MyLanguageModel implements LanguageModel {
  readonly modelId = 'custom:my-model';
  readonly provider = 'custom';
  readonly contextLength = 4096;

  async doGenerate(options: DoGenerateOptions) {
    // Your generation logic
    return {
      text: 'Generated text...',
      finishReason: 'stop' as const,
      usage: { inputTokens: 10, outputTokens: 20, totalTokens: 30, durationMs: 100 },
      response: { modelId: this.modelId, timestamp: new Date() },
    };
  }

  async *doStream(options: DoStreamOptions): AsyncIterable<StreamChunk> {
    yield { text: 'Hello', done: false };
    yield { text: ' world!', done: true, finishReason: 'stop' };
  }
}
```

## Vision (Multimodal Input)
Send images alongside text to vision-capable language models. Content parts use a discriminated union — `TextPart | ImagePart` — for type-safe multimodal messages.

### Content Part Types

```typescript
import type { ContentPart, TextPart, ImagePart } from '@localmode/core';

// Text part
const text: TextPart = { type: 'text', text: 'What is in this image?' };

// Image part (base64-encoded, no data: prefix)
const image: ImagePart = {
  type: 'image',
  data: 'iVBORw0KGgo...', // base64
  mimeType: 'image/jpeg',
};
```

### Sending Images with Messages
```typescript
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const model = webllm.languageModel('Phi-3.5-vision-instruct-q4f16_1-MLC');

const result = await streamText({
  model,
  prompt: '',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this image in detail.' },
      { type: 'image', data: base64ImageData, mimeType: 'image/jpeg' },
    ],
  }],
});
```

### Checking Vision Support
```typescript
if (model.supportsVision) {
  // Show image upload UI
}
```

### Content Utilities
```typescript
import { normalizeContent, getTextContent } from '@localmode/core';

// Convert string → ContentPart[]
normalizeContent('Hello');
// [{ type: 'text', text: 'Hello' }]

// Extract text from mixed content
getTextContent([
  { type: 'text', text: 'Describe' },
  { type: 'image', data: '...', mimeType: 'image/png' },
]);
// 'Describe'
```

### Supported Vision Models
| Provider | Model | Size | Notes |
|---|---|---|---|
| WebLLM | Phi 3.5 Vision | 2.4GB | WebGPU required |
| Transformers (ONNX) | Qwen3.5 0.8B | ~500MB | Experimental, WebGPU recommended |
| Transformers (ONNX) | Qwen3.5 2B | ~1.5GB | Experimental, WebGPU recommended |
| Transformers (ONNX) | Qwen3.5 4B | ~2.5GB | Experimental, WebGPU required |
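As noted above, an `ImagePart` expects bare base64 with no `data:` URL prefix. A minimal sketch of producing that string from raw image bytes (`toBase64` is a hypothetical helper, not part of `@localmode/core`):

```typescript
// Hypothetical helper (not part of @localmode/core): encode raw image bytes
// as the bare base64 string an ImagePart expects (no `data:` URL prefix).
function toBase64(bytes: Uint8Array): string {
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Usage with a File from an <input type="file"> element:
// const bytes = new Uint8Array(await file.arrayBuffer());
// const image = { type: 'image', data: toBase64(bytes), mimeType: file.type };
```

`btoa` is available in all modern browsers (and Node 16+); for very large images you may prefer a `FileReader` data-URL approach and strip the prefix yourself.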
## Best Practices

### Generation Tips

- **Stream for UX** — always use `streamText()` for user-facing apps
- **Set max tokens** — prevent runaway generation
- **Use system prompts** — guide model behavior consistently
- **Handle errors** — wrap generation in try-catch
- **Provide cancellation** — let users abort long generations
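The error-handling and cancellation tips combine naturally into one small wrapper. This is a sketch, not a LocalMode API; it works with any function that accepts an `AbortSignal`, such as `streamText()` above.

```typescript
// Sketch of a timeout wrapper for the tips above: abort any signal-aware
// call after `ms` milliseconds. Not a LocalMode API.
async function withTimeout<T>(
  ms: number,
  run: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await run(controller.signal);
  } finally {
    clearTimeout(timer); // always clean up, even on error or abort
  }
}

// Usage (assuming the streamText API shown earlier):
// const result = await withTimeout(30_000, (signal) =>
//   streamText({ model, prompt, maxTokens: 512, abortSignal: signal }),
// );
```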
## Next Steps

- **Semantic Cache**: Cache LLM responses using embedding similarity for 100–600x speedup.
- **Language Model Middleware**: Add caching, logging, and guardrails to language models.
- **WebLLM**: WebGPU-accelerated LLM inference — 23 curated models.
- **wllama**: GGUF models via llama.cpp WASM — universal browser support.
- **Transformers Text Generation**: ONNX models via Transformers.js v4 (experimental).
- **RAG**: Build retrieval-augmented generation pipelines.
## Showcase Apps
| App | Description | Links |
|---|---|---|
| LLM Chat | Stream text generation with multiple LLM backends and vision (image input) | Demo · Source |
| PDF Search | Generate answers from PDF context with streaming | Demo · Source |
| LangChain RAG | Generate answers in a retrieval-augmented pipeline | Demo · Source |
| Data Extractor | Extract structured data with generateObject() | Demo · Source |
| Research Agent | Multi-step reasoning with tool-augmented generation | Demo · Source |