Text Generation

Generate and stream text using local language models, with support for multimodal vision input (images). All four LLM providers use the same streamText() and generateText() functions from @localmode/core.

See it in action

Try LLM Chat and Research Agent for working demos of these APIs.

Need structured JSON output instead of free text? See the Structured Output guide for generateObject() and streamObject().

Providers

LocalMode ships four LLM providers. All implement the same LanguageModel interface, so you can swap them without changing application code.

import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
const result = await streamText({ model, prompt: 'Hello!' });

Fastest inference via WebGPU. 32 curated MLC-compiled models including Phi 3.5 Vision. Requires a WebGPU-capable browser.

import { streamText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
const result = await streamText({ model, prompt: 'Hello!' });

Runs llama.cpp via WebAssembly. 18 curated default models plus 160K+ GGUF models from HuggingFace. Works in all modern browsers — no GPU required.

import { streamText } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');
const result = await streamText({ model, prompt: 'Hello!' });

16 curated ONNX models (including 5 vision-capable Qwen3.5 and Gemma 4 variants) via Transformers.js v4. WebGPU acceleration with automatic WASM fallback.

import { streamText } from '@localmode/core';
import { litert } from '@localmode/litert';

const model = litert.languageModel('qwen3-0.6B');
const result = await streamText({ model, prompt: 'Hello!' });

Google's LiteRT-LM engine via @litert-lm/core — first-party browser bindings for the inference engine Google uses across its own on-device AI products (Chrome's built-in AI, Chromebook Plus, Pixel Watch Smart Replies). Runs .litertlm models on a WebGPU backend with CPU WASM fallback for portable models. 3 verified models (gemma-4-E2B, gemma-4-E4B, qwen3-0.6B). Text-only.

Provider Comparison

| | WebLLM | wllama | Transformers | LiteRT |
| --- | --- | --- | --- | --- |
| Runtime | MLC WebGPU | llama.cpp WASM | Transformers.js v4 ONNX | LiteRT-LM (Google) |
| GPU Required | Yes (WebGPU) | No | No (auto-fallback to WASM) | WebGPU for Gemma 4; Qwen3 0.6B runs on CPU WASM too |
| Browser Support | Chrome/Edge 113+, Safari 26+, Firefox 141+ | All modern browsers | Chrome/Edge 113+, Safari 26+ (WASM everywhere) | WebGPU-capable browsers (Chrome/Edge 113+, Safari 26+, Firefox 141+) |
| Model Catalog | 32 curated MLC models | 18 curated + 160K+ GGUF | 16 curated ONNX models | 3 verified models |
| Model Format | MLC (pre-compiled) | GGUF (standard) | ONNX | .litertlm (Google) |
| Multimodal | Phi 3.5 Vision | Holo2 4B/8B (vision) | 5 vision variants (Qwen3.5 + Gemma 4) | Text-only |
| Best For | Maximum speed on GPU-capable devices | Universal compatibility, huge model selection | Broad ONNX ecosystem, WebGPU + WASM flexibility | Google's officially-supported on-device pipeline (Gemma 4) |
| Status | Stable | Stable | Stable | Early preview (@litert-lm/core@^0.12.1) |

Automatic Fallback

Use a try/catch chain to try providers in order — fastest first, most compatible last:

import { webllm } from '@localmode/webllm';
import { wllama } from '@localmode/wllama';
import { litert } from '@localmode/litert';

let model;
try {
  // Google's optimized engine for supported devices
  model = litert.languageModel('qwen3-0.6B');
} catch {
  try {
    // MLC WebGPU — fastest general-purpose path
    model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
  } catch {
    console.warn('WebGPU unavailable, falling back to wllama (WASM)');
    model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
  }
}
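
The nested try/catch above generalizes to any number of providers. The sketch below is a hypothetical helper (firstAvailable is not part of @localmode/core) that walks a list of factories and returns the first one that does not throw. It assumes, as the example above does, that an unsupported provider signals failure by throwing when the model is created; if a provider only fails later, on first generation, the check would need to move there.

```typescript
// Hypothetical helper, not part of @localmode/core.
type Factory<T> = { name: string; create: () => T | Promise<T> };

// Try each factory in order and return the first result that does not throw.
async function firstAvailable<T>(
  factories: Factory<T>[],
): Promise<{ name: string; value: T }> {
  const failures: string[] = [];
  for (const { name, create } of factories) {
    try {
      return { name, value: await create() };
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      failures.push(`${name}: ${reason}`);
    }
  }
  // Every candidate failed: surface all reasons in one error.
  throw new Error(`No provider available (${failures.join('; ')})`);
}
```

With the providers from this section, each entry would look like { name: 'wllama', create: () => wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M') }, ordered from fastest to most compatible.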

For model catalogs, provider-specific configuration, and detailed setup, see the WebLLM, wllama, Transformers Text Generation, and LiteRT guides.

streamText()

Stream text generation for real-time responses:

import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');

const result = await streamText({
  model,
  prompt: 'Explain quantum computing in simple terms.',
});

for await (const chunk of result.stream) {
  process.stdout.write(chunk.text);
}

With System Prompt

const result = await streamText({
  model,
  system: 'You are a helpful coding assistant. Be concise.',
  prompt: 'Write a function to reverse a string in TypeScript.',
});

Options

interface StreamTextOptions {
  model: LanguageModel;
  prompt: string;
  system?: string;
  maxTokens?: number;
  temperature?: number;
  topP?: number;
  stopSequences?: string[];
  abortSignal?: AbortSignal;
}

Stream Properties

const result = await streamText({ model, prompt: 'Hello' });

// Iterate over text chunks
for await (const chunk of result.stream) {
  console.log(chunk.text);      // The generated text piece
  console.log(chunk.done);      // Whether this is the final chunk
}

// Get full text after streaming
const fullText = await result.text;

// Get usage statistics
const usage = await result.usage;
console.log('Tokens:', usage.totalTokens);
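
To make the chunk shape concrete without loading a model, the sketch below consumes a stand-in stream and reports the accumulated text after every chunk, which is the usual pattern for updating a chat bubble. fakeStream and collect are hypothetical names; only the { text, done } chunk shape is taken from the example above.

```typescript
// Chunk shape as described above, declared locally so the sketch runs standalone.
interface Chunk {
  text: string;
  done: boolean;
}

// Hypothetical stand-in for result.stream.
async function* fakeStream(pieces: string[]): AsyncGenerator<Chunk> {
  for (let i = 0; i < pieces.length; i++) {
    yield { text: pieces[i], done: i === pieces.length - 1 };
  }
}

// Accumulate chunks and report the running text after each one.
async function collect(
  stream: AsyncIterable<Chunk>,
  onUpdate: (textSoFar: string) => void,
): Promise<string> {
  let text = '';
  for await (const chunk of stream) {
    text += chunk.text;
    onUpdate(text);
    if (chunk.done) break;
  }
  return text;
}
```

In a UI, onUpdate would typically set the visible message text, so each chunk replaces the previous partial message rather than appending a new element.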

generateText()

Generate complete text without streaming:

import { generateText } from '@localmode/core';

const { text, usage } = await generateText({
  model,
  prompt: 'Write a haiku about programming.',
});

console.log(text);
console.log('Tokens used:', usage.totalTokens);

Options

interface GenerateTextOptions {
  model: LanguageModel;
  prompt: string;
  system?: string;
  maxTokens?: number;
  temperature?: number;
  topP?: number;
  stopSequences?: string[];
  abortSignal?: AbortSignal;
}

Return Value

interface GenerateTextResult {
  text: string;
  finishReason: FinishReason;
  usage: {
    inputTokens: number;
    outputTokens: number;
    totalTokens: number;
    durationMs: number;
  };
  response: {
    modelId: string;
    timestamp: Date;
  };
}
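
The usage fields are enough to derive a rough throughput figure. The helper below is a small sketch (tokensPerSecond is a hypothetical name) and assumes durationMs covers the whole generation call; if it also includes model loading, the number will understate decoding speed.

```typescript
// Mirrors the usage object in GenerateTextResult above.
interface Usage {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  durationMs: number;
}

// Output tokens generated per second; returns 0 when no duration was recorded.
function tokensPerSecond(usage: Usage): number {
  if (usage.durationMs <= 0) return 0;
  return (usage.outputTokens * 1000) / usage.durationMs;
}

const sampleUsage: Usage = { inputTokens: 10, outputTokens: 20, totalTokens: 30, durationMs: 100 };
console.log(tokensPerSecond(sampleUsage)); // 200
```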

Cancellation

Cancel generation mid-stream:

const controller = new AbortController();

// Cancel after 5 seconds
setTimeout(() => controller.abort(), 5000);

try {
  const result = await streamText({
    model,
    prompt: 'Write a long essay...',
    abortSignal: controller.signal,
  });

  for await (const chunk of result.stream) {
    process.stdout.write(chunk.text);
  }
} catch (error) {
  if (error instanceof Error && error.name === 'AbortError') {
    console.log('\nGeneration cancelled');
  }
}
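
For a user-facing Stop button, it is often useful to keep the text that arrived before the abort. The sketch below shows that pattern against a stand-in stream; abortableStream and readUntilCancelled are hypothetical, and the sketch assumes an aborted stream rejects with an error named 'AbortError', as the example above does. Whether result.text still resolves after an abort is not stated here, so the sketch accumulates the text itself.

```typescript
interface Chunk {
  text: string;
  done: boolean;
}

// Hypothetical stand-in for a model stream that honours an AbortSignal.
async function* abortableStream(pieces: string[], signal: AbortSignal): AsyncGenerator<Chunk> {
  for (let i = 0; i < pieces.length; i++) {
    if (signal.aborted) {
      const error = new Error('Generation aborted');
      error.name = 'AbortError';
      throw error;
    }
    yield { text: pieces[i], done: i === pieces.length - 1 };
  }
}

// Read until the stream ends or is cancelled, keeping the partial text.
async function readUntilCancelled(
  stream: AsyncIterable<Chunk>,
  onChunk?: (chunk: Chunk) => void,
): Promise<{ text: string; cancelled: boolean }> {
  let text = '';
  try {
    for await (const chunk of stream) {
      text += chunk.text;
      if (onChunk) onChunk(chunk);
    }
    return { text, cancelled: false };
  } catch (error) {
    if (error instanceof Error && error.name === 'AbortError') {
      return { text, cancelled: true };
    }
    throw error; // Anything else is a real failure.
  }
}
```

A Stop button would then call controller.abort() on the same AbortController whose signal was passed to the stream.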

Temperature & Sampling

Control randomness in generation:

// More deterministic (good for factual responses)
const factual = await streamText({
  model,
  prompt: 'What is 2 + 2?',
  temperature: 0.1,
});

// More creative (good for stories, brainstorming)
const creative = await streamText({
  model,
  prompt: 'Write a creative story about a robot.',
  temperature: 0.9,
});

// Nucleus sampling
const nucleus = await streamText({
  model,
  prompt: 'Continue this sentence: The future of AI is...',
  topP: 0.9,  // Consider tokens making up 90% of probability
});
| Parameter | Description | Range | Default |
| --- | --- | --- | --- |
| temperature | Randomness | 0.0 - 2.0 | 1.0 |
| topP | Nucleus sampling | 0.0 - 1.0 | 1.0 |
| maxTokens | Max generation length | 1 - model max | Model default |
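
To see what these two parameters do to a token distribution, the sketch below applies the textbook definitions of temperature scaling and nucleus (top-p) filtering to three made-up logits. It illustrates the general technique only; each provider's sampler may differ in detail, and the function names are hypothetical. It assumes temperature is greater than 0.

```typescript
// Softmax over logits divided by the temperature (temperature must be > 0).
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // Subtract the max for numerical stability.
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Keep the most likely tokens until their cumulative probability reaches topP,
// zero out the rest, then renormalize.
function nucleusFilter(probs: number[], topP: number): number[] {
  const order = probs.map((p, i) => ({ p, i })).sort((a, b) => b.p - a.p);
  const kept = probs.map(() => 0);
  let cumulative = 0;
  for (const { p, i } of order) {
    kept[i] = p;
    cumulative += p;
    if (cumulative >= topP) break;
  }
  const total = kept.reduce((a, b) => a + b, 0);
  return kept.map((p) => p / total);
}

const logits = [2, 1, 0];
const neutral = softmaxWithTemperature(logits, 1.0); // roughly [0.665, 0.245, 0.090]
const sharper = softmaxWithTemperature(logits, 0.5); // top token rises to roughly 0.867
const filtered = nucleusFilter(neutral, 0.9); // third token is dropped
console.log(neutral, sharper, filtered);
```

Lower temperatures concentrate probability on the top token, while topP removes the low-probability tail before sampling.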

Stop Sequences

Stop generation at specific patterns:

const result = await streamText({
  model,
  prompt: 'List three fruits:\n1.',
  stopSequences: ['\n4.', '\n\n'],  // Stop before 4th item or double newline
});
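
The effect of stop sequences can be pictured as cutting the output at the earliest match. The sketch below applies that rule to a finished string; truncateAtStop is a hypothetical helper, and it assumes the matched sequence itself is excluded from the result, which this page does not confirm for every provider.

```typescript
// Cut text at the earliest occurrence of any stop sequence.
function truncateAtStop(text: string, stopSequences: string[]): string {
  let cut = text.length;
  for (const stop of stopSequences) {
    if (stop === '') continue; // An empty stop would match at index 0.
    const index = text.indexOf(stop);
    if (index !== -1 && index < cut) cut = index;
  }
  return text.slice(0, cut);
}

const raw = ' apple\n2. pear\n3. fig\n4. kiwi';
console.log(JSON.stringify(truncateAtStop(raw, ['\n4.', '\n\n'])));
// " apple\n2. pear\n3. fig"
```

The same helper can serve as a post-processing safety net if a model runs past the intended end of its answer.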

Chat-Style Prompts

Build chat applications:

function buildPrompt(messages: Array<{ role: string; content: string }>) {
  return messages
    .map((m) => `${m.role}: ${m.content}`)
    .join('\n') + '\nassistant:';
}

const messages = [
  { role: 'user', content: 'Hello!' },
  { role: 'assistant', content: 'Hi! How can I help you today?' },
  { role: 'user', content: 'What is TypeScript?' },
];

const result = await streamText({
  model,
  system: 'You are a helpful programming assistant.',
  prompt: buildPrompt(messages),
  stopSequences: ['user:', '\n\n'],
});
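
For the three messages above, the flattened prompt is a plain transcript that ends with an open assistant turn. The snippet below repeats buildPrompt so it runs on its own and adds appendReply, a hypothetical helper for recording the model's answer before the next turn. Because the model is continuing a transcript, it may otherwise go on to write the next user line itself, which is what the 'user:' stop sequence guards against.

```typescript
type ChatMessage = { role: string; content: string };

// Same flattening as above, repeated so the snippet runs standalone.
function buildPrompt(messages: ChatMessage[]): string {
  return messages.map((m) => `${m.role}: ${m.content}`).join('\n') + '\nassistant:';
}

// Hypothetical helper: store the reply so the next prompt includes it.
function appendReply(messages: ChatMessage[], reply: string): ChatMessage[] {
  return [...messages, { role: 'assistant', content: reply.trim() }];
}

const sampleMessages: ChatMessage[] = [
  { role: 'user', content: 'Hello!' },
  { role: 'assistant', content: 'Hi! How can I help you today?' },
  { role: 'user', content: 'What is TypeScript?' },
];

console.log(buildPrompt(sampleMessages));
// user: Hello!
// assistant: Hi! How can I help you today?
// user: What is TypeScript?
// assistant:
```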

RAG Integration

Combine with retrieval:

import { semanticSearch, streamText } from '@localmode/core';

async function ragQuery(question: string) {
  // Retrieve context
  const results = await semanticSearch({ db, model: embeddingModel, query: question, k: 3 });
  const context = results.map((r) => r.metadata.text).join('\n\n');

  // Generate answer
  const result = await streamText({
    model: llm,
    system: 'Answer based only on the provided context.',
    prompt: `Context:\n${context}\n\nQuestion: ${question}\n\nAnswer:`,
  });

  return result;
}
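
Small local models have limited context windows, so it can help to cap how much retrieved text goes into the prompt. The sketch below builds the same prompt template as ragQuery but stops adding chunks once a character budget is reached. buildRagPrompt and maxContextChars are hypothetical, and character count is only a rough proxy for tokens.

```typescript
// Build the RAG prompt from as many leading chunks as fit within the budget.
function buildRagPrompt(chunks: string[], question: string, maxContextChars: number): string {
  const selected: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    // Every chunk after the first also costs the 2-character '\n\n' separator.
    const cost = chunk.length + (selected.length > 0 ? 2 : 0);
    if (used + cost > maxContextChars) break;
    selected.push(chunk);
    used += cost;
  }
  const context = selected.join('\n\n');
  return `Context:\n${context}\n\nQuestion: ${question}\n\nAnswer:`;
}
```

Because search results arrive ranked by relevance, dropping chunks from the end keeps the strongest matches in the prompt.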

Implementing Custom Models

Create your own language model:

import type { LanguageModel, DoGenerateOptions, DoStreamOptions, StreamChunk } from '@localmode/core';

class MyLanguageModel implements LanguageModel {
  readonly modelId = 'custom:my-model';
  readonly provider = 'custom';
  readonly contextLength = 4096;

  async doGenerate(options: DoGenerateOptions) {
    // Your generation logic
    return {
      text: 'Generated text...',
      finishReason: 'stop' as const,
      usage: { inputTokens: 10, outputTokens: 20, totalTokens: 30, durationMs: 100 },
      response: { modelId: this.modelId, timestamp: new Date() },
    };
  }

  async *doStream(options: DoStreamOptions): AsyncIterable<StreamChunk> {
    yield { text: 'Hello', done: false };
    yield { text: ' world!', done: true, finishReason: 'stop' };
  }
}
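
A custom model is also a convenient test double. The sketch below is a canned-reply model that streams one word per chunk and implements doGenerate by draining its own stream, so the two paths cannot disagree. The types are local stand-ins reduced to the fields used above, not the real @localmode/core definitions, and counting one token per word is purely illustrative.

```typescript
// Local stand-in for the StreamChunk type, limited to the fields shown above.
interface StreamChunk {
  text: string;
  done: boolean;
  finishReason?: 'stop';
}

class CannedModel {
  readonly modelId = 'custom:canned';
  readonly provider = 'custom';
  readonly contextLength = 4096;
  private readonly reply: string;

  constructor(reply: string) {
    this.reply = reply;
  }

  // Stream the canned reply one word at a time.
  async *doStream(): AsyncGenerator<StreamChunk> {
    const words = this.reply.split(' ');
    for (let i = 0; i < words.length; i++) {
      const chunk: StreamChunk = {
        text: (i === 0 ? '' : ' ') + words[i],
        done: i === words.length - 1,
      };
      if (chunk.done) chunk.finishReason = 'stop';
      yield chunk;
    }
  }

  // Non-streaming path: drain the stream and assemble the result.
  async doGenerate() {
    const started = Date.now();
    let text = '';
    let outputTokens = 0; // One "token" per chunk, for illustration only.
    for await (const chunk of this.doStream()) {
      text += chunk.text;
      outputTokens++;
    }
    return {
      text,
      finishReason: 'stop' as const,
      usage: { inputTokens: 0, outputTokens, totalTokens: outputTokens, durationMs: Date.now() - started },
      response: { modelId: this.modelId, timestamp: new Date() },
    };
  }
}
```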

Vision & Audio (Multimodal Input)

Send images and audio alongside text to multimodal language models. Content parts use a discriminated union — TextPart | ImagePart | AudioPart — for type-safe multimodal messages.

Content Part Types

import type { ContentPart, TextPart, ImagePart, AudioPart } from '@localmode/core';

// Text part
const text: TextPart = { type: 'text', text: 'What is in this image?' };

// Image part (base64-encoded, no data: prefix)
const image: ImagePart = {
  type: 'image',
  data: 'iVBORw0KGgo...', // base64
  mimeType: 'image/png', // 'iVBORw0KGgo' is the base64 form of the PNG signature
};

// Audio part (base64-encoded, no data: prefix)
const audio: AudioPart = {
  type: 'audio',
  data: 'UklGRiQAAAA...', // base64
  mimeType: 'audio/wav',
};

Vision input is supported by Phi 3.5 Vision (webllm) and Qwen3.5 (transformers).

Sending Images with Messages

import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const model = webllm.languageModel('Phi-3.5-vision-instruct-q4f16_1-MLC');

const result = await streamText({
  model,
  prompt: '',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this image in detail.' },
      { type: 'image', data: base64ImageData, mimeType: 'image/jpeg' },
    ],
  }],
});
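
The example above assumes base64ImageData already holds raw base64 with no data: prefix. The sketch below shows one way to produce that string from bytes or from a data URL. The helper names are hypothetical; btoa is the standard global available in browsers and recent Node versions.

```typescript
// Encode raw bytes as base64. Works in slices so String.fromCharCode
// is never called with an oversized argument list.
function bytesToBase64(bytes: Uint8Array): string {
  let binary = '';
  const sliceSize = 0x8000;
  for (let i = 0; i < bytes.length; i += sliceSize) {
    const slice = Array.from(bytes.subarray(i, i + sliceSize));
    binary += String.fromCharCode.apply(null, slice);
  }
  return btoa(binary);
}

// Drop a leading "data:<mime>;base64," header if present.
function stripDataUrlPrefix(dataUrl: string): string {
  const comma = dataUrl.indexOf(',');
  return dataUrl.startsWith('data:') && comma !== -1 ? dataUrl.slice(comma + 1) : dataUrl;
}

// Assemble an image content part in the shape shown above.
function toImagePart(bytes: Uint8Array, mimeType: string) {
  return { type: 'image' as const, data: bytesToBase64(bytes), mimeType };
}

// The first four bytes of any PNG file.
console.log(bytesToBase64(new Uint8Array([0x89, 0x50, 0x4e, 0x47]))); // iVBORw==
```

In a browser, the bytes for a user-selected File can be read with new Uint8Array(await file.arrayBuffer()).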

Checking Vision Support

if (model.supportsVision) {
  // Show image upload UI
}

Content Utilities

import { normalizeContent, getTextContent } from '@localmode/core';

// Convert string → ContentPart[]
normalizeContent('Hello');
// [{ type: 'text', text: 'Hello' }]

// Extract text from mixed content
getTextContent([
  { type: 'text', text: 'Describe' },
  { type: 'image', data: '...', mimeType: 'image/png' },
]);
// 'Describe'

Supported Vision Models

| Provider | Model | Size | Notes |
| --- | --- | --- | --- |
| WebLLM | Phi 3.5 Vision | 2.4GB | WebGPU required |
| Transformers (ONNX) | Qwen3.5 0.8B | ~500MB | Experimental, WebGPU recommended |
| Transformers (ONNX) | Qwen3.5 2B | ~1.5GB | Experimental, WebGPU recommended |
| Transformers (ONNX) | Qwen3.5 4B | ~2.5GB | Experimental, WebGPU required |

Best Practices

Generation Tips

  1. Stream for UX — Always use streamText() for user-facing apps
  2. Set max tokens — Prevent runaway generation
  3. Use system prompts — Guide model behavior consistently
  4. Handle errors — Wrap generation in try-catch
  5. Provide cancellation — Let users abort long generations

Next Steps

Showcase Apps

| App | Description | Links |
| --- | --- | --- |
| LLM Chat | Stream text generation with multiple LLM backends and vision (image input) | Demo · Source |
| PDF Search | Generate answers from PDF context with streaming | Demo · Source |
| LangChain RAG | Generate answers in a retrieval-augmented pipeline | Demo · Source |
| Data Extractor | Extract structured data with generateObject() | Demo · Source |
| Research Agent | Multi-step reasoning with tool-augmented generation | Demo · Source |