
Text Generation

Generate and stream text with language models. Supports multimodal vision input (images).

All three LLM providers use the same streamText() and generateText() functions from @localmode/core, so the examples below apply to any of them.

See it in action

Try LLM Chat and Research Agent for working demos of these APIs.

Need structured JSON output instead of free text? See the Structured Output guide for generateObject() and streamObject().

Providers

LocalMode ships three LLM providers. All implement the same LanguageModel interface, so you can swap them without changing application code.

import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
const result = await streamText({ model, prompt: 'Hello!' });

Fastest inference via WebGPU. 23 curated MLC-compiled models. Requires a WebGPU-capable browser.

import { streamText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
const result = await streamText({ model, prompt: 'Hello!' });

Runs llama.cpp via WebAssembly. 135K+ GGUF models from HuggingFace. Works in all modern browsers — no GPU required.

import { streamText } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');
const result = await streamText({ model, prompt: 'Hello!' });

ONNX models via Transformers.js v4 (experimental). WebGPU acceleration with automatic WASM fallback.

Provider Comparison

|                 | WebLLM                       | wllama                        | Transformers                  |
|-----------------|------------------------------|-------------------------------|-------------------------------|
| Runtime         | MLC WebGPU                   | llama.cpp WASM                | Transformers.js v4 ONNX       |
| Speed           | 60–100 tok/s                 | 5–15 tok/s                    | 40–60 tok/s                   |
| GPU Required    | Yes (WebGPU)                 | No                            | No (auto-fallback to WASM)    |
| Browser Support | Chrome/Edge 113+, Safari 18+ | All modern browsers           | Chrome/Edge 113+, Safari 18+ (WASM everywhere) |
| Model Catalog   | 23 curated MLC models        | 16 curated + 135K+ GGUF models | 14 curated ONNX models       |
| Model Format    | MLC (pre-compiled)           | GGUF (standard)               | ONNX                          |
| Best For        | Maximum speed on GPU-capable devices | Universal compatibility, huge model selection | Broad ONNX ecosystem, WebGPU + WASM flexibility |
| Status          | Stable                       | Stable                        | Experimental                  |
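The trade-offs above reduce to a simple selection rule: prefer WebLLM when WebGPU is available, otherwise fall back to the WASM-capable runtimes. A minimal sketch of that rule (the providerOrder helper is illustrative, not part of @localmode/core; in practice createProviderWithFallback() below handles the ordering for you):

```typescript
// Illustrative helper (not part of @localmode/core): rank providers by
// preference for the current environment, mirroring the table above.
type ProviderName = 'webllm' | 'transformers' | 'wllama';

function providerOrder(hasWebGPU: boolean): ProviderName[] {
  if (hasWebGPU) {
    // Fastest first: MLC WebGPU, then ONNX WebGPU, then pure WASM.
    return ['webllm', 'transformers', 'wllama'];
  }
  // No WebGPU: wllama runs everywhere; Transformers can still use its
  // WASM fallback. WebLLM is skipped entirely.
  return ['wllama', 'transformers'];
}

// In a browser, WebGPU availability can be detected with: 'gpu' in navigator
```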

Automatic Fallback

Use createProviderWithFallback() to try providers in order — fastest first, most compatible last:

import { createProviderWithFallback } from '@localmode/core';
import { webllm } from '@localmode/webllm';
import { transformers } from '@localmode/transformers';
import { wllama } from '@localmode/wllama';

const model = await createProviderWithFallback({
  providers: [
    () => webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
    () => transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX'),
    () => wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M'),
  ],
  onFallback: (error, idx) => console.warn(`Provider ${idx} failed:`, error),
});

For model catalogs, provider-specific configuration, and detailed setup, see the WebLLM, wllama, and Transformers Text Generation guides.

streamText()

Stream text generation for real-time responses:

import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');

const result = await streamText({
  model,
  prompt: 'Explain quantum computing in simple terms.',
});

for await (const chunk of result.stream) {
  process.stdout.write(chunk.text);
}

With System Prompt

const result = await streamText({
  model,
  system: 'You are a helpful coding assistant. Be concise.',
  prompt: 'Write a function to reverse a string in TypeScript.',
});

Options

interface StreamTextOptions {
  model: LanguageModel;
  prompt: string;
  system?: string;
  maxTokens?: number;
  temperature?: number;
  topP?: number;
  stopSequences?: string[];
  abortSignal?: AbortSignal;
}

Stream Properties

const result = await streamText({ model, prompt: 'Hello' });

// Iterate over text chunks
for await (const chunk of result.stream) {
  console.log(chunk.text);      // The generated text piece
  console.log(chunk.done);      // Whether this is the final chunk
}

// Get full text after streaming
const fullText = await result.text;

// Get usage statistics
const usage = await result.usage;
console.log('Tokens:', usage.totalTokens);

generateText()

Generate complete text without streaming:

import { generateText } from '@localmode/core';

const { text, usage } = await generateText({
  model,
  prompt: 'Write a haiku about programming.',
});

console.log(text);
console.log('Tokens used:', usage.totalTokens);

Options

interface GenerateTextOptions {
  model: LanguageModel;
  prompt: string;
  system?: string;
  maxTokens?: number;
  temperature?: number;
  topP?: number;
  stopSequences?: string[];
  abortSignal?: AbortSignal;
}

Return Value

interface GenerateTextResult {
  text: string;
  finishReason: FinishReason;
  usage: {
    inputTokens: number;
    outputTokens: number;
    totalTokens: number;
    durationMs: number;
  };
  response: {
    modelId: string;
    timestamp: Date;
  };
}

Cancellation

Cancel generation mid-stream:

const controller = new AbortController();

// Cancel after 5 seconds
setTimeout(() => controller.abort(), 5000);

try {
  const result = await streamText({
    model,
    prompt: 'Write a long essay...',
    abortSignal: controller.signal,
  });

  for await (const chunk of result.stream) {
    process.stdout.write(chunk.text);
  }
} catch (error) {
  if (error instanceof Error && error.name === 'AbortError') {
    console.log('\nGeneration cancelled');
  }
}

Temperature & Sampling

Control randomness in generation:

// More deterministic (good for factual responses)
const factual = await streamText({
  model,
  prompt: 'What is 2 + 2?',
  temperature: 0.1,
});

// More creative (good for stories, brainstorming)
const creative = await streamText({
  model,
  prompt: 'Write a creative story about a robot.',
  temperature: 0.9,
});

// Nucleus sampling: consider only tokens making up 90% of probability mass
const nucleus = await streamText({
  model,
  prompt: 'Continue this sentence: The future of AI is...',
  topP: 0.9,
});

| Parameter   | Description           | Range         | Default       |
|-------------|-----------------------|---------------|---------------|
| temperature | Randomness            | 0.0–2.0       | 1.0           |
| topP        | Nucleus sampling      | 0.0–1.0       | 1.0           |
| maxTokens   | Max generation length | 1–model max   | Model default |

Stop Sequences

Stop generation at specific patterns:

const result = await streamText({
  model,
  prompt: 'List three fruits:\n1.',
  stopSequences: ['\n4.', '\n\n'],  // Stop before 4th item or double newline
});
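If you also need to trim already-accumulated text client-side (for example when post-processing a saved transcript), the same cut can be applied with a small helper. A sketch — the truncateAtStop name is hypothetical, not part of @localmode/core:

```typescript
// Hypothetical helper (not part of @localmode/core): truncate text at
// the earliest occurrence of any stop sequence.
function truncateAtStop(text: string, stopSequences: string[]): string {
  let cut = text.length;
  for (const stop of stopSequences) {
    const i = text.indexOf(stop);
    if (i !== -1 && i < cut) cut = i;
  }
  return text.slice(0, cut);
}

truncateAtStop('1. apple\n2. pear\n\nextra', ['\n\n']);
// → '1. apple\n2. pear'
```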

Chat-Style Prompts

Build chat applications:

function buildPrompt(messages: Array<{ role: string; content: string }>) {
  return messages
    .map((m) => `${m.role}: ${m.content}`)
    .join('\n') + '\nassistant:';
}

const messages = [
  { role: 'user', content: 'Hello!' },
  { role: 'assistant', content: 'Hi! How can I help you today?' },
  { role: 'user', content: 'What is TypeScript?' },
];

const result = await streamText({
  model,
  system: 'You are a helpful programming assistant.',
  prompt: buildPrompt(messages),
  stopSequences: ['user:', '\n\n'],
});

RAG Integration

Combine with retrieval:

import { semanticSearch, streamText } from '@localmode/core';

async function ragQuery(question: string) {
  // Retrieve context
  const results = await semanticSearch({ db, model: embeddingModel, query: question, k: 3 });
  const context = results.map((r) => r.metadata.text).join('\n\n');

  // Generate answer
  const result = await streamText({
    model: llm,
    system: 'Answer based only on the provided context.',
    prompt: `Context:\n${context}\n\nQuestion: ${question}\n\nAnswer:`,
  });

  return result;
}

Implementing Custom Models

Create your own language model:

import type { LanguageModel, DoGenerateOptions, DoStreamOptions, StreamChunk } from '@localmode/core';

class MyLanguageModel implements LanguageModel {
  readonly modelId = 'custom:my-model';
  readonly provider = 'custom';
  readonly contextLength = 4096;

  async doGenerate(options: DoGenerateOptions) {
    // Your generation logic
    return {
      text: 'Generated text...',
      finishReason: 'stop' as const,
      usage: { inputTokens: 10, outputTokens: 20, totalTokens: 30, durationMs: 100 },
      response: { modelId: this.modelId, timestamp: new Date() },
    };
  }

  async *doStream(options: DoStreamOptions): AsyncIterable<StreamChunk> {
    yield { text: 'Hello', done: false };
    yield { text: ' world!', done: true, finishReason: 'stop' };
  }
}
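The doStream() contract can be exercised on its own before wiring the model into an app. A self-contained sketch of consuming such a stream, where demoStream stands in for a real model's generator (the chunk shape mirrors the StreamChunk type above):

```typescript
// Self-contained sketch of the streaming contract: each chunk carries
// `text` and `done`; the final chunk also carries a finishReason.
type StreamChunk = { text: string; done: boolean; finishReason?: 'stop' };

async function* demoStream(): AsyncIterable<StreamChunk> {
  yield { text: 'Hello', done: false };
  yield { text: ' world!', done: true, finishReason: 'stop' };
}

// Accumulate chunks the same way a UI would append them.
async function collect(stream: AsyncIterable<StreamChunk>): Promise<string> {
  let out = '';
  for await (const chunk of stream) out += chunk.text;
  return out;
}

// collect(demoStream()) resolves to 'Hello world!'
```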

Vision (Multimodal Input)

Send images alongside text to vision-capable language models. Content parts use a discriminated union — TextPart | ImagePart — for type-safe multimodal messages.

Content Part Types

import type { ContentPart, TextPart, ImagePart } from '@localmode/core';

// Text part
const text: TextPart = { type: 'text', text: 'What is in this image?' };

// Image part (base64-encoded, no data: prefix)
const image: ImagePart = {
  type: 'image',
  data: 'iVBORw0KGgo...', // base64
  mimeType: 'image/jpeg',
};
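Image data often arrives as a data URL (from FileReader.readAsDataURL() or canvas.toDataURL()), so the base64 payload must be separated from its data: prefix before it goes into an ImagePart. A hedged sketch — the dataUrlToImagePart helper is illustrative, not part of @localmode/core:

```typescript
// Local mirror of the ImagePart shape above, to keep this sketch self-contained.
type ImagePart = { type: 'image'; data: string; mimeType: string };

// Illustrative helper (not part of @localmode/core): split a data URL into
// the MIME type and the raw base64 payload that ImagePart expects.
function dataUrlToImagePart(dataUrl: string): ImagePart {
  const match = /^data:([^;,]+);base64,(.+)$/.exec(dataUrl);
  if (!match) throw new Error('Expected a base64-encoded data URL');
  return { type: 'image', mimeType: match[1], data: match[2] };
}

dataUrlToImagePart('data:image/png;base64,iVBORw0KGgo=');
// → { type: 'image', mimeType: 'image/png', data: 'iVBORw0KGgo=' }
```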

Sending Images with Messages

import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const model = webllm.languageModel('Phi-3.5-vision-instruct-q4f16_1-MLC');

const result = await streamText({
  model,
  prompt: '',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this image in detail.' },
      { type: 'image', data: base64ImageData, mimeType: 'image/jpeg' },
    ],
  }],
});

Checking Vision Support

if (model.supportsVision) {
  // Show image upload UI
}

Content Utilities

import { normalizeContent, getTextContent } from '@localmode/core';

// Convert string → ContentPart[]
normalizeContent('Hello');
// [{ type: 'text', text: 'Hello' }]

// Extract text from mixed content
getTextContent([
  { type: 'text', text: 'Describe' },
  { type: 'image', data: '...', mimeType: 'image/png' },
]);
// 'Describe'

Supported Vision Models

| Provider            | Model          | Size   | Notes                           |
|---------------------|----------------|--------|---------------------------------|
| WebLLM              | Phi 3.5 Vision | 2.4GB  | WebGPU required                 |
| Transformers (ONNX) | Qwen3.5 0.8B   | ~500MB | Experimental, WebGPU recommended |
| Transformers (ONNX) | Qwen3.5 2B     | ~1.5GB | Experimental, WebGPU recommended |
| Transformers (ONNX) | Qwen3.5 4B     | ~2.5GB | Experimental, WebGPU required   |

Best Practices

Generation Tips

  1. Stream for UX — Always use streamText() for user-facing apps
  2. Set max tokens — Prevent runaway generation
  3. Use system prompts — Guide model behavior consistently
  4. Handle errors — Wrap generation in try-catch
  5. Provide cancellation — Let users abort long generations
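
Tips 2, 4, and 5 compose naturally: cap maxTokens, wrap generation in try-catch, and wire up an abort signal with both a deadline and a user-facing cancel. A minimal sketch — the withDeadline helper is hypothetical, not part of @localmode/core:

```typescript
// Hypothetical helper (not part of @localmode/core): an AbortSignal that
// fires after `ms` milliseconds, plus a manual cancel for the user.
function withDeadline(ms: number): { signal: AbortSignal; cancel: () => void } {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  return {
    signal: controller.signal,
    cancel: () => {
      clearTimeout(timer);
      controller.abort();
    },
  };
}

// Usage sketch: pass the signal to generation and catch AbortError, e.g.
//   const { signal, cancel } = withDeadline(30_000);
//   streamText({ model, prompt, maxTokens: 512, abortSignal: signal });
//   cancelButton.onclick = cancel;
```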

Next Steps

Showcase Apps

| App            | Description                                                              | Links         |
|----------------|--------------------------------------------------------------------------|---------------|
| LLM Chat       | Stream text generation with multiple LLM backends and vision (image input) | Demo · Source |
| PDF Search     | Generate answers from PDF context with streaming                         | Demo · Source |
| LangChain RAG  | Generate answers in a retrieval-augmented pipeline                       | Demo · Source |
| Data Extractor | Extract structured data with generateObject()                            | Demo · Source |
| Research Agent | Multi-step reasoning with tool-augmented generation                      | Demo · Source |
