Using LocalMode With the Vercel AI SDK: generateText() and streamText() With Zero Cloud Calls
Drop @localmode/ai-sdk into any Vercel AI SDK project and run generateText(), streamText(), and embed() entirely in the browser. Same API, same patterns, zero network requests. This guide shows you how to swap one line and go fully local.
If you have built anything with the Vercel AI SDK, you already know the pattern: import generateText or streamText from ai, pass a model, get results. It is the same whether you use OpenAI, Anthropic, Google, or any other provider.
What if you could keep that exact pattern -- the same imports, the same function signatures, the same result shapes -- but run everything on-device? No API keys. No per-token billing. No data leaving the browser.
That is what @localmode/ai-sdk does. It bridges LocalMode's in-browser models to the AI SDK's LanguageModelV3 and EmbeddingModelV3 interfaces. You change one line, and your generateText() call goes from a network round-trip to a local WebGPU inference.
What Is @localmode/ai-sdk?
@localmode/ai-sdk is a thin adapter package. It does not contain any models itself. Instead, it wraps LocalMode model instances -- from @localmode/webllm, @localmode/transformers, @localmode/wllama, or @localmode/chrome-ai -- as AI SDK-compatible LanguageModelV3 and EmbeddingModelV3 objects.
The architecture is simple:
```
+------------------------------------------+
| AI SDK (generateText, streamText, embed) |
+--------------------+---------------------+
                     |
             @localmode/ai-sdk
          (adapter / bridge layer)
                     |
+--------------------+---------------------+
| LocalMode models (webllm, transformers)  |
|     Running entirely in the browser      |
+------------------------------------------+
```

The adapter handles the translation between LocalMode's types and the AI SDK's types: converting Float32Array embeddings to number[] arrays, mapping finish reasons, translating prompt formats, and wiring up streaming via ReadableStream. Your application code sees a standard AI SDK provider.
Installation
You need three things: the AI SDK itself, the adapter, and at least one LocalMode provider.
```shell
# The adapter + AI SDK
pnpm install @localmode/ai-sdk @localmode/core ai @ai-sdk/provider @ai-sdk/provider-utils

# Pick your model providers
pnpm install @localmode/webllm        # LLMs via WebGPU (Llama, Qwen, Phi, Gemma)
pnpm install @localmode/transformers  # Embeddings, classification, and more
```

The peer dependencies are ai (>=6.0.0), @ai-sdk/provider (>=1.0.0), @ai-sdk/provider-utils (>=3.0.0), and @localmode/core (>=1.0.0). If you are already on a recent AI SDK version, you likely have the first three installed.
Creating the Provider
The entry point is createLocalMode(). You give it a map of friendly model IDs to pre-configured LocalMode model instances:
```ts
import { createLocalMode } from '@localmode/ai-sdk';
import { webllm } from '@localmode/webllm';
import { transformers } from '@localmode/transformers';

const localmode = createLocalMode({
  models: {
    'llama': webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
    'embedder': transformers.embedding('Xenova/bge-small-en-v1.5'),
  },
});
```

The returned localmode object implements the AI SDK ProviderV3 interface. It is callable as a function and also exposes named methods:
```ts
// Both return a LanguageModelV3:
localmode('llama');
localmode.languageModel('llama');

// Returns an EmbeddingModelV3:
localmode.embeddingModel('embedder');
```

You can register as many models as you want. Mix providers freely -- a WebLLM language model alongside a Transformers.js embedding model alongside a wllama GGUF model. The adapter does not care; it checks the model interface at runtime and wraps accordingly.
Side-by-Side: OpenAI vs LocalMode
Here is the key insight. The application code is identical. Only the model line changes.
generateText()
```ts
import { generateText } from 'ai';

// --- Cloud: OpenAI (requires OPENAI_API_KEY, sends data to API) ---
import { openai } from '@ai-sdk/openai';

const { text } = await generateText({
  model: openai('gpt-4o'),
  prompt: 'Explain quantum computing in simple terms',
});

// --- Local: LocalMode (no API key, runs in the browser) ---
// (Shown together for comparison; in one module you would pick one.)
import { createLocalMode } from '@localmode/ai-sdk';
import { webllm } from '@localmode/webllm';

const localmode = createLocalMode({
  models: { 'llama': webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC') },
});

const { text } = await generateText({
  model: localmode.languageModel('llama'),
  prompt: 'Explain quantum computing in simple terms',
});
```

Same import. Same function. Same destructured { text } result. The difference is where the computation happens.
streamText()
```ts
import { streamText } from 'ai';

// --- Cloud ---
const result = streamText({
  model: openai('gpt-4o'),
  prompt: 'Write a short story about a robot learning to paint',
});

// --- Local (shown together for comparison) ---
const result = streamText({
  model: localmode.languageModel('llama'),
  prompt: 'Write a short story about a robot learning to paint',
});

// Consuming the stream is identical in both cases
for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}
```

Under the hood, when the LocalMode model supports doStream() (which WebLLM, wllama, and Transformers.js language models all do), the adapter creates a ReadableStream of LanguageModelV3StreamPart chunks, emitting text-delta events as tokens arrive. If a model only supports non-streaming generation, the adapter falls back gracefully: it calls doGenerate() and emits the full text as a single chunk.
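That fallback can be sketched in isolation. This is not the adapter's real code; the model interface and chunk shape below are simplified assumptions, showing only how a non-streaming doGenerate() can be exposed as a one-chunk stream:

```ts
// Simplified sketch of the streaming fallback (types are assumptions,
// not the adapter's actual interfaces).
type TextDelta = { type: 'text-delta'; delta: string };

interface LocalModelLike {
  doGenerate(prompt: string): Promise<{ text: string }>;
  doStream?(prompt: string): ReadableStream<TextDelta>;
}

function streamFrom(model: LocalModelLike, prompt: string): ReadableStream<TextDelta> {
  // Prefer native streaming when the model provides it
  if (model.doStream) return model.doStream(prompt);

  // Otherwise generate the full text and emit it as a single chunk
  return new ReadableStream<TextDelta>({
    async start(controller) {
      const { text } = await model.doGenerate(prompt);
      controller.enqueue({ type: 'text-delta', delta: text });
      controller.close();
    },
  });
}
```

Either way, the caller consumes the same ReadableStream; a non-streaming model just delivers everything in one read.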
embed()
```ts
import { embed } from 'ai';

// --- Cloud ---
import { openai } from '@ai-sdk/openai';

const { embedding } = await embed({
  model: openai.embedding('text-embedding-3-small'),
  value: 'What is the meaning of life?',
});

// --- Local (shown together for comparison) ---
const { embedding } = await embed({
  model: localmode.embeddingModel('embedder'),
  value: 'What is the meaning of life?',
});
```

LocalMode embedding models produce Float32Array vectors internally. The adapter converts them to number[] arrays automatically, which is what the AI SDK expects. Token usage is passed through as-is.
Swapping Providers With One Line
The real power of the AI SDK's provider abstraction shows up when you want to support both local and cloud models. You can make the swap conditional:
```ts
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { createLocalMode } from '@localmode/ai-sdk';
import { webllm } from '@localmode/webllm';

const localmode = createLocalMode({
  models: {
    'local-llm': webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
  },
});

// Toggle with a single variable
const USE_LOCAL = true;

const model = USE_LOCAL
  ? localmode.languageModel('local-llm')
  : openai('gpt-4o');

const { text } = await generateText({
  model,
  prompt: 'Summarize the key principles of privacy by design',
});
```

This pattern is useful for progressive enhancement. Start with cloud models during development, then switch to local models for production deployments where privacy or cost matters. Or let users choose: offer a "Private Mode" toggle that swaps the provider at runtime.
Chat Conversations With Messages
Both generateText() and streamText() support the system and messages parameters. The adapter translates AI SDK's prompt format into LocalMode's ChatMessage format:
```ts
import { streamText } from 'ai';

const result = streamText({
  model: localmode.languageModel('llama'),
  system: 'You are a helpful coding assistant. Be concise.',
  messages: [
    { role: 'user', content: 'What is a closure in JavaScript?' },
    { role: 'assistant', content: 'A closure is a function that...' },
    { role: 'user', content: 'Can you show me an example?' },
  ],
  maxOutputTokens: 500,
  temperature: 0.7,
});

for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}
```

The adapter extracts text parts from multimodal user messages, builds a simple prompt string from the last user message for LocalMode's doGenerate interface, and passes the full message history as ChatMessage[].
Configuration Options
All standard AI SDK call options are forwarded to the underlying LocalMode model:
| AI SDK Option | Maps To | Description |
|---|---|---|
| maxOutputTokens | maxTokens | Maximum tokens to generate |
| temperature | temperature | Sampling temperature (0-2) |
| topP | topP | Nucleus sampling threshold |
| stopSequences | stopSequences | Sequences that stop generation |
| abortSignal | abortSignal | Cancellation support |
Cancellation works end-to-end. Pass an AbortSignal to generateText() or streamText(), and it propagates through the adapter to the underlying WebGPU or WASM inference.
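With the adapter you pass abortSignal to generateText() or streamText() as usual; the underlying mechanism is the standard AbortController pattern, which can be illustrated without any AI SDK code. (The token loop below is a stand-in for an inference loop, not LocalMode's implementation.)

```ts
// Standard AbortController pattern. Real inference loops check
// signal.aborted between tokens, which is how cancellation propagates.
function runTokenLoop(signal: AbortSignal, controller: AbortController): number {
  let tokens = 0;
  for (let i = 0; i < 1000; i++) {
    if (signal.aborted) break; // stop as soon as cancellation is requested
    tokens++;
    if (tokens === 5) controller.abort(); // simulate the user pressing "stop"
  }
  return tokens;
}

const controller = new AbortController();
console.log(runTokenLoop(controller.signal, controller)); // 5
```

In a UI, you would create one AbortController per request and call controller.abort() from a "stop generating" button; the same signal object flows through the AI SDK, the adapter, and down to the WebGPU or WASM runtime.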
What Is Not Supported (Yet)
Adapter limitations
These limitations apply to the adapter layer, not to LocalMode itself. You can always use LocalMode's native API (generateText, streamText, generateObject from @localmode/core) for full functionality.
- Tool calling -- Small local models have limited tool-calling ability. The adapter returns text-only content.
- Structured output / JSON mode -- AI SDK 6's `output` setting for structured data is not wired through the adapter. Use `generateObject()` from `@localmode/core` directly for structured extraction.
- Image generation -- LocalMode does not include a generative image model, so `ImageModelV3` is not implemented.
- WebGPU required for LLMs -- WebLLM requires WebGPU support (Chrome 113+, Edge 113+, Safari 18+). Embedding models via Transformers.js work on WASM and are broadly compatible.
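Given that last constraint, it may be worth feature-detecting WebGPU before registering a WebLLM model. A minimal check using the standard navigator.gpu API (no LocalMode specifics):

```ts
// Detect WebGPU via the standard navigator.gpu API. Resolves false in
// environments without WebGPU (Node, older browsers, some GPUs).
async function hasWebGPU(): Promise<boolean> {
  const gpu = (globalThis as any).navigator?.gpu;
  if (!gpu) return false;
  try {
    // An adapter can still be null even when navigator.gpu exists
    return (await gpu.requestAdapter()) !== null;
  } catch {
    return false;
  }
}

hasWebGPU().then((ok) => console.log('WebGPU available:', ok));
```

You could then register the WebLLM language model only when the check resolves true, keeping Transformers.js embeddings (WASM) as the broadly compatible baseline.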
When To Use the Adapter vs LocalMode's Native API
Use @localmode/ai-sdk when:
- You have an existing AI SDK codebase and want to go local with minimal changes
- You want the ability to swap between cloud and local providers seamlessly
- You are building a hybrid architecture where some requests go to the cloud and others stay local
- You want to use AI SDK's `useChat()` React hook with local models
Use LocalMode's native API directly when:
- You need structured output via `generateObject()` or `streamObject()`
- You want LocalMode-specific features like semantic caching, language model middleware, inference queues, or the agent framework
- You are building a new project and do not need cloud provider compatibility
- You need access to non-LLM capabilities (classification, translation, OCR, speech-to-text) that the AI SDK adapter does not cover
Both approaches can coexist in the same project. The adapter is just a bridge; the underlying models are the same.
A Complete Example: Local Chat With Streaming
Here is a minimal but complete example that creates a streaming chat interface using only local models:
```ts
import { streamText } from 'ai';
import { createLocalMode } from '@localmode/ai-sdk';
import { webllm } from '@localmode/webllm';

// 1. Set up the provider
const localmode = createLocalMode({
  models: {
    'chat': webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
  },
});

// 2. Stream a response
const result = streamText({
  model: localmode.languageModel('chat'),
  system: 'You are a helpful assistant. Keep answers under 200 words.',
  messages: [
    { role: 'user', content: 'What are three benefits of local AI inference?' },
  ],
  maxOutputTokens: 300,
  temperature: 0.7,
});

// 3. Print tokens as they arrive
for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}

// 4. Get final usage stats
const usage = await result.usage;
console.log('\n\nTokens used:', usage);
```

No API key configured. No environment variable set. No network request made. The model downloads once (cached in the browser), and every subsequent call runs entirely on the user's GPU.
Methodology
This post is based on the actual implementation of @localmode/ai-sdk (version 1.0.0), the Vercel AI SDK documentation, and the AI SDK provider specification. All code examples use real API signatures from both the AI SDK (ai package v6+) and LocalMode packages.
Sources:
- AI SDK generateText reference
- AI SDK streamText reference
- AI SDK embed reference
- AI SDK custom provider guide
- AI SDK provider management
- LocalMode AI SDK documentation
Try it yourself
Visit localmode.ai to try 30+ AI demo apps running entirely in your browser. No sign-up, no API keys, no data leaves your device.
Read the Getting Started guide to add local AI to your application in under 5 minutes.