From OpenAI SDK to LocalMode: A Migration Guide
A practical, side-by-side migration guide for developers moving from the OpenAI Node.js SDK to LocalMode. Covers embeddings, chat completions, streaming, structured output, and batch operations -- with code comparisons, quality benchmarks, and a step-by-step checklist.
If your application calls openai.chat.completions.create() or openai.embeddings.create(), you already know the pattern: construct an options object, await the result, destructure the fields you need. LocalMode follows the same shape. The functions are named differently, the models run in the browser instead of on a remote server, and your data never leaves the device -- but the mental model is remarkably similar.
This guide walks through every common OpenAI SDK operation, shows the LocalMode equivalent side-by-side, and covers the tradeoffs you should evaluate before migrating.
Why Migrate?
Three reasons keep coming up in conversations with teams that have moved:
- Cost. OpenAI charges per token. At scale, embeddings alone can cost thousands per year. LocalMode is $0 after the initial model download.
- Privacy. Regulated industries (healthcare, finance, legal) often cannot send user data to third-party APIs. Local inference means data never leaves the browser.
- Latency. No network round-trip. Once the model is loaded, inference is instant -- typically 5-50ms for embeddings, with no cold starts.
The tradeoff is model size: cloud models like GPT-4o are far larger than anything a browser can run. For many tasks -- embeddings, classification, summarization, translation, small-model chat -- local models deliver 85-99% of cloud quality at zero marginal cost.
Installation
OpenAI requires one package and an API key. LocalMode requires a core package and at least one provider -- but no key.
# OpenAI
npm install openai
# Requires: OPENAI_API_KEY environment variable
# LocalMode
npm install @localmode/core @localmode/transformers # embeddings, classification, vision, audio
npm install @localmode/webllm # LLM chat (optional)
# Requires: nothing
Operation 1: Embeddings
Embeddings are the most common migration starting point. The API shape is nearly identical.
OpenAI
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from env
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: 'How do neural networks work?',
});
const vector = response.data[0].embedding; // number[] (1536 dims)
const tokens = response.usage.total_tokens; // token count
LocalMode
import { embed } from '@localmode/core';
import { transformers } from '@localmode/transformers';
const { embedding, usage } = await embed({
model: transformers.embedding('Xenova/bge-small-en-v1.5'),
value: 'How do neural networks work?',
});
const vector = embedding; // Float32Array (384 dims)
const tokens = usage.tokens; // token count
What changed
| Aspect | OpenAI | LocalMode |
|---|---|---|
| Function | openai.embeddings.create() | embed() |
| Model param | model: 'text-embedding-3-small' | model: transformers.embedding('Xenova/bge-small-en-v1.5') |
| Input param | input: string | value: string |
| Result vector | response.data[0].embedding (number[]) | embedding (Float32Array) |
| Dimensions | 1536 | 384 |
| Auth | API key required | None |
| Network | HTTPS request to OpenAI servers | Local inference, no network |
Quality comparison
| Metric | text-embedding-3-small | bge-small-en-v1.5 | Ratio |
|---|---|---|---|
| MTEB Average | 62.3 | 62.2 | 99.8% |
| Dimensions | 1536 | 384 | -- |
| Model size | Cloud-hosted | 33 MB (browser) | -- |
| Cost per 1M tokens | $0.02 | $0 | -- |
At 62.2 vs 62.3 on the MTEB benchmark, bge-small-en-v1.5 matches OpenAI's small embedding model almost exactly -- with 4x fewer dimensions (meaning 4x less storage) and zero ongoing cost.
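Downstream vector math is unchanged by the migration; only the element type and dimension count differ. As a minimal sketch, here is a cosine-similarity helper (the function is ours, not part of either SDK) that works identically for OpenAI's number[] and LocalMode's Float32Array:
// Works for number[] and Float32Array alike -- both are indexable numeric arrays.
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}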
Operation 2: Chat Completion
OpenAI
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'Explain quantum computing in simple terms' },
],
max_tokens: 200,
temperature: 0.7,
});
const text = response.choices[0].message.content;
const usage = response.usage; // { prompt_tokens, completion_tokens, total_tokens }
LocalMode
import { generateText } from '@localmode/core';
import { webllm } from '@localmode/webllm';
const { text, usage } = await generateText({
model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
systemPrompt: 'You are a helpful assistant.',
prompt: 'Explain quantum computing in simple terms',
maxTokens: 200,
temperature: 0.7,
});
// text: string (the generated response)
// usage: { inputTokens, outputTokens, totalTokens, durationMs }
What changed
| Aspect | OpenAI | LocalMode |
|---|---|---|
| Function | openai.chat.completions.create() | generateText() |
| System prompt | Inside messages array | Dedicated systemPrompt param |
| Result text | response.choices[0].message.content | text (top-level) |
| Token counts | prompt_tokens / completion_tokens | inputTokens / outputTokens |
| Timing | Not included | usage.durationMs built-in |
| Cost per 1M tokens (input/output) | $0.15 / $0.60 | $0 / $0 |
Model size tradeoff
GPT-4o-mini is estimated at hundreds of billions of parameters running on dedicated GPU clusters. Llama-3.2-1B runs at 712 MB in the browser via WebGPU. For simple Q&A, summarization, and extraction tasks, the local model handles most requests well. For complex reasoning or multi-step logic, cloud models still have a significant edge.
Operation 3: Streaming
Streaming is where the API divergence is smallest. Both return an async iterable you consume with for await.
OpenAI
const stream = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [{ role: 'user', content: 'Write a short story about a robot' }],
max_tokens: 500,
stream: true,
});
for await (const chunk of stream) {
const text = chunk.choices[0]?.delta?.content ?? '';
process.stdout.write(text);
}
LocalMode
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';
const result = await streamText({
model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
prompt: 'Write a short story about a robot',
maxTokens: 500,
});
for await (const chunk of result.stream) {
process.stdout.write(chunk.text);
}
// Bonus: await the full text and usage after the stream completes
const fullText = await result.text;
const usage = await result.usage;
What stayed the same
Both APIs use for await over an async iterable. Both yield incremental text deltas. The overall pattern -- start a stream, consume chunks, render in real time -- is identical.
What changed
OpenAI puts the text delta at chunk.choices[0].delta.content. LocalMode puts it at chunk.text. LocalMode also provides result.text and result.usage as promises that resolve when the stream finishes -- a convenience OpenAI does not offer natively.
Operation 4: Structured Output (JSON Mode)
OpenAI offers response_format: { type: 'json_object' } and the newer Structured Outputs with response_format: { type: 'json_schema', json_schema: ... }. LocalMode uses generateObject() with a Zod schema.
OpenAI
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{ role: 'system', content: 'Extract contact info. Respond in JSON with name and email fields.' },
{ role: 'user', content: 'Hi, I\'m Sarah at sarah@acme.co' },
],
response_format: { type: 'json_object' },
});
const data = JSON.parse(response.choices[0].message.content);
// data: { name: "Sarah", email: "sarah@acme.co" } - hopefully
LocalMode
import { generateObject, jsonSchema } from '@localmode/core';
import { webllm } from '@localmode/webllm';
import { z } from 'zod';
const { object } = await generateObject({
model: webllm.languageModel('Qwen3-1.7B-q4f16_1-MLC'),
schema: jsonSchema(z.object({
name: z.string(),
email: z.string().email(),
})),
prompt: 'Extract contact info: "Hi, I\'m Sarah at sarah@acme.co"',
});
// object: { name: "Sarah", email: "sarah@acme.co" } - validated by Zod
What changed
| Aspect | OpenAI | LocalMode |
|---|---|---|
| Schema definition | Prompt engineering or json_schema param | Zod schema via jsonSchema() |
| Validation | Manual JSON.parse() + your own checks | Automatic Zod validation with retry |
| Retry on failure | Manual | Built-in (up to maxRetries attempts with self-correction) |
| Type safety | None (returns string) | Full TypeScript inference from Zod schema |
LocalMode's generateObject() is arguably stricter: it parses the model's output, validates against your Zod schema, and retries with error feedback if validation fails. OpenAI's JSON mode guarantees valid JSON but not schema conformance.
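The type-safety difference is worth spelling out. Because the schema is a Zod object, TypeScript can infer the result type statically. A short sketch (the Contact alias is ours, for illustration):
import { z } from 'zod';

const contactSchema = z.object({
  name: z.string(),
  email: z.string().email(),
});

// Static type derived from the schema -- no manual interface to keep in sync.
type Contact = z.infer<typeof contactSchema>; // { name: string; email: string }

// With OpenAI's JSON mode you would cast instead, and the cast can silently lie:
// const data = JSON.parse(content) as Contact;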
Operation 5: Batch Embeddings
Embedding many documents at once is critical for RAG ingestion. Both APIs support batching, but the mechanics differ.
OpenAI
const response = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: [
'Machine learning is a subset of AI.',
'Neural networks are inspired by biology.',
'Deep learning uses multiple layers.',
],
});
const vectors = response.data.map((d) => d.embedding);
// vectors: number[][] (3 x 1536)
LocalMode
import { embedMany } from '@localmode/core';
import { transformers } from '@localmode/transformers';
const { embeddings, usage } = await embedMany({
model: transformers.embedding('Xenova/bge-small-en-v1.5'),
values: [
'Machine learning is a subset of AI.',
'Neural networks are inspired by biology.',
'Deep learning uses multiple layers.',
],
});
// embeddings: Float32Array[] (3 x 384)
For large-scale ingestion with progress tracking, LocalMode also offers streamEmbedMany():
import { streamEmbedMany } from '@localmode/core';
for await (const { embedding, index } of streamEmbedMany({
model: transformers.embedding('Xenova/bge-small-en-v1.5'),
values: thousandsOfDocuments,
batchSize: 32,
onBatch: ({ index, count, total }) => {
console.log(`Progress: ${index + count}/${total}`);
},
})) {
await db.add({ id: `doc-${index}`, vector: embedding });
}
OpenAI has no equivalent streaming embeddings API. For large batches, you must chunk requests manually or use the Batch API (which has a 24-hour completion window).
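For comparison, a manual chunking loop is roughly what the OpenAI side requires for a large corpus. A sketch reusing the thousandsOfDocuments and db from the example above -- the chunk size and loop structure are our assumptions, not an official recipe:
// Chunked ingestion against the OpenAI API (no progress callback exists).
const CHUNK_SIZE = 512;
for (let i = 0; i < thousandsOfDocuments.length; i += CHUNK_SIZE) {
  const chunk = thousandsOfDocuments.slice(i, i + CHUNK_SIZE);
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunk,
  });
  // Each data entry carries its index within this request's input array.
  for (const d of response.data) {
    await db.add({ id: `doc-${i + d.index}`, vector: d.embedding });
  }
  console.log(`Progress: ${Math.min(i + CHUNK_SIZE, thousandsOfDocuments.length)}/${thousandsOfDocuments.length}`);
}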
The Minimum-Change Migration: AI SDK Adapter
If you are already using the Vercel AI SDK with @ai-sdk/openai, LocalMode offers a drop-in adapter that lets you swap providers by changing one import. Your generateText(), streamText(), and embed() calls stay the same.
Before (OpenAI via AI SDK)
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
const { text } = await generateText({
model: openai('gpt-4o-mini'),
prompt: 'Explain quantum computing',
});
After (LocalMode via AI SDK)
import { generateText } from 'ai';
import { createLocalMode } from '@localmode/ai-sdk';
import { webllm } from '@localmode/webllm';
const localmode = createLocalMode({
models: {
'llama': webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
},
});
const { text } = await generateText({
model: localmode.languageModel('llama'),
prompt: 'Explain quantum computing',
});
The generateText call is identical. Only the model reference changed. This makes it straightforward to run local models in development or privacy-sensitive contexts and cloud models in production -- or vice versa.
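One pattern this enables: choose the provider at startup. A sketch, assuming the localmode registry created above and using the presence of OPENAI_API_KEY as the switch (the selection logic is ours):
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

// Fall back to the cloud only when a key is configured; otherwise stay local.
// `localmode` is the registry from createLocalMode() above.
const model = process.env.OPENAI_API_KEY
  ? openai('gpt-4o-mini')
  : localmode.languageModel('llama');

const { text } = await generateText({ model, prompt: 'Explain quantum computing' });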
What Stays the Same
If you are comfortable with the OpenAI SDK, most of your intuition carries over:
- Options-object pattern. Both APIs use a single options object: { model, prompt, maxTokens, temperature }.
- Async/await. Every operation returns a promise. Streaming returns an async iterable.
- Structured results. Both return objects with text/embedding data plus usage metadata.
- Cancellation. OpenAI supports signal via the second argument. LocalMode supports abortSignal in the options object (see the sketch after this list).
- Retry logic. Both handle transient failures with retries (OpenAI in the SDK client, LocalMode via maxRetries).
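A minimal cancellation sketch for both sides, reusing the streaming prompt from earlier and assuming abortSignal behaves like a standard AbortSignal:
const controller = new AbortController();
setTimeout(() => controller.abort(), 5_000); // give up after 5 seconds

// OpenAI: the signal goes in the second (request options) argument.
await openai.chat.completions.create(
  {
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: 'Write a short story about a robot' }],
  },
  { signal: controller.signal },
);

// LocalMode: abortSignal lives in the options object itself.
await generateText({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
  prompt: 'Write a short story about a robot',
  abortSignal: controller.signal,
});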
What Changes
| Concern | OpenAI SDK | LocalMode |
|---|---|---|
| Authentication | API key in env or constructor | None required |
| Network dependency | Every call hits OpenAI servers | No network after model download |
| First-load cost | None (cloud-hosted) | Model download (33 MB for embeddings, 712 MB+ for LLMs) |
| Model range | GPT-4o, o1, DALL-E, Whisper, etc. | 60+ browser models across 18 categories |
| Max model quality | State-of-the-art (GPT-4o, o1) | Good for tasks, limited for complex reasoning |
| Pricing model | Pay per token | Free after download |
| Data residency | OpenAI servers (US) | User's browser (device-local) |
| Offline support | No | Yes, after initial model cache |
| Vector format | number[] | Float32Array (more memory-efficient) |
Addressing Common Concerns
"Are local models good enough?"
For embeddings, yes -- bge-small-en-v1.5 scores 99.8% of OpenAI's text-embedding-3-small on MTEB. For classification, NER, summarization, and translation, local models deliver 85-98% of cloud quality. For LLM chat, it depends on the task: simple Q&A and extraction work well; multi-step reasoning still favors larger cloud models. See our full benchmark post for detailed comparisons across 18 model categories.
"What about the first-load download?"
Embedding models are 33 MB. The smallest LLMs start at 78 MB (SmolLM2-135M). After the first download, models are cached in IndexedDB and load instantly on subsequent visits. Use preloadModel() to download during onboarding, or show a progress indicator on first use.
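A hypothetical preloading sketch -- the exact signature of preloadModel() is not shown in this guide, so we assume it takes a model reference and an onProgress callback; verify against the LocalMode API reference before copying:
import { preloadModel } from '@localmode/core'; // assumed export location
import { webllm } from '@localmode/webllm';

// Assumed shape: warm the model cache during onboarding with a progress bar.
await preloadModel(webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'), {
  onProgress: ({ loaded, total }) => {
    updateProgressBar(loaded / total); // your UI hook, not a LocalMode API
  },
});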
"Which browsers are supported?"
Embedding, classification, and most non-LLM models work in all modern browsers via WebAssembly (Chrome 80+, Firefox 75+, Safari 14+, Edge 80+). LLM inference via WebLLM requires WebGPU (Chrome 113+, Edge 113+, Safari 18+). For broader LLM compatibility, @localmode/wllama runs GGUF models via llama.cpp compiled to WASM -- no WebGPU needed.
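You can feature-detect WebGPU at runtime and pick a provider accordingly. A minimal sketch -- the branch targets are the packages named above, and the selection logic is ours:
// navigator.gpu is only defined in WebGPU-capable browsers.
const hasWebGPU = typeof navigator !== 'undefined' && 'gpu' in navigator;

// WebGPU available: GPU-accelerated LLMs via @localmode/webllm.
// Otherwise: llama.cpp compiled to WASM via @localmode/wllama.
const llmBackend = hasWebGPU ? '@localmode/webllm' : '@localmode/wllama';
console.log(`Using ${llmBackend} for LLM inference`);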
"Can I use both local and cloud models?"
Absolutely. Many teams use local models for privacy-sensitive operations (PII processing, on-device search) and cloud models for tasks that need maximum quality. The AI SDK adapter makes switching between providers a one-line change.
Migration Checklist
Use this checklist to plan your migration:
- Identify operations. List every openai.embeddings.create(), openai.chat.completions.create(), and structured output call in your codebase.
- Map models. For each OpenAI model, choose a LocalMode equivalent:
  - text-embedding-3-small -> transformers.embedding('Xenova/bge-small-en-v1.5')
  - gpt-4o-mini (simple tasks) -> webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC')
  - gpt-4o-mini (structured output) -> webllm.languageModel('Qwen3-1.7B-q4f16_1-MLC')
- Install packages. npm install @localmode/core @localmode/transformers @localmode/webllm
- Configure bundler. Add webpack aliases for Next.js; exclude @huggingface/transformers from Vite's optimizeDeps.
- Replace embedding calls. openai.embeddings.create({ model, input }) becomes embed({ model, value }) or embedMany({ model, values }).
- Replace chat calls. openai.chat.completions.create({ model, messages }) becomes generateText({ model, prompt, systemPrompt }) or streamText().
- Replace structured output. response_format: { type: 'json_object' } becomes generateObject({ model, schema: jsonSchema(zodSchema), prompt }).
- Update vector handling. If your code assumes number[] vectors, convert with Array.from(embedding) or update downstream code to accept Float32Array (a serialization sketch follows this checklist).
- Add model preloading. Call model factories during app initialization so models are warm when the user first interacts.
- Test offline. Disconnect from the network after initial load and verify all operations still work.
- Evaluate quality. Run your existing test suite or evaluation set against both providers and compare results on your actual data.
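One concrete gotcha behind the vector-handling item: Float32Array does not survive JSON.stringify the way number[] does -- it serializes as an object keyed by index. A sketch of a safe round-trip, reusing the embedding result from Operation 1:
// Float32Array serializes as {"0":0.1,"1":-0.2,...} under JSON.stringify,
// so convert to a plain array before storing vectors as JSON.
const serialized = JSON.stringify(Array.from(embedding));

// Restore the compact typed-array form after parsing.
const restored = new Float32Array(JSON.parse(serialized) as number[]);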
Cost Projection
For a concrete example, consider an application that processes 10 million embedding tokens and 2 million LLM tokens per month:
| Operation | OpenAI (monthly) | LocalMode (monthly) |
|---|---|---|
| Embeddings (10M tokens) | $0.20 | $0 |
| LLM input (1M tokens) | $0.15 | $0 |
| LLM output (1M tokens) | $0.60 | $0 |
| Total | $0.95 | $0 |
At small scale, the savings are modest. At 1 billion embedding tokens per month (common for search products), embeddings alone come to $20/month on OpenAI versus $0 locally -- with zero privacy risk and zero vendor dependency.
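To project your own numbers, the arithmetic is just token volume times the per-million rate. A sketch using the prices cited in this guide (the helper is ours):
// Monthly OpenAI cost from token volumes, at the rates cited above.
const PRICE_PER_MILLION = {
  embedding: 0.02, // text-embedding-3-small
  llmInput: 0.15,  // gpt-4o-mini input
  llmOutput: 0.6,  // gpt-4o-mini output
};

function monthlyCostUSD(embeddingTokens: number, inputTokens: number, outputTokens: number): number {
  return (
    (embeddingTokens / 1e6) * PRICE_PER_MILLION.embedding +
    (inputTokens / 1e6) * PRICE_PER_MILLION.llmInput +
    (outputTokens / 1e6) * PRICE_PER_MILLION.llmOutput
  );
}

monthlyCostUSD(10e6, 1e6, 1e6); // => 0.95, matching the table above
monthlyCostUSD(1e9, 0, 0);      // => 20 -- embeddings alone at 1B tokens/month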
Methodology
This guide references the following sources for API signatures, pricing, and benchmarks:
- OpenAI API Reference -- official SDK documentation for chat.completions.create() and embeddings.create()
- OpenAI API Pricing -- token pricing for gpt-4o-mini ($0.15/$0.60 per 1M tokens) and text-embedding-3-small ($0.02 per 1M tokens)
- OpenAI Node.js SDK -- official JavaScript/TypeScript library
- MTEB Leaderboard -- Massive Text Embedding Benchmark for quality comparisons
- BAAI/bge-small-en-v1.5 -- model card with benchmark scores
- Vercel AI SDK Documentation -- generateText(), streamText(), and embed() reference
- LocalMode API signatures verified against packages/core/src/ source code (commit 66a6bf4)
Try it yourself
Visit localmode.ai to try 30+ AI demo apps running entirely in your browser. No sign-up, no API keys, no data leaves your device.
Read the Getting Started guide to add local AI to your application in under 5 minutes.