
From OpenAI to LocalMode: A Complete Migration Checklist

A ten-phase planning document for teams migrating from cloud AI APIs to local browser inference. Covers auditing current usage, mapping models, benchmarking quality, importing vectors from Pinecone and ChromaDB, handling dimension changes, calibrating thresholds, setting up fallbacks, and rolling out gradually.

LocalMode

Moving from OpenAI (or any cloud AI provider) to local browser inference is not a flip-the-switch operation. It is a systematic process that touches your embedding pipeline, your vector database, your similarity thresholds, and your monitoring stack. Done well, the migration eliminates per-request costs, removes network latency, and guarantees that user data never leaves the device. Done carelessly, it produces silent quality regressions that are hard to debug after the fact.

This checklist is a planning document for engineering teams that have already decided to migrate. It is organized into ten phases, each with concrete steps, code snippets against the real LocalMode API, and decision points. Work through them in order.


Phase 1 - Audit Your Current AI Usage

Before writing any migration code, build a complete inventory of every cloud AI call your application makes.

What to catalog:

| Dimension | Example |
| --- | --- |
| API endpoint | POST /v1/embeddings, POST /v1/chat/completions |
| Model ID | text-embedding-3-small, gpt-4o-mini |
| Call volume | 2.4M embedding calls/month, 180K chat completions/month |
| Average latency | 120ms embeddings, 1.8s chat completions |
| Monthly cost | Embedding: ~$48/mo, Chat: ~$270/mo |
| Output dimensions | 1536 (embeddings), N/A (chat) |
| Where vectors are stored | Pinecone, ChromaDB, self-hosted Postgres |

Cost baseline template:

| Service | Model | Monthly Calls | Token Volume | Unit Cost | Monthly Cost |
| --- | --- | --- | --- | --- | --- |
| Embeddings | text-embedding-3-small | 2,400,000 | 480M tokens | $0.02/1M tokens | $9.60 |
| Embeddings | text-embedding-3-large | 600,000 | 300M tokens | $0.065/1M tokens | $19.50 |
| Chat | gpt-4o-mini | 180,000 | 90M in + 45M out | $0.15/$0.60 per 1M | $40.50 |
| Chat | gpt-4o | 12,000 | 24M in + 12M out | $2.50/$10.00 per 1M | $180.00 |
| **Total** | | | | | **$249.60** |

Fill in your own numbers. This becomes your before measurement for the cost comparison at the end of migration.
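The arithmetic in the baseline table is simple enough to script. Here is a plain TypeScript sketch (no LocalMode dependency) that totals the example rows above -- swap in your own numbers and re-run it at the end of the migration for the before/after comparison:

```typescript
// One row of the cost baseline table. Token volumes are in millions;
// unit costs are USD per 1M tokens (input and output priced separately for chat).
interface CostRow {
  service: string;
  model: string;
  inputTokensM: number;
  outputTokensM: number;
  inputCostPer1M: number;
  outputCostPer1M: number;
}

const baseline: CostRow[] = [
  { service: 'Embeddings', model: 'text-embedding-3-small', inputTokensM: 480, outputTokensM: 0, inputCostPer1M: 0.02, outputCostPer1M: 0 },
  { service: 'Embeddings', model: 'text-embedding-3-large', inputTokensM: 300, outputTokensM: 0, inputCostPer1M: 0.065, outputCostPer1M: 0 },
  { service: 'Chat', model: 'gpt-4o-mini', inputTokensM: 90, outputTokensM: 45, inputCostPer1M: 0.15, outputCostPer1M: 0.6 },
  { service: 'Chat', model: 'gpt-4o', inputTokensM: 24, outputTokensM: 12, inputCostPer1M: 2.5, outputCostPer1M: 10 },
];

const rowCost = (r: CostRow) =>
  r.inputTokensM * r.inputCostPer1M + r.outputTokensM * r.outputCostPer1M;

const monthlyTotal = baseline.reduce((sum, r) => sum + rowCost(r), 0);
console.log(`Monthly baseline: $${monthlyTotal.toFixed(2)}`); // $249.60
```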


Phase 2 - Map Cloud APIs to Local Equivalents

Every cloud API endpoint has a local counterpart. The local model will be smaller and its output dimensions will differ, but each cloud call maps onto a LocalMode function with a similar shape.

| Cloud API | Cloud Model | LocalMode Function | Local Model (Transformers) | Dimensions |
| --- | --- | --- | --- | --- |
| Embeddings | text-embedding-3-small (1536d) | embed() | Xenova/bge-small-en-v1.5 | 384 |
| Embeddings | text-embedding-3-large (3072d) | embed() | Xenova/bge-base-en-v1.5 | 768 |
| Chat completions | gpt-4o-mini | generateText() / streamText() | WebLLM: Llama-3.2-3B-Instruct-q4f16_1-MLC | N/A |
| Chat completions | gpt-4o | generateText() / streamText() | Wllama: Qwen3-4B-q4_0.gguf | N/A |
| Classification | custom fine-tune | classify() | Xenova/distilbert-base-uncased-finetuned-sst-2-english | N/A |
| Summarization | gpt-4o-mini | summarize() | Xenova/distilbart-cnn-6-6 | N/A |
| Translation | gpt-4o-mini | translate() | Xenova/nllb-200-distilled-600M | N/A |
| Speech-to-text | whisper-1 | transcribe() | onnx-community/moonshine-base-ONNX | N/A |

Dimension mismatch is the biggest migration risk

OpenAI's text-embedding-3-small produces 1536-dimensional vectors. LocalMode's recommended bge-small-en-v1.5 produces 384-dimensional vectors. Existing vectors stored in Pinecone or ChromaDB cannot be mixed with new local vectors. Phase 5 addresses this directly.
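A cheap safeguard while both generations of vectors exist is to validate dimensions at the insertion boundary, so a stray 1536-dimensional cloud vector can never land in the 384-dimensional local store. A minimal sketch -- the `assertDimensions` helper here is ours, not part of the LocalMode API:

```typescript
// Throws if a vector's length does not match the target store's dimensions.
// Call this before every insert while old (1536d) and new (384d) vectors coexist.
function assertDimensions(vector: number[], expected: number): void {
  if (vector.length !== expected) {
    throw new Error(
      `Dimension mismatch: got ${vector.length}, store expects ${expected}. ` +
      `Did this vector come from the old cloud model?`
    );
  }
}

const localVector = new Array(384).fill(0.1);
assertDimensions(localVector, 384); // passes silently

const legacyVector = new Array(1536).fill(0.1);
let rejected = false;
try {
  assertDimensions(legacyVector, 384);
} catch {
  rejected = true; // the 1536d cloud vector is correctly refused
}
console.log(`Legacy vector rejected: ${rejected}`);
```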


Phase 3 - Benchmark Quality Before Migrating

Never migrate blind. Use evaluateModel() to run your local candidate against a labeled test set and compare scores to your cloud baseline.

import { evaluateModel, classify, f1Score } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.classifier(
  'Xenova/distilbert-base-uncased-finetuned-sst-2-english'
);

const result = await evaluateModel({
  dataset: {
    inputs: ['great product', 'terrible service', 'okay experience', /* ... */],
    expected: ['positive', 'negative', 'neutral', /* ... */],
  },
  predict: async (text, signal) => {
    const { label } = await classify({ model, text, abortSignal: signal });
    return label;
  },
  metric: f1Score,
  onProgress: (completed, total) => {
    console.log(`Evaluated ${completed}/${total}`);
  },
});

console.log(`Local F1: ${result.score.toFixed(3)}`);
console.log(`Duration: ${result.durationMs}ms over ${result.datasetSize} samples`);

Decision gate: If result.score is within 5% of your cloud baseline on the same test set, proceed. If the gap is larger, try a bigger local model or revisit your model mapping in Phase 2. Available metrics include accuracy, precision, recall, f1Score (with macro/micro/weighted averaging), bleuScore, rougeScore, mrr, and ndcg.
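The 5% decision gate is easy to encode so your CI pipeline can enforce it mechanically on every evaluation run. A sketch using a relative score gap (the helper name is ours):

```typescript
// Returns true when the local score is within `tolerance` (relative) of the
// cloud baseline -- the Phase 3 decision gate.
function passesQualityGate(
  localScore: number,
  cloudBaseline: number,
  tolerance = 0.05
): boolean {
  if (cloudBaseline <= 0) throw new Error('cloud baseline must be positive');
  return (cloudBaseline - localScore) / cloudBaseline <= tolerance;
}

console.log(passesQualityGate(0.88, 0.91)); // gap ~3.3% -> true, proceed
console.log(passesQualityGate(0.82, 0.91)); // gap ~9.9% -> false, try a bigger model
```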


Phase 4 - Import Existing Vectors

If you have vectors stored in Pinecone, ChromaDB, or flat files, use importFrom() to bring them into a local VectorDB. The function auto-detects format, validates dimensions, and optionally re-embeds text-only records.

import { importFrom, createVectorDB } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const db = await createVectorDB({ name: 'migrated-docs', dimensions: 384 });
const model = transformers.embedding('Xenova/bge-small-en-v1.5');

// pineconeExportJson is a JSON string exported from Pinecone
const controller = new AbortController(); // lets the UI cancel a long import
const stats = await importFrom({
  db,
  content: pineconeExportJson,
  format: 'pinecone',           // also: 'chroma', 'csv', 'jsonl'
  model,                        // re-embeds text-only records
  batchSize: 100,
  onProgress: (p) => {
    console.log(`${p.phase}: ${p.completed}/${p.total}`);
  },
  abortSignal: controller.signal,
});

console.log(`Imported: ${stats.imported}`);
console.log(`Skipped: ${stats.skipped}`);
console.log(`Re-embedded: ${stats.reEmbedded}`);
console.log(`Format: ${stats.format}, Dimensions: ${stats.dimensions}`);

Supported formats: 'pinecone' (JSON export), 'chroma' (JSON collection export), 'csv' (RFC 4180 with vector column), and 'jsonl' (one record per line). If you omit the format option, importFrom() auto-detects it from the content structure.

Preview before importing

Use parseExternalFormat(content) to inspect record counts, vector dimensions, and format detection before committing to a full import. This is a synchronous, side-effect-free call that lets you validate the data shape.
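To build intuition for what auto-detection has to distinguish, here is a simplified heuristic in the spirit of parseExternalFormat() -- this is not the library's implementation, just the structural cues each format exposes:

```typescript
type ExternalFormat = 'pinecone' | 'chroma' | 'jsonl' | 'csv' | 'unknown';

// Simplified structural sniffing: Pinecone exports wrap records in a
// `vectors` array; Chroma collection exports carry parallel `ids` and
// `embeddings` arrays; JSONL is one JSON object per line; anything with a
// comma-separated first line falls through to CSV.
function sniffFormat(content: string): ExternalFormat {
  const trimmed = content.trim();
  if (trimmed.startsWith('{')) {
    try {
      const parsed = JSON.parse(trimmed);
      if (Array.isArray(parsed.vectors)) return 'pinecone';
      if (Array.isArray(parsed.ids) && Array.isArray(parsed.embeddings)) return 'chroma';
    } catch {
      // not a single JSON document; fall through to line-based checks
    }
  }
  const lines = trimmed.split('\n');
  if (lines.every((l) => l.trim().startsWith('{'))) return 'jsonl';
  if (lines[0]?.includes(',')) return 'csv';
  return 'unknown';
}

console.log(sniffFormat('{"vectors": [{"id": "a", "values": [0.1]}]}')); // pinecone
console.log(sniffFormat('{"id":"a","vector":[0.1]}\n{"id":"b","vector":[0.2]}')); // jsonl
```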


Phase 5 - Handle the Dimension Mismatch

This is the phase most teams underestimate. OpenAI text-embedding-3-small outputs 1536 dimensions. bge-small-en-v1.5 outputs 384 dimensions. You cannot simply insert old vectors alongside new ones.

You have two options:

Option A: Re-embed everything (recommended). Pass the model parameter to importFrom(). Records that have source text in their metadata will be re-embedded with the local model. Records that only have vectors (no text) will be skipped. This is the cleanest approach because every vector in the target database is produced by the same model.

Option B: Maintain two databases. Keep the old Pinecone-dimension database for legacy queries and build a new 384-dimension database for all new content. Route queries based on content age. This is faster to deploy but creates permanent maintenance burden.

For Option A, the re-embedding happens in batches during the 'embedding' phase of importFrom(). The onProgress callback reports which phase is active and how many records are complete. For large collections (100K+ records), consider running the import in the background with AbortSignal support so users can cancel if needed.
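If you choose Option B, the routing layer is the piece you own. A minimal sketch of age-based routing -- the cutoff date and database names here are illustrative, not prescribed by LocalMode:

```typescript
interface RoutedQuery {
  db: 'legacy-1536d' | 'local-384d';
  reason: string;
}

// Content indexed before the migration cutoff lives in the old
// cloud-dimension database; everything newer lives in the local one.
const MIGRATION_CUTOFF = new Date('2026-03-01T00:00:00Z'); // illustrative date

function routeByContentAge(indexedAt: Date): RoutedQuery {
  return indexedAt < MIGRATION_CUTOFF
    ? { db: 'legacy-1536d', reason: 'indexed before migration cutoff' }
    : { db: 'local-384d', reason: 'indexed at or after migration cutoff' };
}

console.log(routeByContentAge(new Date('2025-11-15')).db); // legacy-1536d
console.log(routeByContentAge(new Date('2026-06-01')).db); // local-384d
```

Note that queries spanning both eras must fan out to both databases and merge results, which is exactly the permanent maintenance burden the text warns about.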


Phase 6 - Calibrate Similarity Thresholds

Different embedding models produce different similarity score distributions. A cosine similarity threshold of 0.85 that worked well with OpenAI embeddings will almost certainly be wrong for bge-small-en-v1.5. Use calibrateThreshold() to compute the right value empirically from your own corpus.

import { calibrateThreshold } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.embedding('Xenova/bge-small-en-v1.5');

// Use a representative sample of your actual content
const corpus = [
  'How to reset my password',
  'Password recovery steps',
  'Shipping policy for international orders',
  'Return and refund process',
  'Account billing FAQ',
  // ... at least 50-100 representative texts
];

const { threshold, distribution } = await calibrateThreshold({
  model,
  corpus,
  percentile: 90,          // 90th percentile = balanced threshold
  distanceFunction: 'cosine',
  maxSamples: 200,
});

console.log(`Calibrated threshold: ${threshold.toFixed(4)}`);
console.log(`Distribution - mean: ${distribution.mean.toFixed(4)}, ` +
            `stdDev: ${distribution.stdDev.toFixed(4)}, ` +
            `min: ${distribution.min.toFixed(4)}, max: ${distribution.max.toFixed(4)}`);

// Use the calibrated threshold for searches against the Phase 4 database
// (db and queryVector come from the earlier phases)
const results = await db.search(queryVector, {
  k: 10,
  threshold,
});

How to choose the percentile: 80 is permissive (more results, more noise), 90 is balanced, 95 is strict (fewer results, higher precision). Start with 90 and adjust based on user feedback after rollout.
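To see what calibrateThreshold() is doing under the hood, here is the core idea in plain TypeScript: pairwise cosine similarities over the corpus embeddings, then a percentile cut. This is an illustration of the technique, not the library's actual implementation:

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// All pairwise similarities over a corpus of embeddings, then the Nth
// percentile. A 90th-percentile cut means only the top ~10% most-similar
// pairs in your corpus would count as a match.
function percentileThreshold(embeddings: number[][], percentile: number): number {
  const sims: number[] = [];
  for (let i = 0; i < embeddings.length; i++) {
    for (let j = i + 1; j < embeddings.length; j++) {
      sims.push(cosineSimilarity(embeddings[i], embeddings[j]));
    }
  }
  sims.sort((x, y) => x - y);
  const idx = Math.min(sims.length - 1, Math.floor((percentile / 100) * sims.length));
  return sims[idx];
}

// Toy 2d example: three near-duplicate vectors and one outlier.
const toy = [[1, 0], [0.9, 0.1], [0.95, 0.05], [0, 1]];
const threshold = percentileThreshold(toy, 90);
console.log(`Toy threshold @ p90: ${threshold.toFixed(4)}`);
```

The real function also samples down large corpora (the maxSamples option) and supports other distance functions, but the percentile-over-pairwise-scores shape is the same.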


Phase 7 - Set Up Provider Fallbacks

Not every user's device can run local inference. A 2019 Chromebook with 4 GB of RAM will struggle with a 3B-parameter LLM. Use detectCapabilities() and checkModelSupport() to detect device limits and set up graceful degradation.

import {
  detectCapabilities,
  recommendModels,
  checkModelSupport,
} from '@localmode/core';

const caps = await detectCapabilities();

// Check if the target LLM can run
const support = await checkModelSupport({
  modelId: 'Llama-3.2-3B-Instruct-q4f16_1-MLC',
  estimatedMemory: 2_000_000_000,
  estimatedStorage: 1_800_000_000,
});

if (!support.supported) {
  console.warn(support.reason);
  // support.fallbackModels provides smaller alternatives
  for (const fallback of support.fallbackModels ?? []) {
    console.log(`Try: ${fallback.modelId} - ${fallback.reason}`);
  }
}

// Or let the recommendation engine pick the best model for this device
const recs = recommendModels(caps, {
  task: 'generation',
  maxSizeMB: 1000,
  limit: 3,
});

for (const rec of recs) {
  console.log(`${rec.entry.name} - score: ${rec.score}`);
  console.log(`  ${rec.reasons.join(', ')}`);
}

For embedding models specifically, the fallback chain is simpler because even low-end devices can run bge-small-en-v1.5 (33 MB quantized). The risk is primarily with LLMs and speech-to-text models.
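A fallback chain is just an ordered list walked until a candidate fits the device. This sketch mocks the capability check with a raw memory budget -- the model list and memory figures are illustrative, and in practice you would call checkModelSupport() for each candidate:

```typescript
interface ModelCandidate {
  modelId: string;
  requiredMemoryBytes: number;
}

// Ordered best-to-worst. The last entry should run on nearly any device.
const generationChain: ModelCandidate[] = [
  { modelId: 'Llama-3.2-3B-Instruct-q4f16_1-MLC', requiredMemoryBytes: 2_000_000_000 },
  { modelId: 'Llama-3.2-1B-Instruct-q4f16_1-MLC', requiredMemoryBytes: 800_000_000 },
  { modelId: 'SmolLM2-360M-Instruct-q4f16_1-MLC', requiredMemoryBytes: 300_000_000 },
];

// Picks the first candidate whose footprint fits the available budget.
function pickModel(
  chain: ModelCandidate[],
  availableMemoryBytes: number
): ModelCandidate | null {
  return chain.find((m) => m.requiredMemoryBytes <= availableMemoryBytes) ?? null;
}

const lowEnd = pickModel(generationChain, 1_000_000_000); // 1 GB budget
console.log(lowEnd?.modelId); // falls through to the 1B model
```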


Phase 8 - Detect Device Capabilities and Optimize Batch Sizes

Use computeOptimalBatchSize() to adapt to each user's hardware. This is especially important for the initial re-embedding pass (Phase 5) and ongoing ingestion pipelines.

import { computeOptimalBatchSize, streamEmbedMany } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.embedding('Xenova/bge-small-en-v1.5');

const { batchSize, reasoning, deviceProfile } = computeOptimalBatchSize({
  taskType: 'embedding',
  modelDimensions: 384,
});

console.log(`Batch size: ${batchSize}`);
console.log(`Device: ${deviceProfile.cores} cores, ${deviceProfile.memoryGB}GB RAM`);
console.log(reasoning);

// Use the computed batch size to stream embeddings for your corpus (allTexts)
for await (const { embedding, index } of streamEmbedMany({
  model,
  values: allTexts,
  batchSize,
  onBatch: ({ index, count, total }) => {
    console.log(`Progress: ${index + count}/${total}`);
  },
})) {
  await db.add({ id: `doc-${index}`, vector: embedding });
}

The formula scales linearly with CPU cores and available RAM relative to a reference device (4 cores, 8 GB). A device with 8 cores and 16 GB RAM gets 4x the base batch size. WebGPU presence adds a 1.5x multiplier. The result is always clamped between task-specific bounds (embedding: 4-256, ingestion: 8-512).
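The scaling rule just described can be sketched directly. The base batch size of 32 is our assumption for illustration (the library's actual constant may differ); the reference device, WebGPU multiplier, and clamp bounds come from the text above:

```typescript
// Batch size scaling as described: linear in cores and RAM relative to a
// 4-core / 8 GB reference device, x1.5 with WebGPU, clamped to task bounds.
function sketchBatchSize(
  cores: number,
  memoryGB: number,
  hasWebGPU: boolean,
  base = 32 // assumed base batch size for illustration
): number {
  const scale = (cores / 4) * (memoryGB / 8) * (hasWebGPU ? 1.5 : 1);
  const raw = Math.round(base * scale);
  return Math.min(256, Math.max(4, raw)); // embedding task bounds from the text
}

console.log(sketchBatchSize(8, 16, false)); // 4x the reference device -> 128
console.log(sketchBatchSize(8, 16, true));  // with WebGPU -> 192
console.log(sketchBatchSize(2, 4, false));  // low-end device -> 8
```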


Phase 9 - Set Up Ongoing Monitoring

After migration, you need to detect if embedding quality drifts over time -- especially if you later switch to a different local model.

Drift detection with checkModelCompatibility():

import { checkModelCompatibility, reindexCollection } from '@localmode/core';

const result = await checkModelCompatibility(db, currentModel);

if (result.status === 'incompatible') {
  console.warn(
    `Model changed: ${result.storedModel?.modelId} -> ${result.currentModel.modelId}`
  );
  console.log(`${result.documentCount} documents need re-embedding`);

  // Automatically re-embed all documents
  const reindexResult = await reindexCollection(db, currentModel, {
    batchSize: 32,
    onProgress: ({ completed, total, skipped, phase }) => {
      console.log(`${phase}: ${completed}/${total} (${skipped} skipped)`);
    },
  });

  console.log(`Reindexed: ${reindexResult.reindexed}, ` +
              `Skipped: ${reindexResult.skipped}, ` +
              `Duration: ${reindexResult.durationMs}ms`);
}

checkModelCompatibility() compares the stored model fingerprint (model ID, provider, dimensions) against the current model. It returns one of three statuses: 'compatible' (same model), 'incompatible' (different model, same dimensions), or 'dimension-mismatch' (different dimensions, requires full re-embed). reindexCollection() is resumable -- if interrupted, it picks up from the last completed batch on the next run.
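The three-way status can be derived from the fingerprint fields alone. A sketch of that comparison logic -- the types here are ours, modeled on the fields the post describes:

```typescript
interface ModelFingerprint {
  modelId: string;
  provider: string;
  dimensions: number;
}

type CompatStatus = 'compatible' | 'incompatible' | 'dimension-mismatch';

// Dimensions decide between 'incompatible' and 'dimension-mismatch';
// an identical fingerprint is 'compatible'.
function compareFingerprints(
  stored: ModelFingerprint,
  current: ModelFingerprint
): CompatStatus {
  if (stored.dimensions !== current.dimensions) return 'dimension-mismatch';
  if (stored.modelId !== current.modelId || stored.provider !== current.provider) {
    return 'incompatible';
  }
  return 'compatible';
}

const stored = { modelId: 'Xenova/bge-small-en-v1.5', provider: 'transformers', dimensions: 384 };
console.log(compareFingerprints(stored, stored)); // compatible
console.log(compareFingerprints(stored, { ...stored, modelId: 'Xenova/gte-small' })); // incompatible
console.log(compareFingerprints(stored, { ...stored, dimensions: 768 })); // dimension-mismatch
```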

Periodic quality checks: Schedule evaluateModel() runs against a fixed golden test set on a weekly or monthly cadence. Track the score over time. A sudden drop indicates that a model update or data distribution shift needs attention.


Phase 10 - Gradual Rollout Strategy

Do not migrate 100% of traffic on day one. Use a phased rollout to catch issues early.

| Week | Local Traffic | Cloud Traffic | Gate Criteria |
| --- | --- | --- | --- |
| 1 | 5% (internal/dogfood) | 95% | No crashes, embeddings produce results |
| 2 | 20% | 80% | Search relevance within 5% of cloud baseline |
| 3 | 50% | 50% | P95 latency acceptable, no user complaints |
| 4 | 80% | 20% | Cost savings confirmed, edge cases handled |
| 5 | 100% | 0% (keep API key for emergency) | Full migration, monitoring active |

At each gate:

  1. Run evaluateModel() against your golden test set and compare to the cloud baseline score
  2. Check error rates in your application logs
  3. Monitor device capability distribution -- what percentage of users hit fallbacks?
  4. Compare search result click-through rates between local and cloud cohorts

Keep your cloud API key active for at least 90 days after reaching 100% local. If a critical edge case surfaces, you can route specific requests back to the cloud while you address it.
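Cohort assignment should be deterministic per user so nobody flip-flops between backends from one session to the next. A minimal hash-bucket sketch -- FNV-1a is one common choice here; any stable hash works:

```typescript
// FNV-1a 32-bit hash -- stable across sessions for the same user ID.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// True when this user falls inside the current local-traffic percentage.
// The same user ID always maps to the same bucket, so raising the
// percentage only ever moves users from cloud to local, never back.
function useLocalInference(userId: string, rolloutPercent: number): boolean {
  return fnv1a(userId) % 100 < rolloutPercent;
}

const user = 'user-4821';
console.log(useLocalInference(user, 5));   // week 1 cohort check
console.log(useLocalInference(user, 100)); // week 5: everyone local
```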


The Migration Payoff

Once complete, every AI operation runs at zero marginal cost. No per-token fees, no rate limits, no API key rotation, no vendor lock-in. The initial engineering investment pays for itself within weeks for any application processing more than a few hundred thousand AI calls per month.

The real win, though, is architectural: your application works offline, responds in milliseconds instead of hundreds of milliseconds, and your users' data never touches a third-party server. That is not an optimization -- it is a fundamentally different privacy posture.


Methodology

This checklist is based on the actual LocalMode API surface as implemented in @localmode/core (v1.x). All function signatures, option types, and return types referenced in this post match the shipped code. Pricing figures for OpenAI APIs are sourced from the OpenAI Pricing page as of March 2026. Embedding dimensions for text-embedding-3-small (1536d) and text-embedding-3-large (3072d) are documented in the OpenAI Embeddings guide. GPT-4o pricing ($2.50/$10.00 per 1M tokens) and GPT-4o-mini pricing ($0.15/$0.60 per 1M tokens) are from the OpenAI API pricing page. Device capability detection, model recommendation, and adaptive batching are covered in the LocalMode Core Capabilities documentation. The import/export system supporting Pinecone, ChromaDB, CSV, and JSONL formats is documented in the Import/Export guide. Evaluation metrics and the evaluateModel() orchestrator are documented in the Evaluation guide. Embedding drift detection and threshold calibration are documented in the Embedding Drift and Threshold Calibration guides.


Try it yourself

Visit localmode.ai to try 30+ AI demo apps running entirely in your browser. No sign-up, no API keys, no data leaves your device.

Read the Getting Started guide to add local AI to your application in under 5 minutes.