SaaS "Local Mode" Toggle

Add a "Process Locally" toggle to your SaaS - give privacy-conscious users control over where AI runs.

Category: Architecture Pattern

The Problem

SaaS applications with AI features face a tension: some users want cloud-powered AI for maximum quality, while others (enterprises, regulated industries, privacy advocates) want guarantees that their data stays on-device. Building separate implementations for each is expensive.

This is a common challenge for teams building modern applications. Traditional approaches either compromise on privacy (by sending data to cloud APIs), require complex server infrastructure (adding cost and maintenance burden), or sacrifice functionality (by avoiding AI entirely). LocalMode provides a fourth option: run the AI locally in the browser.

The Solution

The "Local Mode Toggle" pattern lets users choose where AI processing happens. Both modes use the same application UI; only the inference backend changes. When toggled to "local," all AI functions route through LocalMode (browser inference). When toggled to "cloud," they route through your existing cloud API. LocalMode's function-first API keeps the switch small: embed(), streamText(), and classify() each accept a model parameter, and that model can be either local or cloud.

Why Local-First?

Building this feature with on-device inference provides three structural advantages over cloud-based alternatives:

  1. Zero marginal cost - After the initial model download, every inference operation is free. No per-token fees, no monthly API bills, no surprise invoices. This matters especially for features used frequently or by many users.
  2. Architectural privacy - User data never leaves the device. This is not a policy promise ("we won't look at your data") but an architectural guarantee: the data physically cannot reach any server because the processing happens in the browser tab.
  3. Offline capability - Once models are cached in IndexedDB, the entire feature works without internet. This is critical for field deployments, mobile apps with spotty connectivity, and enterprise environments with restricted networks.

Technology Stack

Package                  Purpose
@localmode/core          Unified API for both local and cloud models
@localmode/transformers  Local inference when "Process Locally" is enabled
@localmode/ai-sdk        Vercel AI SDK compatibility for cloud/local parity

Install the required packages:

npm install @localmode/core @localmode/transformers @localmode/ai-sdk

Implementation

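// The toggle pattern: same API, different provider
import { embed } from '@localmode/core';
import { transformers } from '@localmode/transformers';

// `cloudProvider` is a placeholder for your existing cloud embedding provider.
// It is not defined in this snippet; import or construct it before use.

function getModel(isLocal: boolean) {
  if (isLocal) {
    // Runs in the browser; the input never leaves the device
    return transformers.embedding('Xenova/bge-small-en-v1.5');
  }
  // Sends the input to your cloud API
  return cloudProvider.embedding('text-embedding-3-small');
}

// Application code doesn't change.
// `userPreferences` and `userQuery` come from your own application state.
const model = getModel(userPreferences.processLocally);
const { embedding } = await embed({ model, value: userQuery });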
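
One question the toggle raises is what to do when a user has chosen local processing but the browser cannot run the model. The sketch below is one possible answer, not part of LocalMode: `resolveMode`, the `Mode` type, and the `localSupported` flag are all hypothetical names introduced for illustration. It refuses to fall back to the cloud silently, since that would break the promise the toggle makes.

```typescript
// Hypothetical helper, not part of LocalMode: decides which backend to use
// without ever sending data to the cloud against the user's stated choice.
type Mode =
  | { kind: 'local' }
  | { kind: 'cloud' }
  | { kind: 'blocked'; reason: string };

interface ModeInput {
  processLocally: boolean; // the user's toggle
  localSupported: boolean; // assumed to be computed elsewhere, e.g. from a capability check
}

function resolveMode(input: ModeInput): Mode {
  if (!input.processLocally) {
    return { kind: 'cloud' };
  }
  if (input.localSupported) {
    return { kind: 'local' };
  }
  // The user asked for local processing: surface the problem instead of
  // quietly routing their data to a server.
  return {
    kind: 'blocked',
    reason: 'Local processing is not available in this browser.',
  };
}
```

With this shape, the UI can show the `reason` and let the user decide whether to switch the toggle off, rather than making that decision for them.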

How This Works

The code above demonstrates the complete pipeline. Let us walk through the key decisions:

  • Model selection - The example pairs a small local embedding model (Xenova/bge-small-en-v1.5) with a cloud one (text-embedding-3-small). The local model is chosen for its balance of size, speed, and quality: smaller models load faster and use less memory, while larger models produce better results. Start with the recommended models and upgrade only if quality is insufficient for your users.
  • Browser APIs - LocalMode uses IndexedDB for persistent storage (vectors, model cache), Web Workers for background processing (keeping the UI responsive during inference), and the Web Crypto API for optional encryption.
  • Error handling - All LocalMode functions throw typed errors (ModelLoadError, StorageError, ValidationError) with actionable hints. Wrap calls in try/catch and use the error's hint property to display user-friendly messages.
  • Cancellation - Pass an AbortSignal to any long-running operation. This lets users cancel searches, embeddings, or generation without waiting for completion.
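
The error-handling and cancellation points above can be sketched without the library. This is a minimal illustration under stated assumptions: `HintedError` is a stand-in for the typed errors named above (in an application they would come from the library), and `toUserMessage`, `processUntilAborted`, and `SignalLike` are hypothetical names. The loop shows cooperative cancellation for your own multi-step work; a real `AbortSignal` satisfies `SignalLike` because it exposes `aborted`.

```typescript
// Stand-in for the typed errors named above (ModelLoadError, StorageError,
// ValidationError). It exists only so this sketch runs on its own.
class HintedError extends Error {
  readonly hint: string;
  constructor(name: string, message: string, hint: string) {
    super(message);
    this.name = name;
    this.hint = hint;
  }
}

// Minimal view of an AbortSignal, so the sketch does not need DOM typings.
interface SignalLike {
  readonly aborted: boolean;
}

// Turn any thrown value into text that is safe to show a user.
function toUserMessage(err: unknown): string {
  if (typeof err === 'object' && err !== null && 'hint' in err) {
    const hint = (err as { hint: unknown }).hint;
    if (typeof hint === 'string' && hint.length > 0) {
      return hint;
    }
  }
  return 'Something went wrong. Please try again.';
}

// Process items one at a time, stopping before the next item once the caller
// has aborted. `work` stands in for a per-item step such as embedding one document.
async function processUntilAborted<T, R>(
  items: readonly T[],
  work: (item: T) => Promise<R>,
  signal: SignalLike,
): Promise<R[]> {
  const results: R[] = [];
  for (const item of items) {
    if (signal.aborted) {
      break; // keep what finished, skip the rest
    }
    results.push(await work(item));
  }
  return results;
}
```

In a browser, the signal would typically come from `new AbortController()`, with `controller.abort()` called from a Cancel button and `controller.signal` passed to the operation.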

Production Considerations

When deploying this solution to production, consider these factors:

Model preloading: Download models during user onboarding or application setup, not on first use. Use preloadModel() with an onProgress callback to show download progress. This avoids the poor experience of a loading spinner on the first AI interaction.
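
As a rough sketch of the progress display, the helper below turns a progress event into a label. The `DownloadProgress` shape and the `progressLabel` name are assumptions made for illustration; the payload that `onProgress` actually receives may use different field names, so adapt it to the library's types.

```typescript
// Assumed shape of a download progress event. The real payload passed to
// onProgress may differ; treat these field names as placeholders.
interface DownloadProgress {
  loadedBytes: number;
  totalBytes?: number; // may be unknown, e.g. when no Content-Length is sent
}

// Build a short status line for an onboarding screen.
function progressLabel(p: DownloadProgress): string {
  if (p.totalBytes === undefined || p.totalBytes <= 0) {
    // Without a total, fall back to showing megabytes received so far.
    const mb = (p.loadedBytes / (1024 * 1024)).toFixed(1);
    return `Downloading model: ${mb} MB`;
  }
  const pct = Math.min(100, Math.round((p.loadedBytes / p.totalBytes) * 100));
  return `Downloading model: ${pct}%`;
}
```

A callback passed as `onProgress` could map each event through `progressLabel` and write the result into the onboarding UI.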

Storage management: IndexedDB quotas vary by browser (Chrome allows a single origin to use up to 60% of total disk space; iOS Safari is more restrictive). Use getStorageQuota() to check available space, and call navigator.storage.persist() to request persistent storage that survives browser storage pressure.
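
A quota check before downloading might look like the sketch below. It uses the shape returned by the standard `navigator.storage.estimate()` call rather than getStorageQuota(), whose return type is not shown here. `hasRoomFor`, `EstimateLike`, and the 50 MB margin are illustrative assumptions.

```typescript
// Mirrors the result of navigator.storage.estimate(): both fields are optional.
interface EstimateLike {
  usage?: number;
  quota?: number;
}

// Hypothetical helper: true when the origin has room for a model of the given
// size, keeping a safety margin for vectors and other app data.
function hasRoomFor(
  modelBytes: number,
  estimate: EstimateLike,
  marginBytes: number = 50 * 1024 * 1024, // 50 MB margin, an arbitrary choice
): boolean {
  if (estimate.quota === undefined) {
    return false; // unknown quota: be conservative
  }
  const free = estimate.quota - (estimate.usage ?? 0);
  return free >= modelBytes + marginBytes;
}

// Browser usage (not run here, since it needs a browser context):
//   const estimate = await navigator.storage.estimate();
//   if (hasRoomFor(modelSizeBytes, estimate)) {
//     await navigator.storage.persist(); // resolves to true if persistence was granted
//   }
```

Keeping the check as a pure function over the estimate makes it easy to unit-test without a browser.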

Device adaptation: Not all users have the same hardware. Use detectCapabilities() and recommendModels() to select models appropriate for each user's device - call recommendModels(caps, { task }) with the detected capabilities. A desktop with a discrete GPU can handle 3GB models; a mobile phone with 3GB RAM should use models under 300MB.
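
If you need a manual size ceiling alongside recommendModels(), the two reference points above can be encoded directly. This is only a sketch: `DeviceSummary` is a hypothetical shape (the object returned by detectCapabilities() may expose different fields), and treating every device without a discrete GPU like the 3GB phone is a deliberately conservative assumption.

```typescript
// Hypothetical capability summary, not the library's actual type.
interface DeviceSummary {
  hasDiscreteGpu: boolean;
}

const MB = 1024 * 1024;

// Size ceiling for a model, using the two reference points from the text:
// about 3 GB with a discrete GPU, about 300 MB otherwise.
function maxModelBytes(device: DeviceSummary): number {
  if (device.hasDiscreteGpu) {
    return 3 * 1024 * MB;
  }
  return 300 * MB;
}

// True when a model of the given size is within the device's ceiling.
function fitsDevice(modelBytes: number, device: DeviceSummary): boolean {
  return modelBytes <= maxModelBytes(device);
}
```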

Error boundaries: Wrap AI-powered components in error boundaries. If model loading fails (network error, storage quota exceeded, incompatible browser), fall back gracefully - show the non-AI version of the feature rather than crashing the page.
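
Outside of a component framework, the same idea can be expressed as a small wrapper that reports which version of the feature to render. This is a framework-neutral sketch; `enableFeature` and `FeatureState` are hypothetical names, and `loadModel` stands in for whatever loading call the application uses.

```typescript
// Result of trying to enable an AI-backed feature.
type FeatureState<M> =
  | { kind: 'ai'; model: M }
  | { kind: 'basic'; reason: string };

// Try to load the model. On any failure, report a 'basic' state so the UI can
// render the non-AI version instead of crashing the page.
async function enableFeature<M>(
  loadModel: () => Promise<M>,
): Promise<FeatureState<M>> {
  try {
    return { kind: 'ai', model: await loadModel() };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { kind: 'basic', reason };
  }
}
```

The caller switches on `kind`: render the AI-powered component for `'ai'`, and the plain version (for example, keyword search instead of semantic search) for `'basic'`.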

Frequently Asked Questions

How do I market this feature?

Position it as a competitive advantage: "Your data, your device." Enterprise buyers actively look for this capability. Some SaaS companies charge a premium for the "local processing" tier, while others use it as a differentiator in their standard plan.

Do users see quality differences?

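For embeddings and classification, quality is nearly indistinguishable. For LLM generation, local models (1-4B parameters) are noticeably less capable than GPT-4o on complex tasks but adequate for most chat and summarization use cases. Be transparent about the trade-off. One practical caveat: embeddings from different models are not interchangeable, so vectors produced in local mode cannot be compared with vectors produced in cloud mode. Re-embed stored content when a user switches modes.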

Methodology

Code examples were verified against the LocalMode monorepo source (packages/core/src/, packages/transformers/src/). Every function (embed, streamText, classify, getStorageQuota, detectCapabilities, recommendModels, preloadModel) and error class (ModelLoadError, StorageError, ValidationError) was confirmed as a real export. The IndexedDB storage quota figure was verified against the Chrome-team article on web.dev. No fabricated cost-savings percentages appear in this post - cost claims are limited to architectural facts (zero marginal per-inference cost after model download).
