What is a hybrid AI architecture combining local and cloud models?

A hybrid AI architecture routes roughly 95% of AI requests -- embeddings, classification, NER, reranking, and summarization -- to local browser models at zero cost, while reserving cloud APIs for the 5% of tasks that genuinely require frontier reasoning, such as complex multi-step logic or long-context generation.

How much faster is local AI inference compared to cloud APIs?

Local inference eliminates network round-trips entirely. Embeddings run in 8-30ms locally versus 250-2,000ms via cloud APIs (8-66x faster). Classification takes 10-50ms locally versus 300-1,500ms in the cloud (6-30x faster).

How much money can hybrid AI architecture save at scale?

For an application with 10,000 daily active users making 50 AI calls per day, a 95/5 hybrid approach reduces annual cloud API costs from $50K-$300K+ down to $2.5K-$15K, capturing 95% of savings while maintaining full quality coverage.

How does capability detection work in a hybrid AI system?

LocalMode's detectCapabilities() function checks hardware specs including WebGPU support, WASM SIMD, device memory, CPU cores, and available storage. The checkModelSupport() function then determines if a specific model can run on the device, suggesting fallback models if not.

The Hybrid AI Architecture: Local Models for 95% of Requests, Cloud for the Rest

Q: What percentage of AI requests can local browser models handle?

Analysis of 20+ showcase applications shows approximately 95% of AI calls fall into categories where local models deliver 90-99% of cloud quality. These include embeddings (99% of OpenAI), classification (93%+), NER (93-96%), and reranking (87-93%).

The most common objection to local AI is also the most reasonable one: cloud models are better.

That is true. GPT-5, Claude, and Gemini are better at complex reasoning, nuanced creative writing, and multi-step problem solving than any model that fits in a browser tab. No serious person disputes this.

But here is the question that matters for production architecture: what percentage of your AI requests actually require frontier-quality reasoning?

We analyzed the request patterns across the 20+ showcase applications at localmode.ai and found a consistent distribution. Roughly 95% of AI calls fall into categories where local models deliver 90-99% of cloud quality: embeddings, classification, NER, reranking, extractive QA, summarization, and translation. The remaining 5% - open-ended generation, complex multi-step reasoning, long-context synthesis - genuinely benefit from cloud APIs.

The pragmatic architecture is not local OR cloud. It is local for the 95%, cloud for the 5%. This post shows you exactly how to build it.

The 95/5 Split: What Goes Where

Not all AI tasks are created equal. Some are well-defined, bounded operations where a small purpose-built model matches or nearly matches a general-purpose LLM. Others require the kind of broad reasoning that only frontier models deliver today.

Reference benchmarks

Quality percentages in this table are from our benchmark analysis of 18 model categories against OpenAI, Google, Cohere, and AWS. All local models run entirely in the browser.

Task	Local Model	Quality vs Cloud	Cloud Fallback	When to Use Cloud
Embeddings	`bge-small-en-v1.5` (33MB)	99% of OpenAI	`text-embedding-3-small`	Never - local matches cloud
Sentiment / Classification	`distilbert-sst-2` (67MB)	93%+ of GPT-5	GPT-5	Multi-label with 50+ categories
Zero-shot classification	`mobilebert-mnli` (100MB)	92-95% of GPT-5	GPT-5	Ambiguous or overlapping labels
Named entity recognition	`bert-base-NER` (178MB)	93-96% of GPT-5	GPT-5	Rare entity types, nested entities
Extractive QA	`distilbert-squad` (67MB)	90-93% of GPT-5	GPT-5	Multi-hop reasoning over long docs
Reranking	`ms-marco-MiniLM-L6-v2` (23MB)	87-93% of Cohere	Cohere Rerank	Multilingual or domain-specific
Summarization	`distilbart-cnn-6-6` (305MB)	80-85% of GPT-5	GPT-5	Abstractive over 10K+ tokens
Translation	`opus-mt-*` (~100MB each)	85% of Google Translate	Google Translate	Rare language pairs, literary text
Speech-to-text	`moonshine-base` (63MB)	~80% of Whisper API	Whisper API	Heavy accents, noisy environments
LLM chat	Qwen3-4B (2.2GB)	84% MMLU-Redux	GPT-5	Complex reasoning, 272K context
LLM chat (thinking)	Qwen3.5-4B (~2.5GB)	~96% MMLU-Redux	GPT-5	Creative writing, nuanced tone

The first seven rows - embeddings through translation - represent the vast majority of AI calls in typical production applications. Search boxes embed queries. Moderation pipelines classify content. Support systems extract entities and answer questions. Recommendation engines rerank results. None of these require GPT-5.

Architecture Overview

A hybrid AI architecture uses capability detection at startup, routes the common path through local models, and escalates to cloud APIs only when the task exceeds what local inference can handle.

                            +------------------+
                            |   User Request   |
                            +--------+---------+
                                     |
                              +------v-------+
                              |  Task Router  |
                              +------+-------+
                                     |
                    +----------------+----------------+
                    |                                  |
          +---------v----------+            +----------v-----------+
          |   LOCAL PATH (95%) |            |   CLOUD PATH (5%)    |
          |                    |            |                      |
          |  Embeddings        |            |  Complex reasoning   |
          |  Classification    |            |  Long-context gen    |
          |  NER               |            |  Creative writing    |
          |  Reranking         |            |  Multi-hop QA        |
          |  Extractive QA     |            |  Rare languages      |
          |  Summarization     |            |                      |
          |  Translation       |            |                      |
          +---------+----------+            +----------+-----------+
                    |                                  |
          +---------v----------+            +----------v-----------+
          | @localmode/core    |            |  Cloud API (fetch)   |
          | @localmode/        |            |  OpenAI / Cohere /   |
          |   transformers     |            |  Google / Anthropic  |
          +---------+----------+            +----------+-----------+
                    |                                  |
                    +----------------+----------------+
                                     |
                              +------v-------+
                              |   Response   |
                              +--------------+

The key insight: the router is not a load balancer. It is a capability-aware decision function that checks what the device can handle and what the task requires.

Step 1: Detect Device Capabilities

Before routing any request, determine what the user's device can support. LocalMode provides detectCapabilities() for a full hardware and feature inventory, plus individual checks for specific features.

import {
  detectCapabilities,
  isWebGPUSupported,
  checkModelSupport,
} from '@localmode/core';

// Full capability report at app startup
const capabilities = await detectCapabilities();

console.log('WebGPU:', capabilities.features.webgpu);
console.log('WASM SIMD:', capabilities.features.simd);
console.log('Device memory:', capabilities.hardware.memory, 'GB');
console.log('CPU cores:', capabilities.hardware.cores);
console.log('Storage available:', capabilities.storage.availableBytes);

// Check if a specific model can run on this device
const modelCheck = await checkModelSupport({
  modelId: 'Qwen3-4B-Instruct',
  estimatedMemory: 2_200_000_000,  // 2.2GB
  estimatedStorage: 2_200_000_000,
  prefersWebGPU: true,
});

if (!modelCheck.supported) {
  console.log('Cannot run locally:', modelCheck.reason);
  console.log('Fallback models:', modelCheck.fallbackModels);
  // → Route to cloud API instead
}

detectCapabilities() returns browser info, device type, hardware specs (cores, memory, GPU), feature flags (WebGPU, WASM, SIMD, threads, IndexedDB), and storage quota - everything you need to make routing decisions.

Step 2: Route Tasks to Local Models

For the 95% path, use LocalMode's function-first API. Each function accepts a model interface - implementations come from provider packages like @localmode/transformers.

import { embed, classify, rerank, extractEntities } from '@localmode/core';
import { transformers } from '@localmode/transformers';

// Embeddings - 99% of OpenAI quality, 8-30ms latency, $0
const { embedding } = await embed({
  model: transformers.embedding('Xenova/bge-small-en-v1.5'),
  value: userQuery,
});

// Classification - 93%+ of GPT-5, 10-50ms, $0
const { label, score } = await classify({
  model: transformers.classifier(
    'Xenova/distilbert-base-uncased-finetuned-sst-2-english'
  ),
  text: userMessage,
});

// NER - 93-96% of GPT-5, 30-100ms, $0
const { entities } = await extractEntities({
  model: transformers.ner('Xenova/bert-base-NER'),
  text: documentText,
});

// Reranking - 87-93% of Cohere, 20-80ms, $0
const { results } = await rerank({
  model: transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2'),
  query: searchQuery,
  documents: candidateResults,
});

Every function supports abortSignal for cancellation, maxRetries for resilience, and returns structured results with usage metadata. Latencies are after model warm-up - first inference includes a one-time model download (cached in IndexedDB for subsequent visits).

Step 3: Build the Routing Middleware

For text generation - the task most likely to need cloud escalation - use wrapLanguageModel() to create middleware that routes based on task complexity.

import {
  generateText,
  streamText,
  wrapLanguageModel,
  detectCapabilities,
} from '@localmode/core';
import { webllm } from '@localmode/webllm';
import type { LanguageModelMiddleware } from '@localmode/core';

// Create a routing middleware that checks task complexity
const hybridRoutingMiddleware: LanguageModelMiddleware = {
  wrapGenerate: async ({ doGenerate, prompt, model }) => {
    // Estimate if this task needs cloud escalation
    const needsCloud =
      prompt.length > 8000 ||              // Long context
      requiresMultiStepReasoning(prompt) || // Complex reasoning
      requiresCreativeWriting(prompt);      // Nuanced generation

    if (needsCloud) {
      // Escalate to cloud API
      const response = await fetch('/api/generate', {
        method: 'POST',
        body: JSON.stringify({ prompt }),
      });
      const data = await response.json();
      return {
        text: data.text,
        finishReason: 'stop' as const,
        usage: {
          inputTokens: data.usage.input_tokens,
          outputTokens: data.usage.output_tokens,
          totalTokens: data.usage.total_tokens,
          durationMs: data.duration_ms,
        },
      };
    }

    // Handle locally - the 95% path
    return doGenerate();
  },
};

// Wrap the local model with routing middleware
const hybridModel = wrapLanguageModel({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
  middleware: hybridRoutingMiddleware,
});

// Use it exactly like any other model - routing is transparent
const { text } = await generateText({
  model: hybridModel,
  prompt: 'Summarize this customer feedback...',
  maxTokens: 200,
});

The middleware intercepts every generation call. Simple requests (summarization, short answers, structured extraction) run locally at zero cost. Complex requests transparently escalate to your cloud API endpoint. The calling code does not need to know which path was taken.

Step 4: Compose Multiple Middleware

Real production systems layer multiple concerns: routing, caching, logging, and guardrails. Use composeLanguageModelMiddleware() to stack them.

import {
  wrapLanguageModel,
  composeLanguageModelMiddleware,
} from '@localmode/core';
import type { LanguageModelMiddleware } from '@localmode/core';

// Cache responses to avoid redundant inference
const cachingMiddleware: LanguageModelMiddleware = {
  wrapGenerate: async ({ doGenerate, prompt }) => {
    const cached = responseCache.get(prompt);
    if (cached) return cached;
    const result = await doGenerate();
    responseCache.set(prompt, result);
    return result;
  },
};

// Log all generation calls for observability
const loggingMiddleware: LanguageModelMiddleware = {
  wrapGenerate: async ({ doGenerate, prompt, model }) => {
    const start = Date.now();
    const result = await doGenerate();
    console.log(
      `[${model.modelId}] ${prompt.slice(0, 40)}... → ${Date.now() - start}ms`
    );
    return result;
  },
};

// Stack: guardrails → cache → routing → logging
const composedMiddleware = composeLanguageModelMiddleware([
  loggingMiddleware,
  cachingMiddleware,
  hybridRoutingMiddleware,
]);

const productionModel = wrapLanguageModel({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
  middleware: composedMiddleware,
});

Middleware are applied in order: loggingMiddleware wraps the outermost layer, cachingMiddleware checks the cache before hybridRoutingMiddleware decides local vs. cloud. This is the same composition pattern used by Vercel AI SDK middleware.

Step 5: Handle Device Fallbacks Gracefully

Not every device can run every local model. A low-end phone may struggle with a 2.2GB LLM but handle a 33MB embedding model without issue. Build your hybrid architecture to degrade gracefully.

import {
  embed,
  classify,
  detectCapabilities,
  checkModelSupport,
  isWebGPUSupported,
} from '@localmode/core';
import { transformers } from '@localmode/transformers';

async function createHybridEmbedder() {
  // Embeddings are lightweight - almost always run locally
  // Even on low-end devices, bge-small is only 33MB
  return {
    embed: async (text: string) => {
      const { embedding } = await embed({
        model: transformers.embedding('Xenova/bge-small-en-v1.5'),
        value: text,
      });
      return embedding;
    },
  };
}

async function createHybridClassifier() {
  const caps = await detectCapabilities();
  const hasEnoughMemory =
    caps.hardware.memory === undefined || caps.hardware.memory >= 2;

  if (hasEnoughMemory) {
    // Run classification locally - 95%+ of cloud quality
    return {
      classify: async (text: string) => {
        const { label, score } = await classify({
          model: transformers.classifier(
            'Xenova/distilbert-base-uncased-finetuned-sst-2-english'
          ),
          text,
        });
        return { label, score, source: 'local' as const };
      },
    };
  }

  // Fallback to cloud API for very low-memory devices
  return {
    classify: async (text: string) => {
      const res = await fetch('/api/classify', {
        method: 'POST',
        body: JSON.stringify({ text }),
      });
      const data = await res.json();
      return { ...data, source: 'cloud' as const };
    },
  };
}

The pattern is consistent: check capabilities, prefer local, fall back to cloud. For lightweight models like embeddings (33MB) and classifiers (67MB), the local path works on virtually all modern browsers. For heavier models like LLMs (2-4GB), the capability check matters.

The Latency Advantage

The hybrid architecture is not just cheaper - it is faster for the local path. Cloud API calls include network round-trips that local inference avoids entirely.

Task	Local Latency	Cloud API Latency	Speedup
Embed single query	8-30ms	250-2,000ms	8-66x
Classify text	10-50ms	300-1,500ms	6-30x
Extract entities	30-100ms	500-3,000ms	5-30x
Rerank 20 documents	20-80ms	170-460ms	2-6x
Generate 100 tokens	2-5s	2-8s	1-1.6x

Local latencies are after model warm-up. Cloud latencies include network round-trip from a typical broadband connection and are based on published benchmarks and community reports. For embeddings and classification - the two highest-volume AI operations in most apps - the local path is an order of magnitude faster.

The first-load penalty (model download) is a one-time cost. Models cache in IndexedDB and load from local storage on subsequent visits, typically in under a second.

When Cloud Is the Right Choice

The hybrid architecture explicitly acknowledges that cloud APIs are the better tool for certain tasks. Do not try to force local models into these roles:

Complex multi-step reasoning. "Analyze this contract, identify the three most unusual clauses, and explain how they interact with Section 7" requires a frontier LLM's reasoning depth. Local models handle single-step extraction; cloud models handle chains of inference.

Long-context generation. GPT-5 processes 272K tokens in a single context window. Browser LLMs typically operate with 4-32K context. If your task requires synthesizing information across a 50-page document, use a cloud API.

Frontier-quality creative writing. Generating marketing copy that matches a brand voice, writing fiction with consistent characters, or producing nuanced professional communications - cloud models produce meaningfully better output for these tasks.

Broad multilingual support. Google Translate covers 243 languages in one API call. Local translation requires downloading a separate ~100MB model per language pair. If you need to support more than a handful of languages, cloud is more practical.

Very low-end devices. If your users are on budget phones with 2GB RAM and limited storage, even lightweight models may strain the device. Cloud APIs work on any device with an internet connection.

Cost Model: Why 95/5 Matters at Scale

The financial argument for hybrid architecture compounds with scale. Consider an application with 10,000 daily active users making an average of 50 AI calls per day:

Scenario	Cloud Calls/Year	Cost/Year	Local Calls/Year	Net Cost/Year
100% Cloud	182.5M	$50K - $300K+	0	$50K - $300K+
100% Local	0	$0	182.5M	$0
95/5 Hybrid	9.1M	$2.5K - $15K	173.4M	$2.5K - $15K

The hybrid approach captures 95% of the cost savings while maintaining 100% quality coverage. The 5% that goes to cloud APIs handles the genuinely complex requests where frontier models deliver meaningful value.

Infrastructure costs also drop. Fewer cloud API calls mean fewer backend proxy servers, fewer API key rotation systems, fewer rate-limiting headaches, and fewer 3am billing alerts.

Getting Started

Install the packages and start routing:

npm install @localmode/core @localmode/transformers @localmode/webllm

import { embed, classify, generateText, wrapLanguageModel } from '@localmode/core';
import { transformers } from '@localmode/transformers';
import { webllm } from '@localmode/webllm';

// Local path: embeddings and classification (the 95%)
const { embedding } = await embed({
  model: transformers.embedding('Xenova/bge-small-en-v1.5'),
  value: 'How do I reset my password?',
});

const { label } = await classify({
  model: transformers.classifier(
    'Xenova/distilbert-base-uncased-finetuned-sst-2-english'
  ),
  text: 'This product is amazing!',
});

// Hybrid path: LLM with middleware routing (the 5%)
const hybridModel = wrapLanguageModel({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
  middleware: hybridRoutingMiddleware, // See Step 3 above
});

const { text } = await generateText({
  model: hybridModel,
  prompt: 'Summarize the key points from this feedback...',
  maxTokens: 200,
});

Models download once, cache in the browser, and run offline from that point forward. The hybrid routing middleware handles escalation transparently. Your application code calls the same API regardless of which path is taken.

Methodology

Quality percentages are from our benchmark analysis comparing 18 local model categories against cloud APIs on standard academic benchmarks (MTEB, SQuAD, CoNLL-2003, BLEU, WER, MMLU-Redux).

Sources

Local Model Benchmarks:

BAAI/bge-small-en-v1.5 -- MTEB 62.17, 384 dimensions, 33MB
Qwen3 Technical Report (arXiv:2505.09388) -- Qwen3-4B MMLU-Redux 83.7-84.2%
Qwen3.5-4B model card -- MMLU-Redux 88.8% (thinking mode)
OpenAI text-embedding-3-small -- MTEB 62.3
OpenAI GPT-5 announcement -- MMLU 92.5%

Cloud API Latency:

Embedding provider latency benchmarks (Nixie Search, 2025) -- OpenAI embedding P50 ~250ms-2s
Cohere Rerank latency benchmarks -- Cohere Rerank 3.5 P50 ~171ms (small), ~459ms (large)
Azure OpenAI embedding latency reports -- Azure OpenAI embedding ~1.2s typical

Pricing:

OpenAI API Pricing -- GPT-5 $1.25/$10 per 1M tokens, embeddings $0.020/1M tokens
Cohere Pricing -- Rerank $2/1K searches
Google Cloud Translation Pricing -- $20/1M characters

All code examples use real LocalMode APIs verified against the packages/core/src/ and packages/transformers/src/ source code. All model names are real HuggingFace models with published ONNX weights.

Try it yourself

Visit localmode.ai to try 30+ AI demo apps running entirely in your browser. No sign-up, no API keys, no data leaves your device.

Read the Getting Started guide to add local AI to your application in under 5 minutes.

Frequently Asked Questions