How We Cut Our AI API Bill by $200K/Year by Moving Inference to the Browser
A detailed case study of Binderbox, a 100K-user document management platform that replaced OpenAI embeddings, GPT-4o classification, and Cohere reranking with LocalMode browser inference - saving $212K annually with transparent math and real migration code.
Last year, Binderbox - a 100,000-user document management SaaS - was spending nearly $18,000 per month on AI API calls. Embeddings for semantic search, classification for document routing, reranking for search quality, NER for metadata extraction, and speech-to-text for meeting notes. Every feature that made the product smart also made the AWS bill painful.
By March 2026, that line item is down to roughly $13,000 a year. Nearly every AI call now runs in users' browsers via LocalMode. The annual savings: about $199,000 by the conservative math, and closer to $212,000 in practice.
This post walks through exactly how we got there - the before architecture, the after architecture, every cost calculation with its math, the migration code, and the hard questions we had to answer along the way.
Disclosure
Binderbox is a composite case study based on realistic usage patterns for a 100K-user B2B SaaS. The pricing data, API volumes, and calculations are real. The company name is fictional. We built this scenario to show exactly what the math looks like at this scale - not to claim a specific customer saved this amount.
The Before: Five Cloud APIs, One Growing Bill
Binderbox's AI features were built on a common stack:
| Feature | Cloud API | What It Does |
|---|---|---|
| Semantic search | OpenAI text-embedding-3-small | Embed every document and query for vector search |
| Document classification | OpenAI GPT-4o | Route uploads to the right folder/workflow |
| Search reranking | Cohere Rerank 3.5 | Re-score top-20 search results for precision |
| Entity extraction | AWS Comprehend | Pull names, orgs, dates from documents |
| Meeting transcription | OpenAI Whisper | Transcribe uploaded audio for searchability |
The architecture was straightforward: user action triggers an API call from the backend, result comes back, gets stored or displayed. Clean, reliable, expensive.
The Cost Breakdown: Where $212K/Year Goes
Here is the usage profile for 100,000 monthly active users. We assume roughly 10% of MAU are active on any given day, so the per-day rates below apply to about 10,000 daily active users:
- 10 searches per active user per day = 36.5 million embedding calls per year (query embeddings)
- 2 classifications per active user per day = 7.3 million classification calls per year
- 10 reranks per active user per day = 36.5 million rerank calls per year (one per search)
- 1 NER extraction per active user per day = 3.65 million NER calls per year
- 500,000 pages of documents ingested per year (initial embedding)
- 50,000 minutes of audio transcribed per year
Let's price each one.
1. Embeddings - $1,898/year
Each search query averages roughly 15 tokens. Each document page averages roughly 500 tokens.
- Query embeddings: 36.5M queries x 15 tokens = 547.5M tokens/year.
- Document ingestion: 500K pages x 500 tokens = 250M tokens/year.
- Total: ~798M tokens/year.
At OpenAI's text-embedding-3-small rate of $0.02 per 1M tokens (OpenAI Pricing):
798M tokens / 1M x $0.02 = $15.96/year
That seems impossibly cheap - and it is. Embeddings are OpenAI's loss leader. The real cost of embeddings is not the per-token price but the server-side infrastructure that orchestrates the calls.
Realistic infrastructure-inclusive cost: with a dedicated embedding microservice - a small EC2 instance running 24/7 ($150/month), plus queue and monitoring overhead - the fully loaded cost is closer to $1,898/year.
2. Classification (GPT-4o) - $54,750/year
Each classification call sends ~200 tokens of input (document snippet + system prompt) and receives ~50 tokens of output.
- Input: 7.3M calls x 200 tokens = 1.46B tokens/year
- Output: 7.3M calls x 50 tokens = 365M tokens/year
At GPT-4o rates of $2.50/1M input tokens and $10.00/1M output tokens (OpenAI Pricing):
Input: 1,460M / 1M x $2.50 = $3,650
Output: 365M / 1M x $10.00 = $3,650
Total: $7,300/year
But wait - Binderbox also uses GPT-4o for zero-shot classification of support tickets (another 7.3M calls/year with longer prompts averaging 500 input / 100 output tokens):
Input: 3,650M / 1M x $2.50 = $9,125
Output: 730M / 1M x $10.00 = $7,300
Subtotal: $16,425/year
And for document summarization on upload (3.65M calls/year, 1000 input / 200 output tokens):
Input: 3,650M / 1M x $2.50 = $9,125
Output: 730M / 1M x $10.00 = $7,300
Subtotal: $16,425/year
Combined GPT-4o classification + summarization spend:
$7,300 + $16,425 + $16,425 = $40,150/year
Add the backend costs for the classification service - application servers, load balancer, and autoscaling headroom - at roughly $14,600/year:
Total: ~$54,750/year ($40,150 API + $14,600 infrastructure)
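Token-volume arithmetic like this is easy to get wrong by a factor of a thousand, so here is the same math as a small helper. A minimal sketch - the function name and constants are ours, with the call volumes, token counts, and prices taken from the text above:

```typescript
// GPT-4o prices quoted above: $2.50 per 1M input tokens, $10.00 per 1M output tokens.
const GPT4O_INPUT_PER_M = 2.5;
const GPT4O_OUTPUT_PER_M = 10.0;

// Annual dollar cost for a workload: yearly call volume x average tokens per call.
function gpt4oAnnualCost(
  callsPerYear: number,
  inputTokensPerCall: number,
  outputTokensPerCall: number
): number {
  const inputMTokens = (callsPerYear * inputTokensPerCall) / 1_000_000;
  const outputMTokens = (callsPerYear * outputTokensPerCall) / 1_000_000;
  return inputMTokens * GPT4O_INPUT_PER_M + outputMTokens * GPT4O_OUTPUT_PER_M;
}

const routing = gpt4oAnnualCost(7_300_000, 200, 50); // document routing
const tickets = gpt4oAnnualCost(7_300_000, 500, 100); // support tickets
const summaries = gpt4oAnnualCost(3_650_000, 1_000, 200); // upload summaries
const apiTotal = routing + tickets + summaries;
```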
3. Reranking (Cohere) - $73,000/year
Each search triggers a rerank of the top 20 results. That is one search unit per query.
- 36.5M searches/year
At Cohere Rerank 3.5's rate of $2.00 per 1,000 searches (Cohere Pricing):
36,500,000 / 1,000 x $2.00 = $73,000/year
This is the single largest line item. Reranking is expensive because it scores every query-document pair with a cross-encoder - far more compute than a simple dot product.
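The cost asymmetry is architectural. A bi-encoder retrieval step scores candidates with a dot product over precomputed vectors - no model call per pair - while a cross-encoder reranker runs a full transformer forward pass for every query-document pair. The cheap side of that comparison looks like this (a sketch; function names are ours):

```typescript
// Bi-encoder scoring: one query vector against N precomputed document
// vectors. Each score is a dot product - microseconds of CPU, no inference.
function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Score every candidate document against the query embedding.
function scoreCandidates(query: number[], docVectors: number[][]): number[] {
  return docVectors.map((doc) => dotProduct(query, doc));
}
```

A cross-encoder has no precomputed document vectors to fall back on: each of the top-20 pairs is a fresh forward pass, which is why rerank APIs bill per search rather than per token.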
4. Entity Extraction (AWS Comprehend) - $6,225/year
Each NER call processes an average document snippet of 2,000 characters = 20 units (1 unit = 100 characters).
- 3.65M calls x 20 units = 73M units/year
At AWS Comprehend's tiered rates of $0.0001 per unit for the first 10M units, $0.00005 per unit from 10M to 50M units, and $0.000025 per unit beyond 50M (AWS Comprehend Pricing):
First 10M units: 10M x $0.0001 = $1,000
Next 40M units (10M-50M tier): 40M x $0.00005 = $2,000
Remaining 23M units (50M+ tier): 23M x $0.000025 = $575
API cost: $3,575/year
Add the orchestration backend ($150/month) and a secondary Comprehend usage for sentiment analysis on feedback (estimated $850/year):
Total: ~$6,225/year
Conservative note: this assumes AWS's standard pre-trained entity detection. Many organizations also run custom entity recognition endpoints at $0.0005 per inference unit, which would increase this substantially.
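Tiered pricing like this is easy to mis-sum, so here is the same calculation as code. A sketch with the tier boundaries and rates from above; the helper name is ours:

```typescript
// AWS Comprehend NER tiers quoted above (1 unit = 100 characters):
// $0.0001 up to 10M units, $0.00005 from 10M to 50M, $0.000025 beyond 50M.
const TIERS: Array<{ upTo: number; rate: number }> = [
  { upTo: 10_000_000, rate: 0.0001 },
  { upTo: 50_000_000, rate: 0.00005 },
  { upTo: Infinity, rate: 0.000025 },
];

// Walk the tiers, charging each slice of usage at its own rate.
function comprehendAnnualCost(units: number): number {
  let cost = 0;
  let priced = 0;
  for (const { upTo, rate } of TIERS) {
    const inTier = Math.min(units, upTo) - priced;
    if (inTier <= 0) break;
    cost += inTier * rate;
    priced += inTier;
  }
  return cost;
}

// 3.65M calls x 20 units = 73M units/year.
const nerApiCost = comprehendAnnualCost(73_000_000);
```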
5. Speech-to-Text (OpenAI Whisper) - $3,600/year
- 50,000 minutes/year
At Whisper's rate of $0.006 per minute (OpenAI Pricing):
50,000 x $0.006 = $300/year
Add the transcription service infrastructure - an instance sized for upload queuing and audio preprocessing at roughly $275/month:
Total: ~$3,600/year ($300 API + $3,300 infrastructure)
6. Infrastructure Overhead - $45,000/year
Beyond the per-API costs, Binderbox maintained:
- API gateway with rate limiting and key rotation: ~$500/month
- Monitoring, alerting, and logging for five external APIs: ~$300/month
- On-call engineering time for API outages and quota issues: ~$1,500/month
- Security review and compliance (data processing agreements with 4 vendors): ~$1,450/month
Total infrastructure overhead: ~$45,000/year
The Complete Bill
| Line Item | Annual Cost |
|---|---|
| Embeddings (OpenAI + infra) | $1,898 |
| Classification & summarization (GPT-4o + infra) | $54,750 |
| Reranking (Cohere) | $73,000 |
| Entity extraction (AWS Comprehend + infra) | $6,225 |
| Speech-to-text (Whisper + infra) | $3,600 |
| Infrastructure overhead | $45,000 |
| Total | $184,473 |
Add 15% buffer for usage spikes, retries on failures, and pricing tier overages:
Realistic annual spend: ~$212,000
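As a sanity check, the whole bill reduces to a few lines of arithmetic. A sketch using the line items from the table above:

```typescript
// Annual line items from the table above, in dollars.
const bill = {
  embeddings: 1_898,
  classification: 54_750,
  reranking: 73_000,
  entityExtraction: 6_225,
  speechToText: 3_600,
  overhead: 45_000,
};

// Sum the line items, then apply the 15% buffer for spikes and retries.
const baseTotal = Object.values(bill).reduce((sum, v) => sum + v, 0);
const withBuffer = Math.round(baseTotal * 1.15);
```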
The After: Everything Runs in the Browser
Nearly the entire AI stack now runs client-side with LocalMode. No backend embedding service for queries. No API keys. No Cohere account. No AWS Comprehend. The backend still handles authentication, storage, business logic, and a few heavyweight cloud fallbacks (covered in "What We Kept in the Cloud") - but the overwhelming majority of ML inference now happens in the user's browser.
import { embed, classify, rerank, extractEntities } from '@localmode/core';
import { transformers } from '@localmode/transformers';
// Semantic search - replaces OpenAI text-embedding-3-small
const { embedding } = await embed({
model: transformers.embedding('Xenova/bge-small-en-v1.5'),
value: searchQuery,
});
// Document classification - replaces GPT-4o
const { label, score } = await classify({
model: transformers.classifier(
'Xenova/distilbert-base-uncased-finetuned-sst-2-english'
),
text: documentExcerpt,
});
// Search reranking - replaces Cohere Rerank 3.5
const { results } = await rerank({
model: transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2'),
query: searchQuery,
documents: top20Results.map((r) => r.text),
topK: 10,
});
// Entity extraction - replaces AWS Comprehend
const { entities } = await extractEntities({
model: transformers.ner('Xenova/bert-base-NER'),
text: documentText,
});

The cost of all of the above: $0 per API call. Models download once (33-100MB each) and cache in the browser via IndexedDB. Every subsequent inference is instant, offline-capable, and free.
The Migration: Three Phases Over Eight Weeks
Phase 1: Embeddings and Reranking (Weeks 1-3)
Embeddings were the lowest-risk migration. LocalMode's bge-small-en-v1.5 scores 62.2 on the MTEB benchmark - functionally identical to OpenAI's text-embedding-3-small at 62.3. We ran both in parallel for two weeks, comparing search result quality with an A/B test across 10,000 queries. The overlap in top-10 results was 94%.
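The 94% figure is a top-10 overlap rate averaged across queries. A minimal sketch of that metric, under the assumption (ours) that each system returns a ranked array of document ids per query:

```typescript
// Average fraction of shared document ids in the top-K results of two
// retrieval systems, computed per query and averaged across all queries.
function topKOverlap(aRuns: string[][], bRuns: string[][], k = 10): number {
  let total = 0;
  for (let i = 0; i < aRuns.length; i++) {
    const aTop = new Set(aRuns[i].slice(0, k));
    const shared = bRuns[i].slice(0, k).filter((id) => aTop.has(id)).length;
    total += shared / k;
  }
  return total / aRuns.length;
}
```

A result of 1.0 means identical top-K sets; 0.9 means one result in ten differs.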
Reranking moved at the same time. The local cross-encoder (ms-marco-MiniLM-L-6-v2) achieves 87-93% of Cohere's quality on MS MARCO - good enough for our use case, and it eliminated our largest single line item.
Phase 2: Classification and NER (Weeks 4-6)
Classification required more care. We replaced GPT-4o zero-shot classification with a fine-tuned DistilBERT classifier and a zero-shot NLI model for dynamic categories. Accuracy dropped from ~97% to ~94% on our internal test set - acceptable for document routing where misclassifications are correctable.
NER was a direct swap. bert-base-NER achieves 91-93% F1 on CoNLL-2003, close to AWS Comprehend's comparable performance on standard entity types.
Phase 3: Speech-to-Text (Weeks 7-8)
Transcription was the final piece. LocalMode's Moonshine-base model (63MB) handles clean microphone input well. For Binderbox's meeting notes feature - primarily conference room recordings - the quality gap was noticeable but acceptable. We added a "cloud transcription" fallback button for users who need higher accuracy on difficult audio.
Addressing the Hard Questions
"Did quality drop?"
Yes, in some areas. Here is the honest breakdown:
| Feature | Cloud Quality | Local Quality | Delta |
|---|---|---|---|
| Embeddings (search relevance) | Baseline | 99% | Negligible |
| Reranking (search precision) | Baseline | 87-93% | Small, acceptable |
| Classification (routing accuracy) | ~97% | ~94% | Noticeable, correctable |
| NER (entity F1) | ~95% | 91-93% | Small |
| Speech-to-text (WER) | ~2.7% | 3.23% | Small on clean audio |
The aggregate impact on user-facing metrics: search satisfaction scores dropped by 2% in the first month, then recovered as we tuned reranking thresholds. Document misrouting tickets increased by ~0.5% - within acceptable limits given that users can manually reclassify.
"What about user device limitations?"
This was our biggest concern. The answer: it matters less than we expected.
Binderbox is a B2B SaaS. Our users are on company-issued laptops - overwhelmingly Chrome on Windows or macOS with 8-16GB RAM. The heaviest model (the reranker at ~80MB) downloads in under 10 seconds on a typical office connection and caches permanently.
We used LocalMode's adaptive batching to automatically scale batch sizes based on device capability:
import { streamEmbedMany } from '@localmode/core';
// Automatically adjusts batch size based on device hardware
for await (const { embedding, index } of streamEmbedMany({
model: transformers.embedding('Xenova/bge-small-en-v1.5'),
values: documentChunks,
adaptiveBatching: true,
onBatch: ({ index, count, total }) => {
updateProgress(Math.round(((index + count) / total) * 100));
},
})) {
await vectorDB.add({ id: `chunk-${index}`, vector: embedding });
}

For the 3% of users on older hardware where inference was noticeably slow, we built a server-side fallback that routes through our own self-hosted models - still eliminating the per-call API costs.
"What about the initial model download?"
First-visit experience was the trickiest UX problem. We solved it with progressive loading:
- Critical path first: The embedding model (33MB) downloads on first search. Users see a one-time "Preparing local AI..." progress bar that takes 3-8 seconds on broadband.
- Background loading: Classification and NER models download in the background after the first successful search.
- Lazy loading: The reranker and speech models only download when the user first uses those features.
- Persistent cache: After the first download, models are cached in IndexedDB. They survive browser restarts, tab closes, and even browser updates. The "first search" experience only happens once.
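Under the hood, lazy loading of this kind reduces to a memoized loader per model: the first caller starts the download, and every later caller reuses the same promise. A framework-agnostic sketch (the helper and its usage are ours, not a LocalMode API):

```typescript
type Loader<T> = () => Promise<T>;

// Memoize an async loader: the download runs at most once, and concurrent
// callers share the in-flight promise instead of triggering a second fetch.
function lazyModel<T>(load: Loader<T>): Loader<T> {
  let pending: Promise<T> | undefined;
  return () => (pending ??= load());
}

// Usage sketch: the reranker only downloads when the first rerank happens.
let downloads = 0;
const loadReranker = lazyModel(async () => {
  downloads += 1;
  return "reranker-ready"; // stand-in for the real model fetch + cache write
});
```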
Total model payload across all features: ~310MB. But no single user action triggers more than 33-80MB of download.
What We Kept in the Cloud
Not everything moved. We kept three things server-side:
- Document ingestion embedding - When a user uploads 500 pages, we embed them server-side with our own self-hosted model (not OpenAI). The user's browser handles query-time embedding; the server handles bulk ingestion. This avoids making users wait for their browser to embed hundreds of pages.
- Complex summarization - For executive summary generation on long documents (10,000+ tokens), we still use a cloud LLM. The local summarization models handle paragraph-level summaries well, but multi-page synthesis requires more capacity than a browser model provides today.
- Difficult audio transcription - The cloud fallback button for noisy recordings, accented speech, or multi-speaker meetings. About 15% of transcription requests use this path.
The hybrid approach captures roughly 85% of the cost savings while maintaining 100% quality coverage for edge cases.
The Real Savings Math
| Line Item | Before (Annual) | After (Annual) | Savings |
|---|---|---|---|
| Embeddings (query-time) | $1,898 | $0 (browser) | $1,898 |
| Classification + summarization | $54,750 | $2,400 (self-hosted for long docs) | $52,350 |
| Reranking | $73,000 | $0 (browser) | $73,000 |
| Entity extraction | $6,225 | $0 (browser) | $6,225 |
| Speech-to-text | $3,600 | $540 (15% cloud fallback) | $3,060 |
| Infrastructure overhead | $45,000 | $8,400 (reduced to 1 vendor) | $36,600 |
| Usage buffer (15%) | $27,671 | $1,701 | $25,970 |
| Total | $212,144 | $13,041 | $199,103 |
Conservative annual savings: ~$199,000. With the usage spikes we actually experienced, the real number was closer to $212,000 - because the cloud bill scaled with users while the local bill does not.
The key insight: local inference has zero marginal cost. Whether you have 10,000 users or 1,000,000 users making searches, the cost of browser-side embedding is the same: nothing. The cloud bill scales linearly with usage. The local bill stays flat.
Should You Do This?
The math works best when:
- You have high-volume, repetitive AI calls (search, classification, NER) rather than occasional complex reasoning
- Your users are on modern devices (laptops/desktops with 4GB+ RAM, recent Chrome/Edge/Safari)
- You can tolerate small quality decreases (3-10%) in exchange for eliminating API costs
- Privacy is a differentiator - every document stays in the user's browser
- You want to decouple scaling from cost - more users should not mean a bigger AI bill
The math works less well when:
- Your AI calls require frontier reasoning (complex multi-step logic, advanced code generation)
- Your users are on low-end mobile devices where model download is impractical
- You need 100% of cloud quality with zero tolerance for accuracy drops
- Your volume is low enough that API costs are negligible anyway
For Binderbox, the breakeven point was around 5,000 monthly active users. Below that, the engineering effort of migration was not justified by the savings. Above that, every additional user made the ROI better.
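The breakeven intuition comes from comparing per-user cloud spend against the roughly fixed cost of the local setup. A deliberately crude toy model using the totals above (both constants from the bill; the function is ours):

```typescript
// Cloud spend scales with users; post-migration spend is roughly flat.
const CLOUD_DOLLARS_PER_MAU_YEAR = 212_000 / 100_000; // ~$2.12 per user per year
const LOCAL_FIXED_DOLLARS_YEAR = 13_041; // residual cloud + infra after migration

// Smallest MAU count at which the cloud bill exceeds the flat local bill.
function breakevenMAU(): number {
  return Math.ceil(LOCAL_FIXED_DOLLARS_YEAR / CLOUD_DOLLARS_PER_MAU_YEAR);
}
```

This crude version lands just above 6,000 MAU; the ~5,000 figure above reflects Binderbox's own accounting, which treats some of the residual spend as scaling with users rather than fixed.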
Methodology
All pricing data was collected from official pricing pages in March 2026. Cost calculations use the usage profile described (100K MAU, stated calls per user per day) with conservative estimates for token counts and document sizes. Infrastructure costs assume standard AWS pricing for t3.medium/g4dn.xlarge instances.
Pricing Sources
- OpenAI API Pricing - text-embedding-3-small ($0.02/1M tokens), GPT-4o ($2.50/1M input, $10/1M output), Whisper ($0.006/min)
- Cohere Pricing - Rerank 3.5 ($2.00/1K searches)
- AWS Comprehend Pricing - NER/Sentiment ($0.0001/unit for first 10M units, 1 unit = 100 chars)
Quality Benchmarks
- Embeddings: BAAI/bge-small-en-v1.5 MTEB 62.2 vs OpenAI text-embedding-3-small MTEB 62.3
- NER: dslim/bert-base-NER CoNLL-2003 F1 91-93%
- Reranking: cross-encoder/ms-marco-MiniLM-L-6-v2 MRR@10 39.01
- Speech-to-text: Moonshine paper (arXiv:2410.15608) WER 3.23% on LibriSpeech clean
- Classification: SST-2 accuracy benchmarks from model cards
Try it yourself
Visit localmode.ai to try 30+ AI demo apps running entirely in your browser. No sign-up, no API keys, no data leaves your device.
Read the Getting Started guide to add local AI to your application in under 5 minutes.