Semantic Cache
Cache LLM responses using embedding similarity for instant repeat queries.
When a user asks a question semantically similar to one already answered, the cached response is returned in under 50 ms instead of waiting 5-30 seconds for model inference.
See it in action
Try PDF Search and LLM Chat for working demos of these APIs.
createSemanticCache()
Create a semantic cache instance:
```ts
import { createSemanticCache } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const cache = await createSemanticCache({
  embeddingModel: transformers.embedding('Xenova/bge-small-en-v1.5'),
  threshold: 0.92, // Cosine similarity threshold for cache hits
  maxEntries: 100, // Maximum cached entries (LRU eviction)
  ttlMs: 3600000,  // 1 hour TTL
});
```
SemanticCacheConfig
lookup()
Look up a cached response for a prompt:
```ts
const result = await cache.lookup({
  prompt: 'Explain machine learning',
  modelId: 'llama-3.2',
});

if (result.hit) {
  console.log(result.response);   // Cached response text
  console.log(result.score);      // Similarity score (e.g., 0.95)
  console.log(result.durationMs); // Lookup time in ms
}
```
CacheLookupResult
Lookup behavior
- Exact match fast path -- If the normalized prompt exactly matches a cached entry, the response is returned without computing an embedding.
- Embedding similarity -- The prompt is embedded and compared against all cached entries via the HNSW index.
- Model filtering -- Only entries from the same `modelId` are considered.
- TTL check -- Expired entries are removed lazily and treated as misses.
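The similarity step above can be sketched as plain cosine similarity against the configured threshold. This is an illustration only: the `cosineSimilarity` and `isHit` helpers are hypothetical, and the real cache queries an HNSW index rather than scanning linearly.

```ts
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A lookup counts as a hit when some cached embedding meets the
// threshold (0.92 by default).
function isHit(query: number[], cached: number[][], threshold = 0.92): boolean {
  return cached.some((entry) => cosineSimilarity(query, entry) >= threshold);
}
```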
store()
Store a prompt-response pair:
```ts
const { entryId } = await cache.store({
  prompt: 'What is machine learning?',
  response: 'Machine learning is a subset of AI that...',
  modelId: 'llama-3.2',
});
```
When the cache reaches `maxEntries`, the least-recently-accessed entry is evicted before the new entry is stored.
clear()
Clear cached entries:
```ts
// Clear all entries
await cache.clear();

// Clear entries for a specific model only
await cache.clear({ modelId: 'llama-3.2' });
```
stats()
Get cache statistics:
```ts
const stats = cache.stats();

console.log(stats.entries); // Number of cached entries
console.log(stats.hits);    // Total cache hits
console.log(stats.misses);  // Total cache misses
console.log(stats.hitRate); // Hit rate (0-1)
```
CacheStats
destroy()
Release all resources:
```ts
await cache.destroy();

// All subsequent operations will throw SemanticCacheError
```
semanticCacheMiddleware()
Transparently cache `generateText()` and `streamText()` calls using Language Model Middleware:
```ts
import {
  createSemanticCache,
  semanticCacheMiddleware,
  wrapLanguageModel,
  generateText,
} from '@localmode/core';
import { transformers } from '@localmode/transformers';
import { webllm } from '@localmode/webllm';

// Create cache
const cache = await createSemanticCache({
  embeddingModel: transformers.embedding('Xenova/bge-small-en-v1.5'),
});

// Wrap model with cache middleware
const cachedModel = wrapLanguageModel({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16'),
  middleware: semanticCacheMiddleware(cache),
});

// First call: model generates, result is cached (~5-30s)
const result1 = await generateText({ model: cachedModel, prompt: 'What is AI?' });

// Second call with similar prompt: cached response (~50ms)
const result2 = await generateText({ model: cachedModel, prompt: 'Explain AI' });
```
How middleware works
| Operation | Cache Hit | Cache Miss |
|---|---|---|
| `doGenerate` | Returns cached text with `finishReason: 'stop'` and zero-token usage | Calls model, stores result, returns normally |
| `doStream` | Yields a single chunk with the full cached text and `done: true` | Streams from model, buffers text, stores after completion |
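The cache-then-call pattern the middleware applies to `doGenerate` can be sketched in a few lines. All names here (`CacheLike`, `withCache`, the string-based signatures) are simplified stand-ins, not the library's actual middleware interface, which also handles streaming and token usage:

```ts
type Generate = (prompt: string) => Promise<string>;

// Minimal cache surface assumed for the sketch.
interface CacheLike {
  lookup(prompt: string): Promise<string | undefined>;
  store(prompt: string, response: string): Promise<void>;
}

// Wrap a generate function so repeated prompts skip inference.
function withCache(cache: CacheLike, generate: Generate): Generate {
  return async (prompt) => {
    const cached = await cache.lookup(prompt);
    if (cached !== undefined) return cached; // hit: return cached text
    const response = await generate(prompt); // miss: call the model
    await cache.store(prompt, response);     // store for next time
    return response;
  };
}
```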
TTL and LRU Behavior
- TTL expiration is checked lazily during `lookup()`. Expired entries are removed when accessed.
- LRU eviction occurs during `store()` when `maxEntries` is reached. The least-recently-accessed entry is evicted.
- Expired entries still occupy memory until accessed. This is bounded by `maxEntries` regardless.
Prompt Normalization
When `normalize: true` (the default), prompts are normalized before embedding:

| Input | Normalized |
|---|---|
| `"  What is AI?  "` | `"what is ai?"` |
| `"EXPLAIN\nMACHINE\tLEARNING"` | `"explain machine learning"` |
This prevents trivial mismatches from whitespace, casing, or formatting differences.
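The transformation in the table amounts to trimming, lowercasing, and collapsing whitespace runs. A minimal sketch (the function name is an assumption, not the library's internal helper):

```ts
// Trim, lowercase, and collapse runs of whitespace
// (spaces, tabs, newlines) into single spaces.
function normalizePrompt(prompt: string): string {
  return prompt.trim().toLowerCase().replace(/\s+/g, " ");
}
```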
Events
The cache emits events via `globalEventBus`:
```ts
import { globalEventBus } from '@localmode/core';

globalEventBus.on('cacheHit', ({ prompt, score, modelId, entryId }) => {
  console.log(`Cache hit: ${prompt} (score: ${score})`);
});

globalEventBus.on('cacheMiss', ({ prompt, modelId }) => {
  console.log(`Cache miss: ${prompt}`);
});

globalEventBus.on('cacheStore', ({ prompt, modelId, entryId }) => {
  console.log(`Cached: ${prompt}`);
});

globalEventBus.on('cacheEvict', ({ entryId, reason }) => {
  console.log(`Evicted: ${entryId} (${reason})`);
});

globalEventBus.on('cacheClear', ({ entriesRemoved }) => {
  console.log(`Cleared ${entriesRemoved} entries`);
});
```
Error Handling
```ts
import { SemanticCacheError } from '@localmode/core';

try {
  await cache.lookup({ prompt: 'test', modelId: 'model' });
} catch (error) {
  if (error instanceof SemanticCacheError) {
    console.log(error.code);
    // 'CACHE_DESTROYED' | 'CACHE_LOOKUP_FAILED' | 'CACHE_STORE_FAILED'
    console.log(error.hint);
  }
}
```
React Hook
Use `useSemanticCache()` from `@localmode/react` for automatic lifecycle management:

```tsx
import { useSemanticCache } from '@localmode/react';
import { transformers } from '@localmode/transformers';

function CachedApp() {
  const { cache, stats, isLoading } = useSemanticCache({
    embeddingModel: transformers.embedding('Xenova/bge-small-en-v1.5'),
  });

  if (isLoading || !cache) return <p>Initializing cache...</p>;

  return (
    <div>
      <p>Entries: {stats.entries} | Hit rate: {(stats.hitRate * 100).toFixed(1)}%</p>
    </div>
  );
}
```
The hook creates the cache on mount and calls `destroy()` on unmount.
Recommended Configuration
- Use the same embedding model you already have loaded for search/RAG to avoid loading a second model.
- Start with the default threshold (0.92) and adjust based on your use case.
- Set `maxEntries` based on expected unique queries per session (100 is suitable for most apps).
- Use `storage: 'indexeddb'` only if you need cache persistence across page reloads.
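Putting these recommendations together, a configuration might look like the following sketch. It reuses the option names shown earlier in this page (`embeddingModel`, `threshold`, `maxEntries`, `storage`); treat the combination as illustrative rather than a canonical setup:

```ts
import { createSemanticCache } from '@localmode/core';
import { transformers } from '@localmode/transformers';

// Reuse the embedding model already loaded for search/RAG.
const embeddingModel = transformers.embedding('Xenova/bge-small-en-v1.5');

const cache = await createSemanticCache({
  embeddingModel,
  threshold: 0.92,      // start with the default; tune per use case
  maxEntries: 100,      // roughly the expected unique queries per session
  storage: 'indexeddb', // only if the cache must survive page reloads
});
```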