# Reranking

Improve RAG accuracy by reranking retrieved documents.
Reranking improves the accuracy of RAG (Retrieval-Augmented Generation) pipelines by re-scoring documents based on their relevance to a query. After an initial vector search retrieves candidates, reranking provides a more precise ordering.
## Why Rerank?
Vector search retrieves documents based on embedding similarity, but rerankers use cross-attention to directly score query-document pairs, often producing more accurate rankings for the final generation step.
Typical RAG pipeline:
- Retrieve — Get 20-50 candidates via vector search (fast, approximate)
- Rerank — Score and reorder candidates (precise, slower)
- Generate — Use top 5-10 documents for LLM context
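To make the difference concrete, here is a minimal sketch that scores a single query-document pair both ways: once by comparing embeddings (as vector search does) and once with a cross-encoder reranker. The `cosineSimilarity` helper is written out purely for illustration and is not assumed to be part of the library; the `embed` and `rerank` calls follow the shapes used in the examples below.

```ts
import { embed, rerank } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const embeddingModel = transformers.embedding('Xenova/all-MiniLM-L6-v2');
const rerankerModel = transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2');

// Illustrative helper, not part of the library API
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const query = 'What is machine learning?';
const doc = 'Machine learning is a type of artificial intelligence...';

// Bi-encoder path: embed query and document independently, then compare vectors
const { embedding: queryVector } = await embed({ model: embeddingModel, value: query });
const { embedding: docVector } = await embed({ model: embeddingModel, value: doc });
const similarity = cosineSimilarity(queryVector, docVector);

// Cross-encoder path: the reranker scores the query-document pair jointly
const { results } = await rerank({
  model: rerankerModel,
  query,
  documents: [doc],
});

console.log({ similarity, rerankScore: results[0].score });
```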
## Basic Usage
```ts
import { rerank } from '@localmode/core';
import { transformers } from '@localmode/transformers';

// Create reranker model
const rerankerModel = transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2');

const { results } = await rerank({
  model: rerankerModel,
  query: 'What is machine learning?',
  documents: [
    'Machine learning is a type of artificial intelligence...',
    'Cooking pasta requires boiling water...',
    'Deep learning is a subset of machine learning...',
  ],
  topK: 2,
});

// results: [
//   { index: 0, score: 0.95, text: 'Machine learning is a type of...' },
//   { index: 2, score: 0.88, text: 'Deep learning is a subset of...' }
// ]
```

## RAG Pipeline Example
### Perform Initial Vector Search
Retrieve more candidates than you need—reranking will filter to the best ones.
```ts
import { semanticSearch, createVectorDB, embed } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const embeddingModel = transformers.embedding('Xenova/all-MiniLM-L6-v2');

// `db` is assumed to be a vector database created and populated earlier
// (for example with createVectorDB and embeddings produced by embed).
const { embedding: queryVector } = await embed({
  model: embeddingModel,
  value: 'What is machine learning?',
});

// Get 20 candidates from vector search
const candidates = await db.search(queryVector, { k: 20 });
```

### Rerank the Candidates
Score each document against the query for precise relevance ranking.
```ts
const rerankerModel = transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2');

const { results } = await rerank({
  model: rerankerModel,
  query: 'What is machine learning?',
  documents: candidates.map((c) => c.metadata.text),
  topK: 5, // Keep only top 5 after reranking
});
```

### Use Top Results for Generation
Pass the reranked documents as context to your LLM.
```ts
// Assumes `streamText` and `languageModel` are set up elsewhere in your
// application (any LLM interface that accepts a prompt will do).
const context = results.map((r) => r.text).join('\n\n');

const response = await streamText({
  model: languageModel,
  prompt: `Based on the following context, answer the question.

Context:
${context}

Question: What is machine learning?`,
});
```

## API Reference
### rerank(options)

Reranks documents by relevance to a query.

| Prop | Type |
|---|---|
| `model` | `RerankerModel` |
| `query` | `string` |
| `documents` | `string[]` |
| `topK` | `number` (optional) |
| `abortSignal` | `AbortSignal` (optional) |

#### Return Type: RerankResult

| Prop | Type |
|---|---|
| `results` | `RankedDocument[]` |

#### RankedDocument

| Prop | Type |
|---|---|
| `index` | `number` |
| `score` | `number` |
| `text` | `string` |
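For reference, here is a single call that exercises every option above. It reuses the `rerankerModel` created in Basic Usage; the document strings are placeholders.

```ts
const controller = new AbortController();

const { results } = await rerank({
  model: rerankerModel,                  // RerankerModel
  query: 'What is machine learning?',    // string
  documents: ['First passage...', 'Second passage...'],
  topK: 2,                               // optional: cap the number of results
  abortSignal: controller.signal,        // optional: allow cancellation
});

// Each result is a RankedDocument: { index, score, text }
for (const { index, score, text } of results) {
  console.log(index, score.toFixed(3), text);
}
```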
## Supported Models
Cross-encoder models score query-document pairs directly:
| Model | Size | Speed | Quality | Use Case |
|---|---|---|---|---|
| `Xenova/ms-marco-MiniLM-L-6-v2` | 23 MB | Fast | Good | General purpose |
| `Xenova/ms-marco-MiniLM-L-12-v2` | 33 MB | Medium | Better | Higher accuracy |
| `Xenova/bge-reranker-base` | 110 MB | Slower | Best | Maximum quality |
Choose based on your needs:

- Speed-critical: Use `ms-marco-MiniLM-L-6-v2` for fast inference
- Balanced: Use `ms-marco-MiniLM-L-12-v2` for good accuracy with reasonable speed
- Quality-critical: Use `bge-reranker-base` when accuracy matters most
Start with `ms-marco-MiniLM-L-6-v2`: it's a great balance of speed and quality for most applications.
## Cancellation Support
All reranking operations support AbortSignal for cancellation:
```ts
const controller = new AbortController();

// Cancel after 5 seconds
setTimeout(() => controller.abort(), 5000);

try {
  const { results } = await rerank({
    model: rerankerModel,
    query: 'What is AI?',
    documents: largeDocumentSet,
    abortSignal: controller.signal,
  });
} catch (error) {
  if (error.name === 'AbortError') {
    console.log('Reranking was cancelled');
  }
}
```

## Performance Tips
Optimize your reranking pipeline:
- Limit candidates: Retrieve 20-50 candidates, not hundreds
- Use topK: Only return the documents you need
- Batch when possible: Rerank multiple queries together if your use case allows
- Cache results: Consider caching reranked results for repeated queries
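As a sketch of the caching tip, the wrapper below memoizes reranked results per query in a plain in-memory Map. The key scheme, eviction, and invalidation are left out and would depend on your application; it also assumes the candidate set for a given query is stable between calls.

```ts
// Shape of a reranked result, as documented above
type CachedResult = { index: number; score: number; text: string };

// Minimal in-memory cache keyed by query text (no eviction or TTL)
const rerankCache = new Map<string, CachedResult[]>();

async function cachedRerank(query: string, documents: string[]): Promise<CachedResult[]> {
  const hit = rerankCache.get(query);
  if (hit) return hit;

  const { results } = await rerank({
    model: rerankerModel,
    query,
    documents,
    topK: 5,
  });

  rerankCache.set(query, results);
  return results;
}
```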
## Custom Reranker Implementation
Implement the RerankerModel interface to create custom rerankers:
```ts
import type { RerankerModel, DoRerankOptions, DoRerankResult } from '@localmode/core';

class MyCustomReranker implements RerankerModel {
  readonly modelId = 'custom:my-reranker';
  readonly provider = 'custom';

  async doRerank(options: DoRerankOptions): Promise<DoRerankResult> {
    const { query, documents, topK } = options;

    // Your scoring logic here
    const scored = documents.map((doc, index) => ({
      index,
      score: this.scoreDocument(query, doc),
      text: doc,
    }));

    // Sort by score descending
    scored.sort((a, b) => b.score - a.score);

    // Apply topK
    const results = topK ? scored.slice(0, topK) : scored;

    return {
      results,
      usage: {
        inputTokens: query.length + documents.join('').length,
        durationMs: 0,
      },
    };
  }

  private scoreDocument(query: string, document: string): number {
    // Implement your scoring logic
    return 0.5;
  }
}
```
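Once defined, a custom reranker is passed to rerank() the same way as a built-in model (a minimal usage sketch, assuming rerank() accepts any RerankerModel implementation as this section implies):

```ts
const customReranker = new MyCustomReranker();

const { results } = await rerank({
  model: customReranker,
  query: 'What is machine learning?',
  documents: [
    'Machine learning is a type of artificial intelligence...',
    'Cooking pasta requires boiling water...',
  ],
  topK: 1,
});

console.log(results[0]);
```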