RAG
Build retrieval-augmented generation pipelines with chunking, ingestion, and hybrid search.
RAG (Retrieval-Augmented Generation) combines vector search with language models to answer questions from your documents. LocalMode provides all the building blocks: chunking, ingestion, semantic search, reranking, and hybrid search.
See it in action
Try Semantic Search and PDF Search for working demos of these APIs.
RAG Pipeline Overview
Chunk Documents
Split documents into smaller, semantically meaningful pieces.
import { chunk } from '@localmode/core';
const chunks = chunk(documentText, {
strategy: 'recursive',
size: 512,
overlap: 50,
});
Generate Embeddings & Store
Create embeddings and store in a vector database.
import { ingest, createVectorDB } from '@localmode/core';
const db = await createVectorDB({ name: 'docs', dimensions: 384 });
await ingest({ db, model: embeddingModel, documents: chunks });
Search & Retrieve
Find relevant chunks using semantic search.
import { semanticSearch } from '@localmode/core';
const results = await semanticSearch({
db,
model: embeddingModel,
query: userQuestion,
k: 10,
});
Rerank for Precision
Optionally rerank results for better accuracy.
import { rerank } from '@localmode/core';
const reranked = await rerank({
model: rerankerModel,
query: userQuestion,
documents: results.map((r) => r.metadata.text),
topK: 5,
});
Generate Answer
Use an LLM to generate an answer from the context.
import { streamText } from '@localmode/core';
const stream = await streamText({
model: llm,
prompt: `Context:\n${context}\n\nQuestion: ${userQuestion}`,
});
Chunking
Split documents into smaller pieces for better retrieval:
ChunkOptions
The separators and language options are strategy-specific: separators applies only to the 'recursive' strategy, and language applies only to the 'code' strategy. Both are ignored when using other strategies.
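To illustrate what separators controls, here is a minimal sketch of recursive splitting (illustrative only; the real chunk() implementation may differ):

```typescript
// Minimal sketch of recursive splitting with custom separators: try each
// separator in order, recursing with finer separators for oversized pieces.
function splitRecursive(text: string, separators: string[], maxSize: number): string[] {
  if (text.length <= maxSize || separators.length === 0) return [text];
  const [sep, ...rest] = separators;
  const out: string[] = [];
  for (const part of text.split(sep)) {
    if (part.length > maxSize) {
      out.push(...splitRecursive(part, rest, maxSize)); // still too large: recurse
    } else if (part.length > 0) {
      out.push(part);
    }
  }
  return out;
}
```

Each separator is tried in order, so coarse boundaries (paragraphs) are preferred and finer ones (sentences, words) are used only for pieces that remain too large.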
Semantic Chunking
Semantic chunking uses an embedding model to split text at actual topic boundaries, rather than relying on syntactic markers like paragraphs or sentences. This produces chunks that are internally coherent and separated at natural meaning shifts.
import { semanticChunk } from '@localmode/core';
import { transformers } from '@localmode/transformers';
const model = transformers.embedding('Xenova/bge-small-en-v1.5');
const chunks = await semanticChunk({
text: longDocument,
model,
size: 1000, // Max chunk size (default: 2000)
minSize: 100, // Min chunk size (default: 100)
threshold: 0.4, // Optional: explicit similarity threshold
});
chunks.forEach((c) => {
console.log(`Chunk ${c.index}: ${c.text.substring(0, 60)}...`);
console.log(` Boundaries:`, c.metadata?.semanticBoundaries);
});
How It Works
- Pre-split — Text is split into sentence-level segments using the recursive chunker
- Embed — Each segment is embedded using the provided model (a single embedMany() call)
- Compare — Cosine similarity is computed between adjacent segment embeddings
- Detect — Breakpoints are identified where similarity drops below the threshold
- Merge — Segments between breakpoints are merged into final chunks with size constraints
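The Compare step uses standard cosine similarity between adjacent embeddings; for reference, a self-contained version of the formula:

```typescript
// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
// Standard formula, shown for reference; not the library source.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```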
Auto-Threshold Detection
When no threshold is specified, the algorithm computes it automatically from the similarity distribution using mean - standard deviation. This adapts to any embedding model's similarity range.
// Auto-threshold (recommended for most use cases)
const chunks = await semanticChunk({ text, model });
// Explicit threshold (when you know your model's similarity range)
const chunks = await semanticChunk({ text, model, threshold: 0.5 });
A threshold of 0 produces a single chunk (no breakpoints), while a threshold of 1.0 produces maximum splitting (nearly every boundary becomes a break).
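The mean minus standard deviation rule is simple to sketch (a self-contained illustration of the idea, not the library source):

```typescript
// Auto-threshold sketch: breakpoints fall where an adjacent-segment
// similarity drops below (mean - standard deviation) of all similarities.
function autoThreshold(similarities: number[]): number {
  const n = similarities.length;
  const mean = similarities.reduce((acc, s) => acc + s, 0) / n;
  const variance = similarities.reduce((acc, s) => acc + (s - mean) ** 2, 0) / n;
  return mean - Math.sqrt(variance);
}
```

Because the threshold is derived from the observed distribution, a model whose similarities cluster near 0.9 and one clustering near 0.5 both get sensible breakpoints.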
SemanticChunkOptions
Boundary Metadata
Each chunk includes semantic boundary scores in metadata.semanticBoundaries:
interface SemanticBoundaries {
leftSimilarity: number | null; // null for first chunk
rightSimilarity: number | null; // null for last chunk
}
Lower similarity at a boundary means a stronger topic shift. Use this for debugging chunk quality or implementing adaptive retrieval strategies.
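For example, a sketch of using boundary scores to flag chunks that open with a strong topic shift (the chunk shape is assumed from the interface above; the helper name is hypothetical):

```typescript
interface SemanticBoundaries {
  leftSimilarity: number | null; // null for first chunk
  rightSimilarity: number | null; // null for last chunk
}
interface ChunkLike {
  index: number;
  metadata?: { semanticBoundaries?: SemanticBoundaries };
}

// Return indices of chunks whose left boundary marks a strong topic shift
// (low similarity to the preceding chunk).
function strongShiftIndices(chunks: ChunkLike[], cutoff = 0.3): number[] {
  return chunks
    .filter((c) => {
      const left = c.metadata?.semanticBoundaries?.leftSimilarity;
      return left !== null && left !== undefined && left < cutoff;
    })
    .map((c) => c.index);
}
```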
Reusable Chunker Factory
import { createSemanticChunker } from '@localmode/core';
const chunker = createSemanticChunker({
model,
threshold: 0.4,
size: 1000,
});
const chunks1 = await chunker('First document...');
const chunks2 = await chunker('Second document...', { threshold: 0.6 });
Pipeline Integration
Use pipelineSemanticChunkStep() in pipeline workflows:
import { createPipeline, pipelineSemanticChunkStep, pipelineEmbedManyStep } from '@localmode/core';
const pipeline = createPipeline('semantic-rag')
.step('chunk', pipelineSemanticChunkStep(embeddingModel, { size: 1000 }))
.step('embed', pipelineEmbedManyStep(embeddingModel))
.build();
const { result } = await pipeline.run(documentText);
React Hook
Use useSemanticChunk from @localmode/react for component-level semantic chunking with built-in loading, error, and cancellation state:
import { useSemanticChunk } from '@localmode/react';
import { transformers } from '@localmode/transformers';
const model = transformers.embedding('Xenova/bge-small-en-v1.5');
function SemanticChunker() {
const { data: chunks, isLoading, error, execute, cancel } = useSemanticChunk({
model,
threshold: 0.4,
size: 1000,
});
return (
<div>
<button onClick={() => execute('Long document text...')} disabled={isLoading}>
Chunk
</button>
{isLoading && <button onClick={cancel}>Cancel</button>}
{chunks?.map((c) => <p key={c.index}>{c.text.substring(0, 80)}...</p>)}
</div>
);
}
Semantic vs. Recursive Chunking
| Aspect | Recursive | Semantic |
|---|---|---|
| Speed | Fast (synchronous) | Slower (requires embedding) |
| Topic awareness | None (syntactic only) | High (embedding-based) |
| Dependencies | None | Requires EmbeddingModel |
| Best for | General documents, code, markdown | Long documents with topic shifts |
| API | chunk(text, options) | semanticChunk(options) |
Semantic chunking is async because it requires embedding each segment. For a 5,000-word document (~200 sentences at 384 dimensions), this involves a single embedMany() call. Use abortSignal for cancellation on long documents.
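Cancellation follows the standard AbortSignal pattern: check the signal between units of work and bail out early when it fires. A simplified, self-contained sketch (the per-segment embedding work is stubbed out):

```typescript
// Cooperative cancellation with AbortSignal (standard web/Node API).
// The loop body stands in for the real per-segment embedding call.
async function embedAllCancellable(segments: string[], signal?: AbortSignal): Promise<number> {
  let embedded = 0;
  for (const _segment of segments) {
    if (signal?.aborted) throw new Error('Chunking aborted');
    embedded++; // placeholder for the actual embedding work
  }
  return embedded;
}
```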
Ingestion
Ingest documents into a vector database:
import { createVectorDB, ingest } from '@localmode/core';
import { transformers } from '@localmode/transformers';
const model = transformers.embedding('Xenova/bge-small-en-v1.5');
const db = await createVectorDB({ name: 'docs', dimensions: 384 });
await ingest({
db,
model,
documents: [
{ text: 'First document...', metadata: { source: 'doc1.txt' } },
{ text: 'Second document...', metadata: { source: 'doc2.txt' } },
],
});
With Automatic Chunking
await ingest({
db,
model,
documents: [{ text: longDocument, metadata: { source: 'book.txt' } }],
chunkOptions: {
strategy: 'recursive',
size: 512,
overlap: 50,
},
});
With Progress Tracking
await ingest({
db,
model,
documents: largeDocumentArray,
onProgress: (progress) => {
console.log(`Ingested ${progress.completed}/${progress.total} documents`);
},
});
Semantic Search
Search for relevant chunks:
import { semanticSearch } from '@localmode/core';
const results = await semanticSearch({
db,
model,
query: 'What are the benefits of machine learning?',
k: 5,
});
results.forEach((r) => {
console.log(`Score: ${r.score.toFixed(3)}`);
console.log(`Text: ${r.metadata.text}`);
});
Reranking
Improve results with cross-encoder reranking:
import { rerank } from '@localmode/core';
import { transformers } from '@localmode/transformers';
const rerankerModel = transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2');
// Get initial results
const results = await semanticSearch({ db, model, query, k: 20 });
// Rerank for better accuracy
const { results: reranked } = await rerank({
model: rerankerModel,
query,
documents: results.map((r) => r.metadata.text as string),
topK: 5,
});
reranked.forEach((r) => {
console.log(`Score: ${r.score.toFixed(3)}`);
console.log(`Text: ${r.text.substring(0, 100)}...`);
});
When to Use Reranking
Reranking improves accuracy but adds latency. Use it when:
- Accuracy is more important than speed
- You're building a Q&A system
- Initial results may have false positives
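The usual shape is to over-retrieve with fast vector search, then let the slower cross-encoder reorder a larger candidate set down to topK. A generic sketch with a stand-in scoring function (not the rerank() API):

```typescript
// Retrieve-wide, rerank-narrow: rescore candidates with a (stubbed)
// cross-encoder and keep only the top K.
async function retrieveThenRerank(
  candidates: { text: string }[],
  scoreFn: (text: string) => Promise<number>, // stand-in for a reranker model
  topK: number,
): Promise<{ text: string; score: number }[]> {
  const rescored = await Promise.all(
    candidates.map(async (c) => ({ text: c.text, score: await scoreFn(c.text) })),
  );
  return rescored.sort((a, b) => b.score - a.score).slice(0, topK);
}
```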
BM25 Keyword Search
For exact keyword matching:
import { createBM25 } from '@localmode/core';
const bm25 = createBM25(documents.map((d) => d.text));
const keywordResults = bm25.search('machine learning');
keywordResults.forEach((r) => {
console.log(`Score: ${r.score.toFixed(3)}, Index: ${r.index}`);
});
Hybrid Search
Combine semantic and keyword search:
import { semanticSearch, createBM25, hybridFuse } from '@localmode/core';
// Semantic search
const semanticResults = await semanticSearch({ db, model, query, k: 20 });
// BM25 keyword search
const bm25 = createBM25(documents.map((d) => d.text));
const keywordResults = bm25.search(query);
// Combine with fusion
const hybridResults = hybridFuse({
semantic: semanticResults.map((r) => ({
id: r.id,
score: r.score,
})),
keyword: keywordResults.map((r) => ({
id: documents[r.index].id,
score: r.score,
})),
k: 10,
alpha: 0.7, // Weight for semantic (0.7 = 70% semantic, 30% keyword)
});
Reciprocal Rank Fusion
Alternative fusion method:
import { reciprocalRankFusion } from '@localmode/core';
const fused = reciprocalRankFusion({
rankings: [semanticResults.map((r) => r.id), keywordResults.map((r) => documents[r.index].id)],
k: 10,
constant: 60, // RRF constant (default: 60)
});
Complete RAG Pipeline
Here's a complete example:
import { createVectorDB, chunk, ingest, semanticSearch, rerank, streamText } from '@localmode/core';
import { transformers } from '@localmode/transformers';
import { webllm } from '@localmode/webllm';
// 1. Setup models
const embeddingModel = transformers.embedding('Xenova/bge-small-en-v1.5');
const rerankerModel = transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2');
const llm = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
// 2. Create database
const db = await createVectorDB({ name: 'knowledge-base', dimensions: 384 });
// 3. Ingest documents
async function ingestDocuments(documents: Array<{ text: string; source: string }>) {
for (const doc of documents) {
const chunks = chunk(doc.text, {
strategy: 'recursive',
size: 512,
overlap: 50,
});
await ingest({
db,
model: embeddingModel,
documents: chunks.map((c) => ({
text: c.text,
metadata: {
source: doc.source,
start: c.start,
end: c.end,
},
})),
});
}
}
// 4. Query function
async function query(question: string) {
// Retrieve
const results = await semanticSearch({
db,
model: embeddingModel,
query: question,
k: 10,
});
// Rerank
const { results: reranked } = await rerank({
model: rerankerModel,
query: question,
documents: results.map((r) => r.metadata.text as string),
topK: 3,
});
// Generate
const context = reranked.map((r) => r.text).join('\n\n---\n\n');
const stream = await streamText({
model: llm,
prompt: `You are a helpful assistant. Answer based only on the context provided.
If the answer is not in the context, say "I don't have that information."
Context:
${context}
Question: ${question}
Answer:`,
});
return stream;
}
// Usage
const stream = await query('What is machine learning?');
for await (const part of stream) {
process.stdout.write(part.text);
}
Composable Pipelines
For a builder API that chains these steps with progress tracking and cancellation, see the Pipelines guide. Pre-built step factories like pipelineEmbedStep, pipelineSearchStep, and pipelineRerankStep let you assemble RAG workflows declaratively.
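The underlying idea is plain function composition over async steps; a generic (non-LocalMode) sketch of step chaining:

```typescript
// Generic async step chaining: each step consumes the previous step's output.
type Step<I, O> = (input: I) => O | Promise<O>;

function chain<A, B, C>(first: Step<A, B>, second: Step<B, C>): Step<A, C> {
  return async (input) => second(await first(input));
}
```

A builder API layers progress tracking and cancellation on top of this composition, but the data flow is the same.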
Document Loaders
Load documents from various formats:
import { TextLoader, JSONLoader, CSVLoader, HTMLLoader } from '@localmode/core';
import { PDFLoader } from '@localmode/pdfjs';
// Text files
const textLoader = new TextLoader();
const { documents: textDocs } = await textLoader.load(textBlob);
// JSON
const jsonLoader = new JSONLoader({ textField: 'content' });
const { documents: jsonDocs } = await jsonLoader.load(jsonBlob);
// CSV
const csvLoader = new CSVLoader({ textColumn: 'description' });
const { documents: csvDocs } = await csvLoader.load(csvBlob);
// HTML
const htmlLoader = new HTMLLoader({ selector: 'article' });
const { documents: htmlDocs } = await htmlLoader.load(htmlBlob);
// PDF
const pdfLoader = new PDFLoader({ splitByPage: true });
const { documents: pdfDocs } = await pdfLoader.load(pdfBlob);
Streaming Semantic Search
Stream results progressively for real-time UI updates:
import { streamSemanticSearch } from '@localmode/core';
for await (const result of streamSemanticSearch({
db,
model,
query: 'authentication',
k: 100,
})) {
// Render each result as it arrives
console.log(result.id, result.score);
}
streamSemanticSearch() accepts the same options as semanticSearch() but returns an AsyncGenerator that yields results one at a time.
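Because the generator yields incrementally, a consumer can stop early, for example rendering the first few results immediately and abandoning the rest. A self-contained sketch with a stub generator (streamSemanticSearch itself is consumed the same way):

```typescript
// Early exit from an async generator: `break` stops consumption, so the
// remaining results are never pulled. Stub generator for illustration.
async function* stubResults(): AsyncGenerator<{ id: string; score: number }> {
  yield { id: 'a', score: 0.91 };
  yield { id: 'b', score: 0.85 };
  yield { id: 'c', score: 0.44 };
}

async function firstN(n: number): Promise<string[]> {
  const ids: string[] = [];
  for await (const r of stubResults()) {
    ids.push(r.id);
    if (ids.length >= n) break; // stop pulling further results
  }
  return ids;
}
```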
Chunker Factories
Create reusable, pre-configured chunkers:
import {
createChunker,
createRecursiveChunker,
createMarkdownChunker,
createCodeChunker,
} from '@localmode/core';
| Factory | Description |
|---|---|
| createChunker(options) | Generic chunker with configurable strategy |
| createRecursiveChunker(options) | Recursive text splitting with custom separators |
| createMarkdownChunker(options) | Markdown-aware chunking that respects headings and structure |
| createCodeChunker(options) | Code-aware chunking that respects functions and blocks |
// Create a reusable markdown chunker
const chunker = createMarkdownChunker({ size: 500, overlap: 50 });
// Use it on multiple documents
const chunks1 = chunker('# Document 1\n\nContent here...');
const chunks2 = chunker('# Document 2\n\nMore content...');
Chunk Statistics
import { getChunkStats, estimateChunkCount } from '@localmode/core';
// Estimate how many chunks a text will produce
const count = estimateChunkCount(longText, { strategy: 'recursive', size: 500 });
console.log(`Will produce ~${count} chunks`);
// Get statistics from existing chunks
const stats = getChunkStats(chunks);
// { count, totalLength, avgLength, minLength, maxLength }
Batch Processing & Ingestion Pipeline
For large-scale document processing, use the ingestion utilities:
chunkDocuments
Chunk multiple source documents at once:
import { chunkDocuments } from '@localmode/core';
const chunks = chunkDocuments(documents, {
chunking: { strategy: 'recursive', size: 500, overlap: 50 },
idPrefix: 'doc',
});
// Returns array of { id, text, sourceDocId, chunkIndex, start, end, metadata }
ingest
Full ingestion pipeline — chunk, embed, and store in one call:
import { ingest } from '@localmode/core';
const result = await ingest(db, documents, {
chunking: { strategy: 'recursive', size: 500 },
generateEmbeddings: true,
embedder: async (texts) => {
const { embeddings } = await embedMany({ model, values: texts });
return embeddings;
},
batchSize: 50,
onProgress: (progress) => {
console.log(`${progress.phase}: ${progress.chunksProcessed}/${progress.totalChunks}`);
},
});
createIngestPipeline
Create a reusable pipeline with default options:
import { createIngestPipeline } from '@localmode/core';
const pipeline = createIngestPipeline(db, {
chunking: { strategy: 'recursive', size: 500 },
generateEmbeddings: true,
embedder: myEmbedder,
});
// Ingest multiple batches with same config
await pipeline(batch1);
await pipeline(batch2);
estimateIngestion
Estimate ingestion stats before processing:
import { estimateIngestion } from '@localmode/core';
const estimate = estimateIngestion(documents, {
chunking: { strategy: 'recursive', size: 500 },
});
console.log(`Documents: ${estimate.totalDocuments}`);
console.log(`Estimated chunks: ${estimate.estimatedChunks}`);
console.log(`Avg chunk size: ${estimate.avgChunkSize} chars`);
Best Practices
RAG Best Practices
- Chunk size - 256-512 chars works well for most cases
- Overlap - 10-20% overlap helps maintain context
- Reranking - Always rerank for Q&A applications
- Hybrid search - Combine semantic + keyword for robust results
- Context window - Don't exceed the LLM's context limit
- Use estimateIngestion - Check chunk counts before large ingestions
- Streaming search - Use streamSemanticSearch for progressive UI rendering
Next Steps
Document Loaders
Load documents from various formats.
Text Generation
Generate and stream text with LLMs.
WebLLM
Run LLMs locally in the browser.