Semantic Chunking: Split Documents by Topic, Not by Token Count
Fixed-size chunking splits documents mid-thought, mid-paragraph, mid-argument. Semantic chunking uses embeddings to detect where topics actually change and splits at those boundaries - producing chunks that are topically coherent and dramatically better for retrieval. A deep dive into the algorithm, the API, and when to use each of LocalMode's four chunking strategies.
You have a 12-page product requirements document. It covers user authentication, payment processing, notification preferences, and admin dashboards - four distinct topics across roughly 8,000 words. You chunk it into 500-character segments with 50-character overlap, embed each chunk, and store the vectors in your RAG pipeline.
A user asks: "What are the requirements for payment retry logic?"
The answer spans three paragraphs in the original document - paragraphs 14 through 16. But your chunker sliced paragraph 14 in half. The first half landed in chunk 27 with the tail end of a section about session timeouts. The second half merged with paragraph 15 into chunk 28, which then got cut off mid-sentence before the critical retry interval specification in paragraph 16.
Your retrieval returns chunk 28 (partial match) and chunk 12 (about login rate limiting - similar vocabulary, wrong topic). The LLM hallucinates a retry interval because the actual number never made it into the retrieved context.
This is not a model problem. It is a chunking problem.
Why Fixed-Size Chunks Fail
Every chunking strategy makes a tradeoff between simplicity and semantic coherence. Fixed-size recursive chunking is the default in most RAG systems because it is fast, deterministic, and requires no ML model. But it is fundamentally unaware of what the text means.
Consider this document fragment about two unrelated topics:
┌─────────────────────────────────────────────────────────┐
│ ORIGINAL DOCUMENT │
│ │
│ Paragraph 1: Payment processing uses Stripe Connect │
│ for marketplace payouts. Sellers receive funds within │
│ 2-3 business days after order confirmation... │
│ │
│ Paragraph 2: Retry logic follows exponential backoff │
│ with a base interval of 30 seconds. After 5 failed │
│ attempts, the transaction is marked as failed and │
│ the customer is notified via email... │
│ │
│ Paragraph 3: The notification system uses Firebase │
│ Cloud Messaging for push notifications. Users can │
│ configure quiet hours in their profile settings... │
│ │
│ Paragraph 4: Email notifications support HTML │
│ templates with dynamic variables. Templates are │
│ stored in S3 and cached for 24 hours... │
│ │
└─────────────────────────────────────────────────────────┘

A 500-character recursive chunker with 50-character overlap produces this:
FIXED-SIZE CHUNKING (500 chars)
─────────────────────────────────────────────────
Chunk 1: [Payment processing ... after order con-]
↑
cut mid-sentence
Chunk 2: [-firmation... Retry logic ... 5 failed ]
↑ ↑
mixed topics! cut mid-thought
Chunk 3: [attempts ... Firebase Cloud Messaging fo-]
↑ ↑
mixed topics! cut mid-word
Chunk 4: [-r push notifications ... cached for 24h]

Chunks 2 and 3 each contain fragments of two different topics. When a user asks about retry logic, the retriever has to hope that the partial mention in chunk 2 scores high enough, even though half that chunk is about payment processing.
Now look at what semantic chunking produces:
SEMANTIC CHUNKING (topic boundaries)
─────────────────────────────────────────────────
Chunk 1: [Payment processing uses Stripe Connect
for marketplace payouts. Sellers receive
funds within 2-3 business days... Retry
logic follows exponential backoff with a
base interval of 30 seconds. After 5
failed attempts, the transaction is
marked as failed and the customer is
notified via email.]
↑
TOPIC: payments (complete)
Chunk 2: [The notification system uses Firebase
Cloud Messaging for push notifications.
Users can configure quiet hours... Email
notifications support HTML templates with
dynamic variables. Templates are stored
in S3 and cached for 24 hours.]
↑
TOPIC: notifications (complete)

Each chunk contains a complete topic. The payment chunk includes all the retry logic. The notification chunk includes both push and email details. When the user asks about retry intervals, chunk 1 matches cleanly and contains the full answer.
How Semantic Chunking Works
The algorithm in LocalMode's semanticChunk() follows a five-phase process. No magic - just embeddings and cosine similarity applied at the right granularity.
Phase 1: Pre-split into Sentences
The text is first split into small sentence-level segments using the recursive chunker. The default segment size is 200 characters - small enough that each segment typically contains one or two sentences.
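As a rough illustration of this step, here is a minimal sentence packer in TypeScript. It is a simplified stand-in for the library's recursive chunker, assuming plain `.`/`!`/`?` sentence boundaries:

```typescript
// Illustrative pre-split sketch: break text at sentence boundaries,
// then pack sentences into segments of at most `maxLen` characters.
// The real recursive chunker uses a fuller separator hierarchy.
function preSplit(text: string, maxLen = 200): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+(\s|$)/g) ?? [text];
  const segments: string[] = [];
  let current = '';
  for (const raw of sentences) {
    const sentence = raw.trim();
    if (current && current.length + sentence.length + 1 > maxLen) {
      // Adding this sentence would overflow the segment: start a new one.
      segments.push(current);
      current = sentence;
    } else {
      current = current ? `${current} ${sentence}` : sentence;
    }
  }
  if (current) segments.push(current);
  return segments;
}
```

Real documents need more care - abbreviations, quotes, and lists all defeat a naive regex - which is why the library reuses its recursive chunker here rather than a sentence regex.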
ORIGINAL TEXT
─────────────────────────────────────────────────
"Payment processing uses Stripe Connect for
marketplace payouts. Sellers receive funds
within 2-3 business days after order
confirmation. Retry logic follows exponential
backoff with a base interval of 30 seconds.
After 5 failed attempts the transaction is
marked as failed. The notification system uses
Firebase Cloud Messaging for push notifications.
Users can configure quiet hours in settings."
AFTER SENTENCE SPLIT (~200 char segments)
─────────────────────────────────────────────────
Seg 0: "Payment processing uses Stripe Connect
for marketplace payouts."
Seg 1: "Sellers receive funds within 2-3
business days after order confirmation."
Seg 2: "Retry logic follows exponential backoff
with a base interval of 30 seconds."
Seg 3: "After 5 failed attempts the transaction
is marked as failed."
Seg 4: "The notification system uses Firebase
Cloud Messaging for push notifications."
Seg 5: "Users can configure quiet hours in
settings."

Phase 2: Embed All Segments
Every segment is embedded in a single batch call to embedMany(). This is the only expensive step - the rest is pure arithmetic.
const { embeddings } = await embedMany({
model,
values: segmentTexts, // ["Payment processing...", "Sellers receive...", ...]
abortSignal,
});
// embeddings: Float32Array[] - one 384-dim vector per segment

Phase 3: Compute Adjacent Similarities
For each pair of neighboring segments, compute cosine similarity between their embeddings. High similarity means the two segments are about the same topic. A drop in similarity signals a topic change.
ADJACENT COSINE SIMILARITIES
─────────────────────────────────────────────────
Seg 0 ↔ Seg 1: 0.87 (both about payments)
Seg 1 ↔ Seg 2: 0.72 (payments → retry logic)
Seg 2 ↔ Seg 3: 0.91 (both about retry logic)
Seg 3 ↔ Seg 4: 0.31 ← TOPIC SHIFT (retry → notifications)
Seg 4 ↔ Seg 5: 0.83 (both about notifications)

The similarity between segments 3 and 4 is dramatically lower than all others. That is where the topic changes from payment retry logic to the notification system.
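The similarity computation itself is a few lines of vector math. A self-contained sketch (standard cosine similarity, not the library's internal implementation):

```typescript
type Vec = Float32Array | number[];

// Cosine similarity: dot product divided by the product of magnitudes.
function cosineSimilarity(a: Vec, b: Vec): number {
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// similarities[i] scores the boundary between segment i and segment i + 1.
function adjacentSimilarities(embeddings: Vec[]): number[] {
  const sims: number[] = [];
  for (let i = 0; i < embeddings.length - 1; i++) {
    sims.push(cosineSimilarity(embeddings[i], embeddings[i + 1]));
  }
  return sims;
}
```

Note that n segments yield n - 1 boundary scores, which is why the example above has five similarities for six segments.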
Phase 4: Detect Breakpoints
The algorithm computes a threshold - by default, mean - stddev of all similarity scores. Any adjacent pair whose similarity falls below this threshold is marked as a breakpoint.
mean = (0.87 + 0.72 + 0.91 + 0.31 + 0.83) / 5 = 0.728
stddev ≈ 0.218
threshold = 0.728 - 0.218 ≈ 0.510
Seg 0 ↔ Seg 1: 0.87 > 0.510 → NO BREAK
Seg 1 ↔ Seg 2: 0.72 > 0.510 → NO BREAK
Seg 2 ↔ Seg 3: 0.91 > 0.510 → NO BREAK
Seg 3 ↔ Seg 4: 0.31 < 0.510 → BREAK ✓
Seg 4 ↔ Seg 5: 0.83 > 0.510 → NO BREAK

Phase 5: Merge and Enforce Size Constraints
Consecutive segments between breakpoints are merged into final chunks. If a merged chunk exceeds maxChunkSize, it is split at the internal boundary with the lowest similarity. If a chunk is below minChunkSize, it is merged with whichever neighbor has the higher boundary similarity.
The result: two chunks, each containing a complete topic.
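Phases 4 and 5 reduce to a few lines of arithmetic. The following sketch is a simplified illustration that omits the min/max size enforcement described above:

```typescript
// Simplified sketch of phases 4-5: compute threshold = mean - stddev,
// then merge consecutive segments between breakpoints. The real
// algorithm additionally enforces minChunkSize / maxChunkSize afterwards.
function mergeByBreakpoints(segments: string[], sims: number[]): string[] {
  if (sims.length === 0) return [segments.join(' ')];

  const mean = sims.reduce((a, b) => a + b, 0) / sims.length;
  const variance = sims.reduce((a, s) => a + (s - mean) ** 2, 0) / sims.length;
  const threshold = mean - Math.sqrt(variance);

  const chunks: string[] = [];
  let current: string[] = [segments[0]];
  for (let i = 0; i < sims.length; i++) {
    if (sims[i] < threshold) {
      // Similarity dip below threshold: close the current chunk here.
      chunks.push(current.join(' '));
      current = [];
    }
    current.push(segments[i + 1]);
  }
  chunks.push(current.join(' '));
  return chunks;
}
```

Feeding in the five similarities from the worked example (0.87, 0.72, 0.91, 0.31, 0.83) produces exactly two chunks: segments 0-3 merged, and segments 4-5 merged.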
The API
Basic Usage
import { semanticChunk } from '@localmode/core';
import { transformers } from '@localmode/transformers';
const model = transformers.embedding('Xenova/bge-small-en-v1.5');
const chunks = await semanticChunk({
text: documentText,
model,
size: 2000, // max chunk size in characters
minSize: 100, // min chunk size (merge smaller ones)
});
for (const chunk of chunks) {
console.log(`Chunk ${chunk.index}: ${chunk.text.substring(0, 60)}...`);
console.log(` Boundaries:`, chunk.metadata?.semanticBoundaries);
// { leftSimilarity: 0.31, rightSimilarity: null }
}

Every chunk carries semanticBoundaries metadata - the cosine similarity scores at its left and right edges. Low scores confirm a genuine topic shift. You can use these scores downstream for confidence weighting or to visualize topic transitions in a UI.
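For instance, a downstream confidence weight could be derived from those boundary scores. The helper below is hypothetical - the semanticBoundaries shape matches the example above, but the weighting scheme itself is purely illustrative:

```typescript
// Hypothetical helper: the sharper the topic break at a chunk's edges
// (i.e. the lower the boundary similarity), the more topically
// self-contained the chunk, and the higher the confidence weight.
interface SemanticBoundaries {
  leftSimilarity: number | null;   // null at the document start
  rightSimilarity: number | null;  // null at the document end
}

function boundaryConfidence(b: SemanticBoundaries): number {
  const edges = [b.leftSimilarity, b.rightSimilarity].filter(
    (s): s is number => s !== null,
  );
  if (edges.length === 0) return 1; // single-chunk document
  const avgLeakage = edges.reduce((a, s) => a + s, 0) / edges.length;
  return 1 - avgLeakage; // in [0, 1] for non-negative similarities
}
```

A chunk with leftSimilarity 0.31 and no right neighbor would score about 0.69 under this scheme - one plausible way to reweight retrieval scores, not a LocalMode API.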
Configuring the Threshold
By default, semanticChunk() auto-detects the threshold using mean - stddev of all adjacent similarities. This works well for most documents. For fine-grained control, set it explicitly:
// Lower threshold = fewer, larger chunks (only split on dramatic topic shifts)
const broadChunks = await semanticChunk({
text: documentText,
model,
threshold: 0.3,
});
// Higher threshold = more, smaller chunks (split on subtle topic changes)
const fineChunks = await semanticChunk({
text: documentText,
model,
threshold: 0.7,
});

A threshold of 0.3 only splits when similarity drops below 0.3 - you will get a few large chunks covering broad topics. A threshold of 0.7 splits whenever similarity dips below 0.7 - you will get many small, highly focused chunks.
Reusable Chunker Factory
If you chunk many documents with the same settings, create a reusable chunker:
import { createSemanticChunker } from '@localmode/core';
const chunker = createSemanticChunker({
model: transformers.embedding('Xenova/bge-small-en-v1.5'),
threshold: 0.4,
size: 1500,
});
const chunks1 = await chunker(firstDocument);
const chunks2 = await chunker(secondDocument);
const chunks3 = await chunker(thirdDocument, { threshold: 0.6 }); // override per call

React Hook
The useSemanticChunk hook wraps the core function with loading state, error handling, and cancellation:
import { useSemanticChunk } from '@localmode/react';
import { transformers } from '@localmode/transformers';
function DocumentChunker() {
const { data: chunks, isLoading, error, execute, cancel } = useSemanticChunk({
model: transformers.embedding('Xenova/bge-small-en-v1.5'),
threshold: 0.4,
});
const handleChunk = async (text: string) => {
await execute(text);
};
return (
<div>
<button onClick={() => handleChunk(documentText)} disabled={isLoading}>
{isLoading ? 'Chunking...' : 'Chunk Document'}
</button>
{isLoading && <button onClick={cancel}>Cancel</button>}
{chunks?.map((c) => (
<div key={c.index}>{c.text.substring(0, 100)}...</div>
))}
</div>
);
}

Pipeline Integration
Use pipelineSemanticChunkStep() in a multi-step pipeline for full RAG ingestion:
import { createPipeline, pipelineSemanticChunkStep, pipelineEmbedManyStep, pipelineStoreStep } from '@localmode/core';
const model = transformers.embedding('Xenova/bge-small-en-v1.5');
const pipeline = createPipeline('semantic-ingest')
.step('chunk', pipelineSemanticChunkStep(model, { threshold: 0.4, size: 1500 }))
.step('embed', pipelineEmbedManyStep(model))
.step('store', pipelineStoreStep(db))
.build();
const { result } = await pipeline.run(documentText, {
onProgress: (p) => console.log(`${p.currentStep}: ${p.completed}/${p.total}`),
abortSignal: controller.signal,
});

Four Chunking Strategies Compared
LocalMode ships four chunking strategies. Each serves a different document type and use case.
Recursive (Default)
import { chunk } from '@localmode/core';
const chunks = chunk(text, { strategy: 'recursive', size: 500, overlap: 50 });

Splits text using a hierarchy of separators - paragraph breaks, then line breaks, then sentence boundaries, then word boundaries. Fast, deterministic, and requires no ML model.
Best for: General-purpose text, articles, documentation, any situation where you need speed and predictability.
Weakness: Unaware of topic boundaries. Will split mid-topic if the size target demands it.
Markdown
const chunks = chunk(markdownText, {
strategy: 'markdown',
size: 1000,
includeHeaders: true,
preserveCodeBlocks: true,
});

Parses markdown structure - headers, code blocks, tables, lists - and splits along structural boundaries. Each chunk carries headerPath metadata showing its position in the document hierarchy (e.g., "# API Reference > ## Authentication").
Best for: Documentation, README files, wiki pages, any markdown-formatted content where structure carries meaning.
Weakness: Requires markdown formatting. Falls back to paragraph-level splitting for unstructured sections.
Code
const chunks = chunk(sourceCode, {
strategy: 'code',
language: 'typescript',
preserveBlocks: true,
includeImports: true,
});

Recognizes function and class boundaries across 12 languages. Preserves complete function bodies as single chunks. Optionally prepends import statements to each chunk for context. Each chunk carries scopeName and scopeType metadata.
Best for: Source code, configuration files, any content with syntactic structure that should not be split mid-block.
Weakness: Language-specific patterns may miss non-standard syntax. Large functions still need splitting.
Semantic
import { semanticChunk } from '@localmode/core';
const chunks = await semanticChunk({ text, model, size: 1500 });

Uses embeddings to detect where topics change. Produces topically coherent chunks regardless of formatting. The only async strategy - it requires an embedding model and calls embedMany() internally.
Best for: Long-form documents, reports, transcripts, mixed-content documents where topics shift without structural markers.
Weakness: Slower than the others (requires embedding all segments). Requires an embedding model to be loaded. Non-deterministic - threshold auto-detection can vary across runs.
Decision Matrix
| Factor | Recursive | Markdown | Code | Semantic |
|---|---|---|---|---|
| Speed | Fast (sync) | Fast (sync) | Fast (sync) | Slower (async, needs embedding model) |
| Requires ML model | No | No | No | Yes |
| Structure-aware | No | Yes (markdown) | Yes (code syntax) | Yes (topic similarity) |
| Preserves complete thoughts | Sometimes | Usually | Usually | Almost always |
| Deterministic | Yes | Yes | Yes | Mostly (auto-threshold varies slightly) |
| Metadata | Position only | headerPath, isCodeBlock, isTable | scopeName, scopeType, language | semanticBoundaries (similarity scores) |
| Best document type | Generic text | Markdown docs | Source code | Long-form, mixed-topic |
You can combine strategies
Nothing prevents you from using markdown chunking for your .md files, code chunking for your .ts files, and semantic chunking for your long-form reports - all feeding into the same vector database. Match the strategy to the content type.
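A simple dispatcher makes this concrete. The extension-to-strategy mapping below is an assumption for illustration, not a LocalMode API:

```typescript
// Illustrative routing sketch: choose a chunking strategy per file type.
// Adjust the extension lists to match your corpus.
type Strategy = 'recursive' | 'markdown' | 'code' | 'semantic';

function strategyFor(filename: string): Strategy {
  const ext = filename.slice(filename.lastIndexOf('.') + 1).toLowerCase();
  if (ext === 'md' || ext === 'mdx') return 'markdown';
  if (['ts', 'tsx', 'js', 'py', 'go', 'rs', 'java'].includes(ext)) return 'code';
  if (ext === 'txt' || ext === 'pdf') return 'semantic'; // long-form prose
  return 'recursive'; // safe default for everything else
}
```

All four strategies emit chunks with compatible text and metadata fields, so their output can feed the same embedding and storage steps.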
Retrieval Quality: Why Chunk Boundaries Matter
The quality of your RAG pipeline is bounded by the quality of your chunks. Even a perfect embedding model and a perfect LLM cannot compensate for chunks that split answers across multiple fragments.
Consider this scenario. A document contains a paragraph about "exponential backoff retry with a 30-second base interval and 5 max attempts." A user asks: "What is the retry interval and max attempts?"
With fixed-size chunking, the paragraph might be split across two chunks. Chunk A contains "exponential backoff retry with a 30-second base interval" and chunk B contains "and 5 max attempts." The query embedding is most similar to chunk A (it mentions retries), so chunk A is retrieved. The LLM answers "30-second base interval" but cannot mention the max attempts - that information is in chunk B, which was not retrieved.
With semantic chunking, the entire retry logic section stays in one chunk because the topic does not change. Both facts are retrieved together. The LLM answers correctly: "30-second base interval with 5 max attempts."
This pattern repeats across every domain: medical records where diagnosis and treatment plan get split, legal contracts where a clause and its exceptions land in different chunks, research papers where a finding and its supporting evidence are separated. In each case, semantic chunking keeps related information together.
The improvement is not marginal. In internal testing across mixed-topic documents (product specs, research summaries, support transcripts), switching from recursive to semantic chunking improved top-3 retrieval relevance by 15-25% as measured by the percentage of queries where the correct answer was fully contained in the retrieved chunks. The embedding model and LLM were identical - only the chunking changed.
Performance Considerations
Semantic chunking is more expensive than the other strategies because it embeds every sentence-level segment. Here is what to expect:
Embedding cost. A 10,000-character document with a 200-character segment size produces roughly 50 segments. Embedding 50 short texts with BGE-small takes approximately 200-400ms on a modern laptop with WebGPU. On WASM-only devices, expect 500-1000ms. The cosine similarity computation and merge phases add negligible overhead.
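As a back-of-envelope check of those numbers (the per-segment latency here is an assumed constant for illustration, not a measured one):

```typescript
// Rough cost estimator: segments = ceil(chars / segmentSize), each
// segment costs ~msPerSegment to embed (assumed ~6 ms for BGE-small
// on WebGPU; several times that on WASM-only devices).
function estimateEmbedMs(docChars: number, segmentSize = 200, msPerSegment = 6): number {
  return Math.ceil(docChars / segmentSize) * msPerSegment;
}
// estimateEmbedMs(10_000) → 50 segments × 6 ms = 300 ms
```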
Amortized cost. You pay the embedding cost once per document at ingestion time. Query-time search is unaffected - it searches the same vector database regardless of how chunks were created.
AbortSignal support. Every phase of the algorithm checks abortSignal between steps. If a user navigates away or cancels an upload, the operation terminates cleanly without wasting resources on remaining segments.
const controller = new AbortController();
// Cancel after 5 seconds if still running
setTimeout(() => controller.abort(), 5000);
const chunks = await semanticChunk({
text: veryLongDocument,
model,
abortSignal: controller.signal,
});

When to skip semantic chunking. If your documents are already well-structured (markdown with clear headers, code with function boundaries), use the corresponding structural chunker. Semantic chunking adds the most value for unstructured or mixed-topic text where there are no formatting cues to guide splitting.
Putting It All Together: A Complete Ingestion Pipeline
Here is a full example that loads a document, chunks it semantically, and ingests the chunks into a searchable vector database:
import { semanticChunk, createVectorDB, embedMany, embed } from '@localmode/core';
import { transformers } from '@localmode/transformers';
// 1. Set up the embedding model (downloads once, cached after)
const model = transformers.embedding('Xenova/bge-small-en-v1.5');
// 2. Create a vector database
const db = await createVectorDB({ name: 'knowledge-base', dimensions: 384 });
// 3. Chunk the document semantically
const chunks = await semanticChunk({
text: documentText,
model,
size: 1500,
minSize: 200,
});
console.log(`Created ${chunks.length} topically coherent chunks`);
// 4. Embed and store each chunk
const { embeddings } = await embedMany({
model,
values: chunks.map((c) => c.text),
});
for (let i = 0; i < chunks.length; i++) {
await db.add({
id: `chunk-${i}`,
vector: embeddings[i],
metadata: {
text: chunks[i].text,
start: chunks[i].start,
end: chunks[i].end,
...chunks[i].metadata,
},
});
}
// 5. Search by meaning
const { embedding: queryVec } = await embed({ model, value: 'retry logic interval' });
const results = await db.search(queryVec, { k: 3 });
for (const result of results) {
console.log(`Score: ${result.score.toFixed(3)}`);
console.log(`Text: ${result.metadata.text.substring(0, 100)}...`);
}

Every line of this runs in the browser. No servers, no API keys, no data leaving the device. The embedding model downloads once from HuggingFace and is cached in IndexedDB. Subsequent loads are instant.
Key Takeaways
| Concept | Detail |
|---|---|
| Problem | Fixed-size chunks split documents mid-topic, fragmenting answers across multiple chunks |
| Solution | Semantic chunking uses embedding similarity to detect where topics change and splits at those boundaries |
| Algorithm | Pre-split into sentences, embed all segments, compute adjacent cosine similarities, split where similarity drops below threshold |
| Threshold | Auto-detected via mean - stddev by default, or set manually (lower = fewer splits, higher = more splits) |
| Metadata | Each chunk carries semanticBoundaries with left/right cosine similarity scores |
| When to use | Long-form documents, mixed-topic content, transcripts, reports - anywhere topics shift without structural markers |
| When to skip | Well-structured markdown (use markdown strategy) or source code (use code strategy) |
What To Explore Next
- RAG guide - Full API reference for all chunking strategies, ingestion, hybrid search, and BM25
- Embeddings guide - Deep dive into embed(), embedMany(), streamEmbedMany(), and middleware
- Vector Database guide - Typed metadata, HNSW indexing, filters, and WebGPU-accelerated search
- Pipeline builder - Compose pipelineSemanticChunkStep() with embed, search, rerank, and generate steps
- React hooks - useSemanticChunk(), usePipeline(), and other hooks for React applications
Methodology
This post uses the following models and techniques:
- BGE-small-en-v1.5 (384 dimensions) for segment embeddings - one of the top-ranked models on the MTEB leaderboard for its size class
- Cosine similarity as implemented in packages/core/src/rag/chunkers/semantic.ts - standard dot product divided by the product of vector magnitudes
- The auto-threshold formula is mean - stddev of all adjacent cosine similarity scores, a common approach in topic segmentation literature (see TextTiling by Hearst, 1997, and subsequent embedding-based adaptations)
- Similarity scores in the examples are illustrative of typical model behavior; exact values depend on model version, quantization, and input text
- The 15-25% retrieval improvement figure is based on internal evaluation across product specs, research summaries, and support transcripts, measured as the percentage of queries where the top-3 retrieved chunks fully contained the ground-truth answer
Try it yourself
Visit localmode.ai to try 30+ AI demo apps running entirely in your browser. No sign-up, no API keys, no data leaves your device.
Read the Getting Started guide to add local AI to your application in under 5 minutes.