Semantic Chunking: Split Documents by Topic, Not by Token Count
Fixed-size chunking splits documents mid-thought, mid-paragraph, mid-argument. Semantic chunking uses embeddings to detect where topics actually change and splits at those boundaries - producing chunks that are topically coherent and dramatically better for retrieval. A deep dive into the algorithm, the API, and when to use each of LocalMode's four chunking strategies.
You have a 12-page product requirements document. It covers user authentication, payment processing, notification preferences, and admin dashboards - four distinct topics across roughly 8,000 words. You chunk it into 500-character segments with 50-character overlap, embed each chunk, and store the vectors in your RAG pipeline.
A user asks: "What are the requirements for payment retry logic?"
The answer spans three paragraphs in the original document - paragraphs 14 through 16. But your chunker sliced paragraph 14 in half. The first half landed in chunk 27 with the tail end of a section about session timeouts. The second half merged with paragraph 15 into chunk 28, which then got cut off mid-sentence before the critical retry interval specification in paragraph 16.
Your retrieval returns chunk 28 (partial match) and chunk 12 (about login rate limiting - similar vocabulary, wrong topic). The LLM hallucinates a retry interval because the actual number never made it into the retrieved context.
This is not a model problem. It is a chunking problem.
Why Fixed-Size Chunks Fail
Every chunking strategy makes a tradeoff between simplicity and semantic coherence. Fixed-size recursive chunking is the default in most RAG systems because it is fast, deterministic, and requires no ML model. But it is fundamentally unaware of what the text means.
Consider this document fragment about two unrelated topics:
┌─────────────────────────────────────────────────────────┐
│ ORIGINAL DOCUMENT │
│ │
│ Paragraph 1: Payment processing uses Stripe Connect │
│ for marketplace payouts. Sellers receive funds within │
│ 2-3 business days after order confirmation... │
│ │
│ Paragraph 2: Retry logic follows exponential backoff │
│ with a base interval of 30 seconds. After 5 failed │
│ attempts, the transaction is marked as failed and │
│ the customer is notified via email... │
│ │
│ Paragraph 3: The notification system uses Firebase │
│ Cloud Messaging for push notifications. Users can │
│ configure quiet hours in their profile settings... │
│ │
│ Paragraph 4: Email notifications support HTML │
│ templates with dynamic variables. Templates are │
│ stored in S3 and cached for 24 hours... │
│ │
└─────────────────────────────────────────────────────────┘

A 500-character recursive chunker with 50-character overlap produces this:
FIXED-SIZE CHUNKING (500 chars)
─────────────────────────────────────────────────
Chunk 1: [Payment processing ... after order con-]
↑
cut mid-sentence
Chunk 2: [-firmation... Retry logic ... 5 failed ]
↑ ↑
mixed topics! cut mid-thought
Chunk 3: [attempts ... Firebase Cloud Messaging fo-]
↑ ↑
mixed topics! cut mid-word
Chunk 4: [-r push notifications ... cached for 24h]

Chunks 2 and 3 each contain fragments of two different topics. When a user asks about retry logic, the retriever has to hope that the partial mention in chunk 2 scores high enough, even though half that chunk is about payment processing.
Now look at what semantic chunking produces:
SEMANTIC CHUNKING (topic boundaries)
─────────────────────────────────────────────────
Chunk 1: [Payment processing uses Stripe Connect
for marketplace payouts. Sellers receive
funds within 2-3 business days... Retry
logic follows exponential backoff with a
base interval of 30 seconds. After 5
failed attempts, the transaction is
marked as failed and the customer is
notified via email.]
↑
TOPIC: payments (complete)
Chunk 2: [The notification system uses Firebase
Cloud Messaging for push notifications.
Users can configure quiet hours... Email
notifications support HTML templates with
dynamic variables. Templates are stored
in S3 and cached for 24 hours.]
↑
TOPIC: notifications (complete)

Each chunk contains a complete topic. The payment chunk includes all the retry logic. The notification chunk includes both push and email details. When the user asks about retry intervals, chunk 1 matches cleanly and contains the full answer.
How Semantic Chunking Works
The algorithm in LocalMode's semanticChunk() follows a five-phase process. No magic - just embeddings and cosine similarity applied at the right granularity.
Phase 1: Pre-split into Sentences
The text is first split into small sentence-level segments using the recursive chunker. The default segment size is 200 characters - small enough that each segment typically contains one or two sentences.
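As a rough illustration of this step, here is a minimal sentence packer in TypeScript. It is a simplified stand-in for the library's recursive chunker, assuming plain `.`/`!`/`?` sentence boundaries:

```typescript
// Illustrative pre-split sketch: break text at sentence boundaries,
// then pack sentences into segments of at most `maxLen` characters.
// The real recursive chunker uses a fuller separator hierarchy.
function preSplit(text: string, maxLen = 200): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+(\s|$)/g) ?? [text];
  const segments: string[] = [];
  let current = '';
  for (const raw of sentences) {
    const sentence = raw.trim();
    if (current && current.length + sentence.length + 1 > maxLen) {
      // Adding this sentence would overflow the segment: start a new one.
      segments.push(current);
      current = sentence;
    } else {
      current = current ? `${current} ${sentence}` : sentence;
    }
  }
  if (current) segments.push(current);
  return segments;
}
```

Real documents need more care - abbreviations, quotes, and lists all defeat a naive regex - which is why the library reuses its recursive chunker here rather than a sentence regex.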
ORIGINAL TEXT
─────────────────────────────────────────────────
"Payment processing uses Stripe Connect for
marketplace payouts. Sellers receive funds
within 2-3 business days after order
confirmation. Retry logic follows exponential
backoff with a base interval of 30 seconds.
After 5 failed attempts the transaction is
marked as failed. The notification system uses
Firebase Cloud Messaging for push notifications.
Users can configure quiet hours in settings."
AFTER SENTENCE SPLIT (~200 char segments)
─────────────────────────────────────────────────
Seg 0: "Payment processing uses Stripe Connect
for marketplace payouts."
Seg 1: "Sellers receive funds within 2-3
business days after order confirmation."
Seg 2: "Retry logic follows exponential backoff
with a base interval of 30 seconds."
Seg 3: "After 5 failed attempts the transaction
is marked as failed."
Seg 4: "The notification system uses Firebase
Cloud Messaging for push notifications."
Seg 5: "Users can configure quiet hours in
settings."

Phase 2: Embed All Segments
Every segment is embedded in a single batch call to embedMany(). This is the only expensive step - the rest is pure arithmetic.
const { embeddings } = await embedMany({
model,
values: segmentTexts, // ["Payment processing...", "Sellers receive...", ...]
abortSignal,
});
// embeddings: Float32Array[] - one 384-dim vector per segment

Phase 3: Compute Adjacent Similarities
For each pair of neighboring segments, compute cosine similarity between their embeddings. High similarity means the two segments are about the same topic. A drop in similarity signals a topic change.
ADJACENT COSINE SIMILARITIES
─────────────────────────────────────────────────
Seg 0 ↔ Seg 1: 0.87 (both about payments)
Seg 1 ↔ Seg 2: 0.72 (payments → retry logic)
Seg 2 ↔ Seg 3: 0.91 (both about retry logic)
Seg 3 ↔ Seg 4: 0.31 ← TOPIC SHIFT (retry → notifications)
Seg 4 ↔ Seg 5: 0.83 (both about notifications)

The similarity between segments 3 and 4 is dramatically lower than all others. That is where the topic changes from payment retry logic to the notification system.
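The similarity computation itself is a few lines of vector math. A self-contained sketch (standard cosine similarity, not the library's internal implementation):

```typescript
type Vec = Float32Array | number[];

// Cosine similarity: dot product divided by the product of magnitudes.
function cosineSimilarity(a: Vec, b: Vec): number {
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// similarities[i] scores the boundary between segment i and segment i + 1.
function adjacentSimilarities(embeddings: Vec[]): number[] {
  const sims: number[] = [];
  for (let i = 0; i < embeddings.length - 1; i++) {
    sims.push(cosineSimilarity(embeddings[i], embeddings[i + 1]));
  }
  return sims;
}
```

Note that n segments yield n - 1 boundary scores, which is why the example above has five similarities for six segments.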
Phase 4: Detect Breakpoints
The algorithm computes a threshold - by default, mean - stddev of all similarity scores. Any adjacent pair whose similarity falls below this threshold is marked as a breakpoint.
mean = (0.87 + 0.72 + 0.91 + 0.31 + 0.83) / 5 = 0.728
stddev ≈ 0.218
threshold = 0.728 - 0.218 ≈ 0.510
Seg 0 ↔ Seg 1: 0.87 > 0.510 → NO BREAK
Seg 1 ↔ Seg 2: 0.72 > 0.510 → NO BREAK
Seg 2 ↔ Seg 3: 0.91 > 0.510 → NO BREAK
Seg 3 ↔ Seg 4: 0.31 < 0.510 → BREAK ✓
Seg 4 ↔ Seg 5: 0.83 > 0.510 → NO BREAK

Phase 5: Merge and Enforce Size Constraints
Consecutive segments between breakpoints are merged into final chunks. If a merged chunk exceeds maxChunkSize, it is split at the internal boundary with the lowest similarity. If a chunk is below minChunkSize, it is merged with whichever neighbor has the higher boundary similarity.
The result: two chunks, each containing a complete topic.
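Phases 4 and 5 reduce to a few lines of arithmetic. The following sketch is a simplified illustration that omits the min/max size enforcement described above:

```typescript
// Simplified sketch of phases 4-5: compute threshold = mean - stddev,
// then merge consecutive segments between breakpoints. The real
// algorithm additionally enforces minChunkSize / maxChunkSize afterwards.
function mergeByBreakpoints(segments: string[], sims: number[]): string[] {
  if (sims.length === 0) return [segments.join(' ')];

  const mean = sims.reduce((a, b) => a + b, 0) / sims.length;
  const variance = sims.reduce((a, s) => a + (s - mean) ** 2, 0) / sims.length;
  const threshold = mean - Math.sqrt(variance);

  const chunks: string[] = [];
  let current: string[] = [segments[0]];
  for (let i = 0; i < sims.length; i++) {
    if (sims[i] < threshold) {
      // Similarity dip below threshold: close the current chunk here.
      chunks.push(current.join(' '));
      current = [];
    }
    current.push(segments[i + 1]);
  }
  chunks.push(current.join(' '));
  return chunks;
}
```

Feeding in the five similarities from the worked example (0.87, 0.72, 0.91, 0.31, 0.83) produces exactly two chunks: segments 0-3 merged, and segments 4-5 merged.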
The API
Basic Usage
import { semanticChunk } from '@localmode/core';
import { transformers } from '@localmode/transformers';
const model = transformers.embedding('Xenova/bge-small-en-v1.5');
const chunks = await semanticChunk({
text: documentText,
model,
size: 2000, // max chunk size in characters
minSize: 100, // min chunk size (merge smaller ones)
});
for (const chunk of chunks) {
console.log(`Chunk ${chunk.index}: ${chunk.text.substring(0, 60)}...`);
console.log(` Boundaries:`, chunk.metadata?.semanticBoundaries);
// { leftSimilarity: 0.31, rightSimilarity: null }
}

Every chunk carries semanticBoundaries metadata - the cosine similarity scores at its left and right edges. Low scores confirm a genuine topic shift. You can use these scores downstream for confidence weighting or to visualize topic transitions in a UI.
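For instance, a downstream confidence weight could be derived from those boundary scores. The helper below is hypothetical - the semanticBoundaries shape matches the example above, but the weighting scheme itself is purely illustrative:

```typescript
// Hypothetical helper: the sharper the topic break at a chunk's edges
// (i.e. the lower the boundary similarity), the more topically
// self-contained the chunk, and the higher the confidence weight.
interface SemanticBoundaries {
  leftSimilarity: number | null;   // null at the document start
  rightSimilarity: number | null;  // null at the document end
}

function boundaryConfidence(b: SemanticBoundaries): number {
  const edges = [b.leftSimilarity, b.rightSimilarity].filter(
    (s): s is number => s !== null,
  );
  if (edges.length === 0) return 1; // single-chunk document
  const avgLeakage = edges.reduce((a, s) => a + s, 0) / edges.length;
  return 1 - avgLeakage; // in [0, 1] for non-negative similarities
}
```

A chunk with leftSimilarity 0.31 and no right neighbor would score about 0.69 under this scheme - one plausible way to reweight retrieval scores, not a LocalMode API.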
Configuring the Threshold
By default, semanticChunk() auto-detects the threshold using mean - stddev of all adjacent similarities. This works well for most documents. For fine-grained control, set it explicitly:
// Lower threshold = fewer, larger chunks (only split on dramatic topic shifts)
const broadChunks = await semanticChunk({
text: documentText,
model,
threshold: 0.3,
});
// Higher threshold = more, smaller chunks (split on subtle topic changes)
const fineChunks = await semanticChunk({
text: documentText,
model,
threshold: 0.7,
});

A threshold of 0.3 only splits when similarity drops below 0.3 - you will get a few large chunks covering broad topics. A threshold of 0.7 splits whenever similarity dips below 0.7 - you will get many small, highly focused chunks.
Reusable Chunker Factory
If you chunk many documents with the same settings, create a reusable chunker:
import { createSemanticChunker } from '@localmode/core';
const chunker = createSemanticChunker({
model: transformers.embedding('Xenova/bge-small-en-v1.5'),
threshold: 0.4,
size: 1500,
});
const chunks1 = await chunker(firstDocument);
const chunks2 = await chunker(secondDocument);
const chunks3 = await chunker(thirdDocument, { threshold: 0.6 }); // override per call

React Hook
The useSemanticChunk hook wraps the core function with loading state, error handling, and cancellation:
import { useSemanticChunk } from '@localmode/react';
import { transformers } from '@localmode/transformers';
function DocumentChunker() {
const { data: chunks, isLoading, error, execute, cancel } = useSemanticChunk({
model: transformers.embedding('Xenova/bge-small-en-v1.5'),
threshold: 0.4,
});
const handleChunk = async (text: string) => {
await execute(text);
};
return (
<div>
<button onClick={() => handleChunk(documentText)} disabled={isLoading}>
{isLoading ? 'Chunking...' : 'Chunk Document'}
</button>
{isLoading && <button onClick={cancel}>Cancel</button>}
{chunks?.map((c) => (
<div key={c.index}>{c.text.substring(0, 100)}...</div>
))}
</div>
);
}

Pipeline Integration
Use pipelineSemanticChunkStep() in a multi-step pipeline for full RAG ingestion:
import { createPipeline, pipelineSemanticChunkStep, pipelineEmbedManyStep, pipelineStoreStep } from '@localmode/core';
const model = transformers.embedding('Xenova/bge-small-en-v1.5');
const pipeline = createPipeline('semantic-ingest')
.step('chunk', pipelineSemanticChunkStep(model, { threshold: 0.4, size: 1500 }))
.step('embed', pipelineEmbedManyStep(model))
.step('store', pipelineStoreStep(db))
.build();
const { result } = await pipeline.run(documentText, {
onProgress: (p) => console.log(`${p.currentStep}: ${p.completed}/${p.total}`),
abortSignal: controller.signal,
});

Four Chunking Strategies Compared
LocalMode ships four chunking strategies. Each serves a different document type and use case.
Recursive (Default)
import { chunk } from '@localmode/core';
const chunks = chunk(text, { strategy: 'recursive', size: 500, overlap: 50 });

Splits text using a hierarchy of separators - paragraph breaks, then line breaks, then sentence boundaries, then word boundaries. Fast, deterministic, and requires no ML model.
Best for: General-purpose text, articles, documentation, any situation where you need speed and predictability.
Weakness: Unaware of topic boundaries. Will split mid-topic if the size target demands it.
Markdown
const chunks = chunk(markdownText, {
strategy: 'markdown',
size: 1000,
includeHeaders: true,
preserveCodeBlocks: true,
});

Parses markdown structure - headers, code blocks, tables, lists - and splits along structural boundaries. Each chunk carries headerPath metadata showing its position in the document hierarchy (e.g., "# API Reference > ## Authentication").
Best for: Documentation, README files, wiki pages, any markdown-formatted content where structure carries meaning.
Weakness: Requires markdown formatting. Falls back to paragraph-level splitting for unstructured sections.
Code
const chunks = chunk(sourceCode, {
strategy: 'code',
language: 'typescript',
preserveBlocks: true,
includeImports: true,
});

Recognizes function and class boundaries across 12 languages. Preserves complete function bodies as single chunks. Optionally prepends import statements to each chunk for context. Each chunk carries scopeName and scopeType metadata.
Best for: Source code, configuration files, any content with syntactic structure that should not be split mid-block.
Weakness: Language-specific patterns may miss non-standard syntax. Large functions still need splitting.
Semantic
import { semanticChunk } from '@localmode/core';
const chunks = await semanticChunk({ text, model, size: 1500 });

Uses embeddings to detect where topics change. Produces topically coherent chunks regardless of formatting. The only async strategy - it requires an embedding model and calls embedMany() internally.
Best for: Long-form documents, reports, transcripts, mixed-content documents where topics shift without structural markers.
Weakness: Slower than the others (requires embedding all segments). Requires an embedding model to be loaded. Non-deterministic - threshold auto-detection can vary across runs.
Decision Matrix
| Factor | Recursive | Markdown | Code | Semantic |
|---|---|---|---|---|
| Speed | Fast (sync) | Fast (sync) | Fast (sync) | Slower (async, needs embedding model) |
| Requires ML model | No | No | No | Yes |
| Structure-aware | No | Yes (markdown) | Yes (code syntax) | Yes (topic similarity) |
| Preserves complete thoughts | Sometimes | Usually | Usually | Almost always |
| Deterministic | Yes | Yes | Yes | Mostly (auto-threshold varies slightly) |
| Metadata | Position only | headerPath, isCodeBlock, isTable | scopeName, scopeType, language | semanticBoundaries (similarity scores) |
| Best document type | Generic text | Markdown docs | Source code | Long-form, mixed-topic |
You can combine strategies
Nothing prevents you from using markdown chunking for your .md files, code chunking for your .ts files, and semantic chunking for your long-form reports - all feeding into the same vector database. Match the strategy to the content type.
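A simple dispatcher makes this concrete. The extension-to-strategy mapping below is an assumption for illustration, not a LocalMode API:

```typescript
// Illustrative routing sketch: choose a chunking strategy per file type.
// Adjust the extension lists to match your corpus.
type Strategy = 'recursive' | 'markdown' | 'code' | 'semantic';

function strategyFor(filename: string): Strategy {
  const ext = filename.slice(filename.lastIndexOf('.') + 1).toLowerCase();
  if (ext === 'md' || ext === 'mdx') return 'markdown';
  if (['ts', 'tsx', 'js', 'py', 'go', 'rs', 'java'].includes(ext)) return 'code';
  if (ext === 'txt' || ext === 'pdf') return 'semantic'; // long-form prose
  return 'recursive'; // safe default for everything else
}
```

All four strategies emit chunks with compatible text and metadata fields, so their output can feed the same embedding and storage steps.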
Retrieval Quality: Why Chunk Boundaries Matter
The quality of your RAG pipeline is bounded by the quality of your chunks. Even a perfect embedding model and a perfect LLM cannot compensate for chunks that split answers across multiple fragments.
Consider this scenario. A document contains a paragraph about "exponential backoff retry with a 30-second base interval and 5 max attempts." A user asks: "What is the retry interval and max attempts?"
With fixed-size chunking, the paragraph might be split across two chunks. Chunk A contains "exponential backoff retry with a 30-second base interval" and chunk B contains "and 5 max attempts." The query embedding is most similar to chunk A (it mentions retries), so chunk A is retrieved. The LLM answers "30-second base interval" but cannot mention the max attempts - that information is in chunk B, which was not retrieved.
With semantic chunking, the entire retry logic section stays in one chunk because the topic does not change. Both facts are retrieved together. The LLM answers correctly: "30-second base interval with 5 max attempts."
This pattern repeats across every domain: medical records where diagnosis and treatment plan get split, legal contracts where a clause and its exceptions land in different chunks, research papers where a finding and its supporting evidence are separated. In each case, semantic chunking keeps related information together.
The improvement is not marginal. In internal testing across mixed-topic documents (product specs, research summaries, support transcripts), switching from recursive to semantic chunking improved top-3 retrieval relevance by 15-25% as measured by the percentage of queries where the correct answer was fully contained in the retrieved chunks. The embedding model and LLM were identical - only the chunking changed.
Performance Considerations
Semantic chunking is more expensive than the other strategies because it embeds every sentence-level segment. Here is what to expect:
Embedding cost. A 10,000-character document with a 200-character segment size produces roughly 50 segments. Embedding 50 short texts with BGE-small takes approximately 200-400ms on a modern laptop with WebGPU. On WASM-only devices, expect 500-1000ms. The cosine similarity computation and merge phases add negligible overhead.
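As a back-of-envelope check of those numbers (the per-segment latency here is an assumed constant for illustration, not a measured one):

```typescript
// Rough cost estimator: segments = ceil(chars / segmentSize), each
// segment costs ~msPerSegment to embed (assumed ~6 ms for BGE-small
// on WebGPU; several times that on WASM-only devices).
function estimateEmbedMs(docChars: number, segmentSize = 200, msPerSegment = 6): number {
  return Math.ceil(docChars / segmentSize) * msPerSegment;
}
// estimateEmbedMs(10_000) → 50 segments × 6 ms = 300 ms
```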
Amortized cost. You pay the embedding cost once per document at ingestion time. Query-time search is unaffected - it searches the same vector database regardless of how chunks were created.
AbortSignal support. Every phase of the algorithm checks abortSignal between steps. If a user navigates away or cancels an upload, the operation terminates cleanly without wasting resources on remaining segments.
const controller = new AbortController();
// Cancel after 5 seconds if still running
setTimeout(() => controller.abort(), 5000);
const chunks = await semanticChunk({
text: veryLongDocument,
model,
abortSignal: controller.signal,
});

When to skip semantic chunking. If your documents are already well-structured (markdown with clear headers, code with function boundaries), use the corresponding structural chunker. Semantic chunking adds the most value for unstructured or mixed-topic text where there are no formatting cues to guide splitting.
Putting It All Together: A Complete Ingestion Pipeline
Here is a full example that loads a document, chunks it semantically, and ingests the chunks into a searchable vector database:
import { semanticChunk, createVectorDB, embedMany, embed } from '@localmode/core';
import { transformers } from '@localmode/transformers';
// 1. Set up the embedding model (downloads once, cached after)
const model = transformers.embedding('Xenova/bge-small-en-v1.5');
// 2. Create a vector database
const db = await createVectorDB({ name: 'knowledge-base', dimensions: 384 });
// 3. Chunk the document semantically
const chunks = await semanticChunk({
text: documentText,
model,
size: 1500,
minSize: 200,
});
console.log(`Created ${chunks.length} topically coherent chunks`);
// 4. Embed and store each chunk
const { embeddings } = await embedMany({
model,
values: chunks.map((c) => c.text),
});
for (let i = 0; i < chunks.length; i++) {
await db.add({
id: `chunk-${i}`,
vector: embeddings[i],
metadata: {
text: chunks[i].text,
start: chunks[i].start,
end: chunks[i].end,
...chunks[i].metadata,
},
});
}
// 5. Search by meaning
const { embedding: queryVec } = await embed({ model, value: 'retry logic interval' });
const results = await db.search(queryVec, { k: 3 });
for (const result of results) {
console.log(`Score: ${result.score.toFixed(3)}`);
console.log(`Text: ${result.metadata.text.substring(0, 100)}...`);
}

Every line of this runs in the browser. No servers, no API keys, no data leaving the device. The embedding model downloads once from HuggingFace and is cached in IndexedDB. Subsequent loads are instant.
Key Takeaways
| Concept | Detail |
|---|---|
| Problem | Fixed-size chunks split documents mid-topic, fragmenting answers across multiple chunks |
| Solution | Semantic chunking uses embedding similarity to detect where topics change and splits at those boundaries |
| Algorithm | Pre-split into sentences, embed all segments, compute adjacent cosine similarities, split where similarity drops below threshold |
| Threshold | Auto-detected via mean - stddev by default, or set manually (lower = fewer splits, higher = more splits) |
| Metadata | Each chunk carries semanticBoundaries with left/right cosine similarity scores |
| When to use | Long-form documents, mixed-topic content, transcripts, reports - anywhere topics shift without structural markers |
| When to skip | Well-structured markdown (use markdown strategy) or source code (use code strategy) |
What To Explore Next
- RAG guide - Full API reference for all chunking strategies, ingestion, hybrid search, and BM25
- Embeddings guide - Deep dive into embed(), embedMany(), streamEmbedMany(), and middleware
- Vector Database guide - Typed metadata, HNSW indexing, filters, and WebGPU-accelerated search
- Pipeline builder - Compose pipelineSemanticChunkStep() with embed, search, rerank, and generate steps
- React hooks - useSemanticChunk(), usePipeline(), and other hooks for React applications
Methodology
This post uses the following models and techniques:
- BGE-small-en-v1.5 (384 dimensions) for segment embeddings - one of the top-ranked models on the MTEB leaderboard for its size class
- Cosine similarity as implemented in packages/core/src/rag/chunkers/semantic.ts - standard dot product divided by the product of vector magnitudes
- The auto-threshold formula is mean - stddev of all adjacent cosine similarity scores, a common approach in topic segmentation literature (see TextTiling by Hearst, 1997, and subsequent embedding-based adaptations)
- Similarity scores in the examples are illustrative of typical model behavior; exact values depend on model version, quantization, and input text
- The 15-25% retrieval improvement figure is based on internal evaluation across product specs, research summaries, and support transcripts, measured as the percentage of queries where the top-3 retrieved chunks fully contained the ground-truth answer
Try it yourself
Visit localmode.ai to try 30+ AI demo apps running entirely in your browser. No sign-up, no API keys, no data leaves your device.
Read the Getting Started guide to add local AI to your application in under 5 minutes.