Specialized NLP Models in the Browser

DistilBART summarization, ModernBERT fill-mask, and DistilBERT QA - single-task NLP models optimized for browser inference.

Overview

The Specialized NLP Models family is available through Transformers.js in LocalMode, with model sizes ranging from 65MB to 360MB. The family's primary task is summarization, alongside fill-mask and extractive question answering, and the models can be used with any application built on the LocalMode SDK.

Running these models locally in the browser eliminates API costs, removes network round-trips, and keeps all user data on-device. After the initial model download, inference makes no network requests and works offline. Each model variant targets a different trade-off between size, speed, and quality - choose based on your users' device capabilities and your application's requirements.

Architecture and History

Not every NLP task needs a general-purpose LLM. LocalMode includes three specialized models that can outperform much larger general-purpose models on their specific tasks while using a fraction of the memory.

DistilBART-CNN-6-6 (284MB) is a 230M-parameter distilled version of BART-large (406M parameters) with 6 encoder and 6 decoder layers, fine-tuned on CNN/DailyMail. It generates abstractive summaries - rephrasing and condensing the source text rather than extracting sentences - at 2.09× the speed of BART-large-CNN. For document summarization, meeting notes, and article digests, it produces more natural results than prompting a 3B LLM to summarize.

ModernBERT-base (140MB) is a December 2024 encoder-only model with 149M parameters trained on 2 trillion tokens. It replaces classic BERT's 512-token limit with an 8,192-token context window, using Rotary Positional Embeddings and alternating local-global attention. Given a sentence with a [MASK] token, its fill-mask capability predicts the most likely word for that position, which is useful for text completion, autocomplete features, data augmentation, and probing what the model understands.
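To make the fill-mask step concrete, the sketch below shows how raw scores for a masked position can be turned into ranked word predictions with a softmax. It is an illustration only: `topKMaskPredictions`, the four-word vocabulary, and the logits are hypothetical and made up for this example, and none of it is part of the LocalMode or Transformers.js API.

```typescript
// Hypothetical illustration, not LocalMode API.
interface MaskPrediction {
  token: string;
  probability: number;
}

// Convert raw scores for the masked position into probabilities (softmax)
// and return the k most likely vocabulary entries.
function topKMaskPredictions(
  logits: number[],
  vocab: string[],
  k: number,
): MaskPrediction[] {
  // Subtract the maximum logit before exponentiating for numerical stability.
  const maxLogit = logits.reduce((m, x) => Math.max(m, x), -Infinity);
  const exps = logits.map((x) => Math.exp(x - maxLogit));
  const total = exps.reduce((sum, x) => sum + x, 0);
  return exps
    .map((x, i) => ({ token: vocab[i], probability: x / total }))
    .sort((a, b) => b.probability - a.probability)
    .slice(0, k);
}

// Toy vocabulary and made-up logits for "The capital of France is [MASK]."
const toyVocab = ["Paris", "London", "banana", "Lyon"];
const toyLogits = [5.0, 2.0, -1.0, 3.0];
const maskPredictions = topKMaskPredictions(toyLogits, toyVocab, 2);
console.log(maskPredictions.map((p) => p.token));
```

A real model scores every entry in its vocabulary rather than four words, but the ranking step works the same way.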

DistilBERT-SQuAD (65MB) is a 65.2M-parameter model fine-tuned on SQuAD v1.1 for extractive question answering: given a context passage and a question, it identifies the exact span of text that answers the question (F1 87.1 on the SQuAD v1.1 dev set). This is fundamentally different from generative QA (which LLMs do) - extractive QA is faster, often more accurate for factual retrieval, and cannot invent text, since it only returns spans from the source (although it can still select the wrong span).
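The span selection described above can be sketched as follows. An extractive QA model typically produces one "start" score and one "end" score per token, and the answer is the pair with the highest combined score. The helper `bestSpan`, the `maxAnswerTokens` cap, and the scores are hypothetical values chosen for illustration; this is not the LocalMode API or the exact post-processing used by Transformers.js.

```typescript
// Hypothetical illustration, not LocalMode API.
interface AnswerSpan {
  start: number; // index of the first answer token
  end: number;   // index of the last answer token (inclusive)
  score: number; // startScores[start] + endScores[end]
}

// Pick the highest-scoring span with start <= end and a bounded length.
function bestSpan(
  startScores: number[],
  endScores: number[],
  maxAnswerTokens = 30,
): AnswerSpan {
  let best: AnswerSpan = { start: 0, end: 0, score: -Infinity };
  for (let s = 0; s < startScores.length; s++) {
    const endLimit = Math.min(endScores.length, s + maxAnswerTokens);
    for (let e = s; e < endLimit; e++) {
      const score = startScores[s] + endScores[e];
      if (score > best.score) best = { start: s, end: e, score };
    }
  }
  return best;
}

// Toy context with made-up scores for the question "Where is the Eiffel Tower?"
const contextTokens = ["The", "Eiffel", "Tower", "is", "in", "Paris", "."];
const toyStartScores = [0.1, 0.2, 0.1, 0.0, 0.3, 4.0, 0.0];
const toyEndScores = [0.0, 0.1, 0.3, 0.0, 0.1, 3.5, 0.2];
const answerSpan = bestSpan(toyStartScores, toyEndScores);
const answerText = contextTokens
  .slice(answerSpan.start, answerSpan.end + 1)
  .join(" ");
console.log(answerText);
```

Because the result is always a slice of `contextTokens`, the answer can only contain text that appears in the source passage.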

Variant Comparison

The following table lists every variant in the Specialized NLP Models family available through LocalMode, across all supported providers. Each model ID matches the name of its HuggingFace repository, where the model card can be found.

| Model ID | Provider | Size | Speed | Quality | Context | Device |
| --- | --- | --- | --- | --- | --- | --- |
| Xenova/distilbart-cnn-6-6 | Transformers.js | 284MB | Medium | High | - | WASM |
| Xenova/distilbart-cnn-12-6 | Transformers.js | 360MB | Slow | High | - | WASM |
| onnx-community/ModernBERT-base-ONNX | Transformers.js | 140MB | Medium | High | 8192 | WASM |
| Xenova/bert-base-uncased | Transformers.js | 96MB | Fast | Good | 512 | WASM |
| Xenova/distilbert-base-cased-distilled-squad | Transformers.js | 65MB | Fast | Good | - | WASM |

Size Distribution

| Size Range | Count |
| --- | --- |
| 200MB–400MB | 2 variants |
| Under 200MB | 3 variants |

How to choose a variant: Start with the smallest model that meets your quality requirements. For prototyping and development, use the fastest variant (smallest size, "Fast" speed tier). For production, test your specific use case against 2–3 variants and measure the quality difference against user expectations. In many applications, users cannot distinguish between "Good" and "High" quality tiers - the smaller model saves download time and memory.
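The "smallest model that meets your quality requirements" rule can be expressed as a small lookup over the variant table. This is a sketch rather than part of the SDK: `smallestVariant` is a hypothetical helper, and the `task` labels are inferred from the model descriptions (the label for Xenova/bert-base-uncased in particular is an assumption, since this page does not state its task).

```typescript
// Hypothetical helper, not LocalMode API. Sizes and quality tiers come from the
// variant table; task labels are inferred and should be treated as assumptions.
type Task = "summarization" | "fill-mask" | "question-answering";
type QualityTier = "Good" | "High";

interface Variant {
  id: string;
  task: Task;
  sizeMB: number;
  quality: QualityTier;
}

const VARIANTS: Variant[] = [
  { id: "Xenova/distilbart-cnn-6-6", task: "summarization", sizeMB: 284, quality: "High" },
  { id: "Xenova/distilbart-cnn-12-6", task: "summarization", sizeMB: 360, quality: "High" },
  { id: "onnx-community/ModernBERT-base-ONNX", task: "fill-mask", sizeMB: 140, quality: "High" },
  { id: "Xenova/bert-base-uncased", task: "fill-mask", sizeMB: 96, quality: "Good" },
  { id: "Xenova/distilbert-base-cased-distilled-squad", task: "question-answering", sizeMB: 65, quality: "Good" },
];

const QUALITY_RANK: Record<QualityTier, number> = { Good: 1, High: 2 };

// Return the smallest variant for a task that meets the minimum quality tier
// and fits within an optional download budget, or undefined if none qualifies.
function smallestVariant(
  task: Task,
  minQuality: QualityTier,
  maxSizeMB: number = Infinity,
): Variant | undefined {
  return VARIANTS
    .filter((v) =>
      v.task === task &&
      QUALITY_RANK[v.quality] >= QUALITY_RANK[minQuality] &&
      v.sizeMB <= maxSizeMB)
    .sort((a, b) => a.sizeMB - b.sizeMB)[0];
}

console.log(smallestVariant("fill-mask", "Good")?.id);
console.log(smallestVariant("fill-mask", "High")?.id);
console.log(smallestVariant("summarization", "High", 200));
```

The third call returns `undefined` because no summarization variant fits a 200MB budget, which is a useful signal to either relax the budget or skip the feature on constrained devices.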

Provider-Specific Code Examples

The summarization variants in this family use the SummarizationModel interface from @localmode/core. Switching between variants requires changing only the model ID (and the import, if the provider differs) - no application logic changes.

Transformers.js

Transformers.js runs ONNX-optimized models via ONNX Runtime Web. It uses WebGPU acceleration where available and falls back to WASM otherwise.

import { transformers } from '@localmode/transformers';

const model = transformers.summarizer('Xenova/distilbart-cnn-6-6');
// Use the model with the corresponding @localmode/core function

Fallback Pattern

For maximum browser compatibility, wrap model loading in a try/catch: attempt the preferred model first, and fall back to a smaller variant if it fails to load.

import { transformers } from '@localmode/transformers';
import { summarize } from '@localmode/core';

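// Try the higher-quality model, fall back to the smaller one on failure.
// Note: this only catches errors thrown while the model object is created.
// If loading is deferred until first use, also wrap the first summarize() call.
let model;
try {
  model = transformers.summarizer('Xenova/distilbart-cnn-12-6');
} catch (error) {
  console.warn('Primary model failed, using fallback:', error);
  model = transformers.summarizer('Xenova/distilbart-cnn-6-6');
}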
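When loading is asynchronous, the same idea extends to an ordered list of candidates that are tried in turn. The sketch below is a generic pattern under stated assumptions: `loadFirstAvailable` and the `Loader` type are hypothetical, and `fakeLoad` is a stand-in that simulates a failure. In a real application the loader would wrap whatever call actually downloads and initializes the model.

```typescript
// Hypothetical pattern, not LocalMode API.
// A loader resolves with a ready model or rejects if the model cannot be loaded.
type Loader<M> = (modelId: string) => Promise<M>;

// Try each model ID in order and return the first one that loads.
async function loadFirstAvailable<M>(
  modelIds: string[],
  load: Loader<M>,
): Promise<{ modelId: string; model: M }> {
  const failures: string[] = [];
  for (const modelId of modelIds) {
    try {
      const model = await load(modelId);
      return { modelId, model };
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      failures.push(`${modelId}: ${reason}`);
    }
  }
  throw new Error(`No model could be loaded. ${failures.join("; ")}`);
}

// Stand-in loader for demonstration: pretends the larger model fails to load.
const fakeLoad: Loader<string> = async (modelId) => {
  if (modelId === "Xenova/distilbart-cnn-12-6") {
    throw new Error("simulated out-of-memory failure");
  }
  return `loaded ${modelId}`;
};

const loadResult = loadFirstAvailable(
  ["Xenova/distilbart-cnn-12-6", "Xenova/distilbart-cnn-6-6"],
  fakeLoad,
);
loadResult.then(({ modelId }) => console.log("Using", modelId));
```

Collecting the failure reasons keeps the final error informative when every candidate fails, which helps when diagnosing problems on low-memory devices.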

When to Use Specialized NLP Models

The Specialized NLP Models family is a strong choice when:

  • You need summarization - The DistilBART variants are optimized for summarization and come in two sizes (284MB and 360MB).
  • Browser compatibility matters - The models run through a single provider (Transformers.js) that covers Chrome, Firefox, Safari, and Edge.
  • Size flexibility is important - The 65MB–360MB range means you can target everything from mobile devices to high-end desktops with the same model family.

Methodology

Model IDs and size figures were verified against packages/transformers/src/models.ts (the LocalMode source of truth) and confirmed against the ONNX file listings on each model's HuggingFace repository. For seq2seq models (DistilBART), the reported size is the combined quantized encoder + decoder footprint loaded at runtime. Parameter counts, benchmark scores (ROUGE, SQuAD F1), and context lengths were sourced from the official HuggingFace model cards linked above. Performance tiers (speed and quality) are LocalMode's curated assessments based on parameter count, quantization, and architecture - always benchmark on your target devices before production deployment.
