Evaluation

Measure model quality with pure math metric functions and a model evaluation orchestrator.

Evaluate model quality entirely in the browser. Compare embedding models, score classifiers, measure retrieval accuracy, and assess text generation quality -- all offline, all private.

See it in action

Try Model Evaluator for a working demo of these APIs.

All metric functions are pure math with zero dependencies. They run synchronously and are fast enough for most practical dataset sizes in the browser.

Classification Metrics

accuracy()

Compute the fraction of correct predictions:

import { accuracy } from '@localmode/core';

const score = accuracy(
  ['cat', 'dog', 'bird', 'fish'],
  ['cat', 'dog', 'fish', 'bird'],
);
// score === 0.5

precision()

Compute macro-averaged precision across all classes:

import { precision } from '@localmode/core';

const score = precision(['pos', 'neg'], ['pos', 'neg']);
// score === 1.0

For each class, computes TP / (TP + FP), then averages. Classes with zero predictions have precision 0.
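The macro-averaging rule above can be sketched in plain TypeScript. This is a simplified standalone reimplementation for illustration, not the library's code; recall is symmetric, with FN counted in place of FP:

```typescript
// Macro-averaged precision: per-class TP / (TP + FP), then the mean.
// Classes that are never predicted contribute 0, as documented above.
function macroPrecision(predictions: string[], labels: string[]): number {
  const classes = new Set([...predictions, ...labels]);
  let sum = 0;
  for (const cls of classes) {
    let tp = 0;
    let fp = 0;
    for (let i = 0; i < predictions.length; i++) {
      if (predictions[i] === cls) {
        if (labels[i] === cls) tp++;
        else fp++;
      }
    }
    sum += tp + fp === 0 ? 0 : tp / (tp + fp);
  }
  return sum / classes.size;
}

// 'pos' has precision 0.5 (1 TP, 1 FP); 'neg' is never predicted, so 0.
// Macro average: (0.5 + 0) / 2 = 0.25
console.log(macroPrecision(['pos', 'pos'], ['pos', 'neg'])); // 0.25
```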

recall()

Compute macro-averaged recall across all classes:

import { recall } from '@localmode/core';

const score = recall(['pos', 'neg'], ['pos', 'neg']);
// score === 1.0

For each class, computes TP / (TP + FN), then averages. Classes with zero actual instances have recall 0.

f1Score()

Compute F1 score with configurable averaging:

import { f1Score } from '@localmode/core';

const score = f1Score(['cat', 'dog'], ['cat', 'dog']);
// score === 1.0 (macro average of per-class F1)

// Micro F1 equals accuracy for single-label classification
const micro = f1Score(predictions, labels, { average: 'micro' });

// Weighted by class support (number of true instances)
const weighted = f1Score(predictions, labels, { average: 'weighted' });
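The claim that micro F1 equals accuracy for single-label classification can be checked with a standalone sketch (my own reimplementation of the standard definitions, not the library's code):

```typescript
// In single-label classification, every prediction is either a TP (correct)
// or simultaneously one FP (for the predicted class) and one FN (for the
// true class). So micro precision = micro recall = accuracy, and micro F1
// (their harmonic mean) equals accuracy too.
function microF1(predictions: string[], labels: string[]): number {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (let i = 0; i < predictions.length; i++) {
    if (predictions[i] === labels[i]) tp++;
    else { fp++; fn++; }
  }
  if (tp === 0) return 0;
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  return (2 * precision * recall) / (precision + recall);
}

function simpleAccuracy(predictions: string[], labels: string[]): number {
  let correct = 0;
  for (let i = 0; i < predictions.length; i++) {
    if (predictions[i] === labels[i]) correct++;
  }
  return correct / predictions.length;
}

const preds = ['cat', 'dog', 'bird', 'fish'];
const gold = ['cat', 'dog', 'fish', 'bird'];
console.log(microF1(preds, gold));        // 0.5
console.log(simpleAccuracy(preds, gold)); // 0.5
```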

confusionMatrix()

Build a structured confusion matrix with helper methods:

import { confusionMatrix } from '@localmode/core';

const cm = confusionMatrix(
  ['pos', 'neg', 'pos', 'pos'],
  ['pos', 'pos', 'neg', 'pos'],
);

console.log(cm.labels);                // ['neg', 'pos']
console.log(cm.matrix);                // 2D count array
console.log(cm.truePositives('pos'));  // 2
console.log(cm.falsePositives('pos')); // 1
console.log(cm.trueNegatives('pos'));  // 0
console.log(cm.falseNegatives('pos')); // 1

Text Generation Metrics

bleuScore()

Compute BLEU-4 score for text generation evaluation:

import { bleuScore } from '@localmode/core';

const score = bleuScore(
  'the cat sat on the mat',
  ['the cat sat on the mat'],
);
// score === 1.0

// Multiple references (uses max n-gram count from any reference)
const multi = bleuScore(candidate, [ref1, ref2, ref3]);

Uses whitespace tokenization. Scores are directionally correct for model comparison but may differ from NLTK BLEU absolute values.
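To make the mechanics concrete, here is a standalone sketch of the simplest case, BLEU-1: clipped unigram precision times the brevity penalty. This is an illustration of the scoring idea, not the library's implementation; BLEU-4 extends it to 2-, 3-, and 4-grams combined via a geometric mean:

```typescript
// BLEU-1 sketch: clipped unigram precision × brevity penalty.
// Whitespace tokenization, matching the limitation noted above.
function bleu1(candidate: string, references: string[]): number {
  const candTokens = candidate.split(/\s+/).filter(Boolean);
  const refTokenLists = references.map((r) => r.split(/\s+/).filter(Boolean));

  // Clip each candidate token's count by its max count in any reference.
  const maxRefCounts = new Map<string, number>();
  for (const tokens of refTokenLists) {
    const counts = new Map<string, number>();
    for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);
    for (const [t, c] of counts) {
      maxRefCounts.set(t, Math.max(maxRefCounts.get(t) ?? 0, c));
    }
  }
  const candCounts = new Map<string, number>();
  for (const t of candTokens) candCounts.set(t, (candCounts.get(t) ?? 0) + 1);

  let clipped = 0;
  for (const [t, c] of candCounts) {
    clipped += Math.min(c, maxRefCounts.get(t) ?? 0);
  }
  const precision = clipped / candTokens.length;

  // Brevity penalty: penalize candidates shorter than the closest reference.
  const refLen = refTokenLists
    .map((r) => r.length)
    .reduce((best, len) =>
      Math.abs(len - candTokens.length) < Math.abs(best - candTokens.length)
        ? len
        : best);
  const bp =
    candTokens.length >= refLen ? 1 : Math.exp(1 - refLen / candTokens.length);

  return bp * precision;
}

console.log(bleu1('the cat sat on the mat', ['the cat sat on the mat'])); // 1
```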

rougeScore()

Compute ROUGE score for summarization evaluation:

import { rougeScore } from '@localmode/core';

const rouge1 = rougeScore('the cat sat', 'the cat sat');
// rouge1 === 1.0 (unigram overlap F1)

const rouge2 = rougeScore(
  'the cat sat on the mat',
  'the cat sat on a mat',
  { type: 'rouge-2' },
);
// Bigram overlap F1

const rougeL = rougeScore(
  'the cat is on the mat',
  'the cat sat on the mat',
  { type: 'rouge-l' },
);
// Longest common subsequence F1
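The unigram case can be sketched in a few lines. This is a standalone reimplementation for illustration (not the library's code), using the same whitespace tokenization:

```typescript
// ROUGE-1 sketch: unigram overlap F1 between candidate and reference.
// Each reference token can be matched at most once.
function rouge1Sketch(candidate: string, reference: string): number {
  const cand = candidate.split(/\s+/).filter(Boolean);
  const ref = reference.split(/\s+/).filter(Boolean);

  const refCounts = new Map<string, number>();
  for (const t of ref) refCounts.set(t, (refCounts.get(t) ?? 0) + 1);

  let overlap = 0;
  for (const t of cand) {
    const remaining = refCounts.get(t) ?? 0;
    if (remaining > 0) {
      overlap++;
      refCounts.set(t, remaining - 1);
    }
  }

  const precision = overlap / cand.length;
  const recall = overlap / ref.length;
  if (precision + recall === 0) return 0;
  return (2 * precision * recall) / (precision + recall);
}

console.log(rouge1Sketch('the cat sat', 'the cat sat')); // 1
```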

Retrieval Metrics

mrr()

Compute Mean Reciprocal Rank for retrieval evaluation:

import { mrr } from '@localmode/core';

const score = mrr(
  [['a', 'b', 'c'], ['d', 'e', 'f']],
  [['b'], ['f']],
);
// score === (1/2 + 1/3) / 2
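The computation behind that result is simple enough to sketch standalone (an illustrative reimplementation, not the library's code):

```typescript
// MRR sketch: for each query, take 1/rank of the first relevant result
// (0 if nothing relevant was retrieved), then average across queries.
function meanReciprocalRank(
  rankings: string[][],
  relevant: string[][],
): number {
  let sum = 0;
  for (let q = 0; q < rankings.length; q++) {
    const relevantSet = new Set(relevant[q]);
    const rank = rankings[q].findIndex((id) => relevantSet.has(id));
    sum += rank === -1 ? 0 : 1 / (rank + 1);
  }
  return sum / rankings.length;
}

const score = meanReciprocalRank(
  [['a', 'b', 'c'], ['d', 'e', 'f']],
  [['b'], ['f']],
);
console.log(score); // (1/2 + 1/3) / 2 ≈ 0.4167
```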

ndcg()

Compute Normalized Discounted Cumulative Gain:

import { ndcg } from '@localmode/core';

// Perfect ranking
const score = ndcg(['a', 'b', 'c'], { a: 3, b: 2, c: 1 });
// score === 1.0

// NDCG at k=5
const atK = ndcg(rankedResults, relevanceScores, 5);
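The idea can be sketched standalone: DCG of the actual ranking divided by the DCG of the ideal (relevance-sorted) ranking. This sketch uses the common linear-gain form rel / log2(position + 1); the library's exact gain function may differ, but a perfect ranking scores 1.0 either way:

```typescript
// NDCG sketch: DCG(actual) / DCG(ideal), with linear gain and log2 discount.
function ndcgSketch(
  ranked: string[],
  relevance: Record<string, number>,
  k: number = ranked.length,
): number {
  const dcg = (ids: string[]) =>
    ids
      .slice(0, k)
      .reduce((sum, id, i) => sum + (relevance[id] ?? 0) / Math.log2(i + 2), 0);
  const ideal = Object.keys(relevance).sort(
    (a, b) => relevance[b] - relevance[a],
  );
  const idealDcg = dcg(ideal);
  return idealDcg === 0 ? 0 : dcg(ranked) / idealDcg;
}

console.log(ndcgSketch(['a', 'b', 'c'], { a: 3, b: 2, c: 1 })); // 1
console.log(ndcgSketch(['c', 'b', 'a'], { a: 3, b: 2, c: 1 })); // < 1
```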

Vector Quality Metrics

evalCosineDistance()

Compute cosine distance between two vectors:

import { evalCosineDistance } from '@localmode/core';

const dist = evalCosineDistance(
  new Float32Array([1, 0, 0]),
  new Float32Array([1, 0, 0]),
);
// dist === 0.0 (identical direction)

Returns a value between 0 (identical) and 2 (opposite). Returns 1.0 for zero vectors.

Exported as evalCosineDistance to avoid naming conflict with the HNSW cosineDistance function also exported from @localmode/core.
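The math behind the documented range is just 1 minus cosine similarity; a standalone sketch (illustrative, not the library's implementation):

```typescript
// Cosine distance sketch: 1 − cosine similarity, so 0 means identical
// direction and 2 means opposite direction. Zero vectors map to 1.0,
// matching the documented behavior above.
function cosineDistanceSketch(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 1.0;
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineDistanceSketch(new Float32Array([1, 0]), new Float32Array([1, 0])));  // 0
console.log(cosineDistanceSketch(new Float32Array([1, 0]), new Float32Array([-1, 0]))); // 2
```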

Model Evaluation Orchestrator

evaluateModel()

Run a model against a dataset, apply a metric, and get a structured report:

import { evaluateModel, accuracy } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.classifier('Xenova/distilbert-base-uncased-finetuned-sst-2-english');

const result = await evaluateModel({
  dataset: {
    inputs: ['great movie', 'terrible film', 'okay show'],
    expected: ['POSITIVE', 'NEGATIVE', 'NEGATIVE'],
  },
  predict: async (text, signal) => {
    const { classify } = await import('@localmode/core');
    const { label } = await classify({ model, text, abortSignal: signal });
    return label;
  },
  metric: accuracy,
});

console.log(result.score);       // 0.67
console.log(result.predictions); // ['POSITIVE', 'NEGATIVE', 'POSITIVE']
console.log(result.datasetSize); // 3
console.log(result.durationMs);  // 1234

Report progress as the evaluation runs:

const result = await evaluateModel({
  dataset,
  predict,
  metric: f1Score,
  onProgress: (completed, total) => {
    console.log(`Evaluating: ${completed}/${total}`);
  },
});

Cancel a long-running evaluation with an AbortSignal:

const controller = new AbortController();

// Cancel after 10 seconds
setTimeout(() => controller.abort(), 10_000);

try {
  const result = await evaluateModel({
    dataset,
    predict,
    metric: accuracy,
    abortSignal: controller.signal,
  });
} catch (error) {
  console.log('Evaluation cancelled');
}

Custom Metrics

Create custom metric functions matching the MetricFunction type:

import type { MetricFunction } from '@localmode/core';
import { evaluateModel } from '@localmode/core';

// Custom metric: exact match ignoring case
const caseInsensitiveAccuracy: MetricFunction<string, string> = (predictions, labels) => {
  let correct = 0;
  for (let i = 0; i < predictions.length; i++) {
    if (predictions[i].toLowerCase() === labels[i].toLowerCase()) correct++;
  }
  return correct / predictions.length;
};

const result = await evaluateModel({
  dataset,
  predict,
  metric: caseInsensitiveAccuracy,
});
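Custom metrics are not limited to string labels. Assuming MetricFunction is generic over the prediction and label types (as the string-typed signature above suggests), a numeric metric such as mean absolute error fits the same shape. The sketch below mirrors that shape with a local type alias so it stands alone:

```typescript
// Local stand-in for the metric function shape, mirrored here for
// illustration (the real MetricFunction type is exported from
// '@localmode/core').
type Metric<P, L> = (predictions: P[], labels: L[]) => number;

// Mean absolute error for numeric predictions (e.g. regression-style scores).
const meanAbsoluteError: Metric<number, number> = (predictions, labels) => {
  let total = 0;
  for (let i = 0; i < predictions.length; i++) {
    total += Math.abs(predictions[i] - labels[i]);
  }
  return total / predictions.length;
};

console.log(meanAbsoluteError([1, 2, 4], [1, 2, 2])); // ≈ 0.667
```

Note that unlike accuracy or F1, mean absolute error is a lower-is-better score, so interpret the resulting report accordingly.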

React Hook

useEvaluateModel()

Wraps evaluateModel() with loading/error/cancel state:

import { useEvaluateModel } from '@localmode/react';
import { accuracy } from '@localmode/core';

function EvalPanel() {
  const { data, isLoading, error, execute, cancel, reset } = useEvaluateModel();

  const runEval = () =>
    execute({
      dataset: { inputs: texts, expected: labels },
      predict: async (text, signal) => {
        const { classify } = await import('@localmode/core');
        const { label } = await classify({ model, text, abortSignal: signal });
        return label;
      },
      metric: accuracy,
    });

  return (
    <div>
      <button onClick={runEval} disabled={isLoading}>Evaluate</button>
      {isLoading && <button onClick={cancel}>Cancel</button>}
      {data && <p>Score: {data.score.toFixed(3)}</p>}
      {error && <p>Error: {error.message}</p>}
    </div>
  );
}

Error Handling

All metric functions throw ValidationError on invalid inputs:

import { accuracy } from '@localmode/core';

// Empty arrays
accuracy([], []);
// => ValidationError: accuracy requires at least one prediction

// Mismatched lengths
accuracy(['a', 'b'], ['a']);
// => ValidationError: accuracy requires predictions and labels to have equal length

// Each error includes a hint:
// hint: "Ensure predictions and labels arrays have the same length."

Limitations

  • Whitespace tokenization: bleuScore() and rougeScore() split on whitespace. No stemming, stopword removal, or language-specific tokenization.
  • Simplified BLEU: Implements BLEU-4 with brevity penalty. Scores are directionally correct but may differ from NLTK BLEU.
  • Simplified ROUGE: ROUGE-1, ROUGE-2, and ROUGE-L only. No stemming or stopword removal.
  • Synchronous metrics: Metric functions are synchronous (pure math). For very large datasets (100K+ items), consider running in a Web Worker.

Showcase Apps

App: Model Evaluator
Description: Evaluate model accuracy, precision, recall, and F1
Links: Demo · Source
