Evaluation
Measure model quality with pure math metric functions and a model evaluation orchestrator.
Evaluate model quality entirely in the browser. Compare embedding models, score classifiers, measure retrieval accuracy, and assess text generation quality -- all offline, all private.
See it in action
Try Model Evaluator for a working demo of these APIs.
All metric functions are pure math with zero dependencies. They run synchronously and handle datasets of any practical size in the browser.
Classification Metrics
accuracy()
Compute the fraction of correct predictions:
import { accuracy } from '@localmode/core';
const score = accuracy(
['cat', 'dog', 'bird', 'fish'],
['cat', 'dog', 'fish', 'bird'],
);
// score === 0.5
precision()
Compute macro-averaged precision across all classes:
import { precision } from '@localmode/core';
const score = precision(['pos', 'neg'], ['pos', 'neg']);
// score === 1.0

For each class, computes TP / (TP + FP), then averages. Classes with zero predictions have precision 0.
recall()
Compute macro-averaged recall across all classes:
import { recall } from '@localmode/core';
const score = recall(['pos', 'neg'], ['pos', 'neg']);
// score === 1.0

For each class, computes TP / (TP + FN), then averages. Classes with zero actual instances have recall 0.
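The macro-averaging behind precision() and recall() can be sketched in plain TypeScript. This is an illustrative sketch of the formulas described above, not the library's actual implementation:

```typescript
// For each class: precision = TP / (TP + FP), recall = TP / (TP + FN),
// then average across classes. Classes with zero predictions (or zero
// actual instances) contribute 0, matching the behavior described above.
function macroScores(predictions: string[], labels: string[]) {
  const classes = [...new Set([...predictions, ...labels])];
  let precisionSum = 0;
  let recallSum = 0;
  for (const cls of classes) {
    let tp = 0, fp = 0, fn = 0;
    for (let i = 0; i < predictions.length; i++) {
      if (predictions[i] === cls && labels[i] === cls) tp++;
      else if (predictions[i] === cls) fp++;
      else if (labels[i] === cls) fn++;
    }
    precisionSum += tp + fp === 0 ? 0 : tp / (tp + fp);
    recallSum += tp + fn === 0 ? 0 : tp / (tp + fn);
  }
  return {
    precision: precisionSum / classes.length,
    recall: recallSum / classes.length,
  };
}

// Class 'a': precision 1/2, recall 1/1. Class 'b': precision 1/1, recall 1/2.
const { precision: p, recall: r } = macroScores(['a', 'a', 'b'], ['a', 'b', 'b']);
// p === 0.75, r === 0.75
```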
f1Score()
Compute F1 score with configurable averaging:
import { f1Score } from '@localmode/core';
const score = f1Score(['cat', 'dog'], ['cat', 'dog']);
// score === 1.0 (macro average of per-class F1)

import { f1Score } from '@localmode/core';
// Micro F1 equals accuracy for single-label classification
const score = f1Score(predictions, labels, { average: 'micro' });

import { f1Score } from '@localmode/core';
// Weighted by class support (number of true instances)
const score = f1Score(predictions, labels, { average: 'weighted' });
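Why micro F1 equals accuracy for single-label classification can be seen with a small self-contained sketch (not the library's code): each wrong prediction is simultaneously a false positive for the predicted class and a false negative for the true class, so micro precision, micro recall, and accuracy all coincide.

```typescript
// Micro averaging pools TP/FP/FN over all classes before computing F1.
function microF1(predictions: string[], labels: string[]): number {
  let tp = 0, fp = 0, fn = 0;
  for (let i = 0; i < predictions.length; i++) {
    if (predictions[i] === labels[i]) tp++;
    else { fp++; fn++; } // wrong prediction: FP for predicted class, FN for true class
  }
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  return (2 * precision * recall) / (precision + recall);
}

const preds = ['cat', 'dog', 'cat', 'bird'];
const gold = ['cat', 'dog', 'bird', 'bird'];
console.log(microF1(preds, gold)); // 0.75 -- same as accuracy (3 of 4 correct)
```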
confusionMatrix()
Build a structured confusion matrix with helper methods:
import { confusionMatrix } from '@localmode/core';
const cm = confusionMatrix(
['pos', 'neg', 'pos', 'pos'],
['pos', 'pos', 'neg', 'pos'],
);
console.log(cm.labels); // ['neg', 'pos']
console.log(cm.matrix); // 2D count array
console.log(cm.truePositives('pos')); // 2
console.log(cm.falsePositives('pos')); // 1
console.log(cm.trueNegatives('pos')); // 0
console.log(cm.falseNegatives('pos')); // 1
Text Generation Metrics
bleuScore()
Compute BLEU-4 score for text generation evaluation:
import { bleuScore } from '@localmode/core';
const score = bleuScore(
'the cat sat on the mat',
['the cat sat on the mat'],
);
// score === 1.0
// Multiple references (uses max n-gram count from any reference)
const multi = bleuScore(candidate, [ref1, ref2, ref3]);
Uses whitespace tokenization. Scores are directionally correct for model comparison but may differ from NLTK BLEU absolute values.
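Two of the BLEU building blocks mentioned here (clipped n-gram matching against multiple references, and the brevity penalty) can be sketched in plain TypeScript, assuming simple whitespace tokenization. This is illustrative only, not the library's implementation:

```typescript
// Count candidate unigrams, clipping each token's count at the maximum
// count observed in any single reference ("uses max n-gram count").
function clippedUnigramMatches(candidate: string, references: string[]): number {
  const maxRefCounts = new Map<string, number>();
  for (const ref of references) {
    const counts = new Map<string, number>();
    for (const tok of ref.split(/\s+/)) counts.set(tok, (counts.get(tok) ?? 0) + 1);
    for (const [tok, n] of counts) {
      maxRefCounts.set(tok, Math.max(maxRefCounts.get(tok) ?? 0, n));
    }
  }
  let matches = 0;
  const used = new Map<string, number>();
  for (const tok of candidate.split(/\s+/)) {
    const seen = (used.get(tok) ?? 0) + 1;
    used.set(tok, seen);
    if (seen <= (maxRefCounts.get(tok) ?? 0)) matches++; // clip at reference count
  }
  return matches;
}

// Brevity penalty: 1 if the candidate is at least as long as the
// reference, exp(1 - r/c) otherwise.
function brevityPenalty(candidateLen: number, referenceLen: number): number {
  return candidateLen >= referenceLen ? 1 : Math.exp(1 - referenceLen / candidateLen);
}

console.log(clippedUnigramMatches('the the the cat', ['the cat sat'])); // 2 ('the' clipped to 1, plus 'cat')
console.log(brevityPenalty(3, 6)); // exp(-1), about 0.368
```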
rougeScore()
Compute ROUGE score for summarization evaluation:
import { rougeScore } from '@localmode/core';
const score = rougeScore('the cat sat', 'the cat sat');
// score === 1.0 (unigram overlap F1)

import { rougeScore } from '@localmode/core';
const score = rougeScore(
'the cat sat on the mat',
'the cat sat on a mat',
{ type: 'rouge-2' },
);
// Bigram overlap F1

import { rougeScore } from '@localmode/core';
const score = rougeScore(
'the cat is on the mat',
'the cat sat on the mat',
{ type: 'rouge-l' },
);
// Longest common subsequence F1
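The LCS-based F1 behind ROUGE-L can be sketched in a few lines, assuming whitespace tokenization. This is an illustrative sketch, not the library's implementation:

```typescript
// Classic dynamic-programming longest-common-subsequence length.
function lcsLength(a: string[], b: string[]): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = a[i - 1] === b[j - 1]
        ? dp[i - 1][j - 1] + 1
        : Math.max(dp[i - 1][j], dp[i][j - 1]);
    }
  }
  return dp[a.length][b.length];
}

// ROUGE-L F1: precision = LCS / |candidate|, recall = LCS / |reference|.
function rougeLF1(candidate: string, reference: string): number {
  const cand = candidate.split(/\s+/);
  const ref = reference.split(/\s+/);
  const lcs = lcsLength(cand, ref);
  const precision = lcs / cand.length;
  const recall = lcs / ref.length;
  if (precision + recall === 0) return 0;
  return (2 * precision * recall) / (precision + recall);
}

console.log(rougeLF1('the cat is on the mat', 'the cat sat on the mat'));
// LCS is 'the cat on the mat' (5 tokens), so precision = recall = F1 = 5/6
```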
Retrieval Metrics
mrr()
Compute Mean Reciprocal Rank for retrieval evaluation:
import { mrr } from '@localmode/core';
const score = mrr(
[['a', 'b', 'c'], ['d', 'e', 'f']],
[['b'], ['f']],
);
// score === (1/2 + 1/3) / 2
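The MRR computation above can be sketched in plain TypeScript (illustrative only, not the library's implementation): for each query, take 1 / rank of the first relevant result (0 if none is retrieved), then average over queries.

```typescript
function meanReciprocalRank(rankings: string[][], relevant: string[][]): number {
  let sum = 0;
  for (let i = 0; i < rankings.length; i++) {
    const relevantSet = new Set(relevant[i]);
    const rank = rankings[i].findIndex((id) => relevantSet.has(id));
    if (rank !== -1) sum += 1 / (rank + 1); // findIndex is 0-based; ranks are 1-based
  }
  return sum / rankings.length;
}

console.log(meanReciprocalRank([['a', 'b', 'c'], ['d', 'e', 'f']], [['b'], ['f']]));
// (1/2 + 1/3) / 2 = 5/12, about 0.417
```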
ndcg()
Compute Normalized Discounted Cumulative Gain:
import { ndcg } from '@localmode/core';
// Perfect ranking
const score = ndcg(['a', 'b', 'c'], { a: 3, b: 2, c: 1 });
// score === 1.0
// NDCG at k=5
const atK = ndcg(rankedResults, relevanceScores, 5);
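NDCG can be sketched as follows (illustrative, not the library's implementation): DCG sums graded relevance discounted by log2(position + 1), and NDCG divides by the DCG of the ideal, relevance-sorted ranking.

```typescript
// DCG: relevance at position i (0-based) is discounted by log2(i + 2).
function dcg(relevances: number[]): number {
  return relevances.reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

function ndcgScore(ranked: string[], relevance: Record<string, number>, k?: number): number {
  const cutoff = k ?? ranked.length;
  const gains = ranked.slice(0, cutoff).map((id) => relevance[id] ?? 0);
  // Ideal ranking: relevance scores sorted descending.
  const ideal = Object.values(relevance).sort((a, b) => b - a).slice(0, cutoff);
  const idealDcg = dcg(ideal);
  return idealDcg === 0 ? 0 : dcg(gains) / idealDcg;
}

console.log(ndcgScore(['a', 'b', 'c'], { a: 3, b: 2, c: 1 })); // 1 (perfect ranking)
console.log(ndcgScore(['c', 'b', 'a'], { a: 3, b: 2, c: 1 })); // < 1 (reversed ranking)
```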
Vector Quality Metrics
evalCosineDistance()
Compute cosine distance between two vectors:
import { evalCosineDistance } from '@localmode/core';
const dist = evalCosineDistance(
new Float32Array([1, 0, 0]),
new Float32Array([1, 0, 0]),
);
// dist === 0.0 (identical direction)

Returns a value between 0 (identical) and 2 (opposite). Returns 1.0 for zero vectors.
Exported as evalCosineDistance to avoid naming conflict with the HNSW cosineDistance function also exported from @localmode/core.
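The behavior described above (cosine distance as 1 minus cosine similarity, with 1.0 for zero vectors) can be sketched in plain TypeScript. This is illustrative, not the library's implementation:

```typescript
function cosineDistance(a: Float32Array, b: Float32Array): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 1; // zero vectors: defined as 1.0
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineDistance(new Float32Array([1, 0]), new Float32Array([1, 0])));  // 0 (identical)
console.log(cosineDistance(new Float32Array([1, 0]), new Float32Array([-1, 0]))); // 2 (opposite)
console.log(cosineDistance(new Float32Array([1, 0]), new Float32Array([0, 1])));  // 1 (orthogonal)
```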
Model Evaluation Orchestrator
evaluateModel()
Run a model against a dataset, apply a metric, and get a structured report:
import { evaluateModel, accuracy } from '@localmode/core';
import { transformers } from '@localmode/transformers';
const model = transformers.classifier('Xenova/distilbert-base-uncased-finetuned-sst-2-english');
const result = await evaluateModel({
dataset: {
inputs: ['great movie', 'terrible film', 'okay show'],
expected: ['POSITIVE', 'NEGATIVE', 'NEGATIVE'],
},
predict: async (text, signal) => {
const { classify } = await import('@localmode/core');
const { label } = await classify({ model, text, abortSignal: signal });
return label;
},
metric: accuracy,
});
console.log(result.score); // 0.67
console.log(result.predictions); // ['POSITIVE', 'NEGATIVE', 'POSITIVE']
console.log(result.datasetSize); // 3
console.log(result.durationMs); // 1234

const result = await evaluateModel({
dataset,
predict,
metric: f1Score,
onProgress: (completed, total) => {
console.log(`Evaluating: ${completed}/${total}`);
},
});

const controller = new AbortController();
// Cancel after 10 seconds
setTimeout(() => controller.abort(), 10_000);
try {
const result = await evaluateModel({
dataset,
predict,
metric: accuracy,
abortSignal: controller.signal,
});
} catch (error) {
console.log('Evaluation cancelled');
}

EvaluateModelOptions

EvaluateModelResult
Custom Metrics
Create custom metric functions matching the MetricFunction type:
import type { MetricFunction } from '@localmode/core';
import { evaluateModel } from '@localmode/core';
// Custom metric: exact match ignoring case
const caseInsensitiveAccuracy: MetricFunction<string, string> = (predictions, labels) => {
let correct = 0;
for (let i = 0; i < predictions.length; i++) {
if (predictions[i].toLowerCase() === labels[i].toLowerCase()) correct++;
}
return correct / predictions.length;
};
const result = await evaluateModel({
dataset,
predict,
metric: caseInsensitiveAccuracy,
});

React Hook
useEvaluateModel()
Wraps evaluateModel() with loading/error/cancel state:
import { useEvaluateModel } from '@localmode/react';
import { accuracy } from '@localmode/core';
function EvalPanel() {
const { data, isLoading, error, execute, cancel, reset } = useEvaluateModel();
const runEval = () =>
execute({
dataset: { inputs: texts, expected: labels },
predict: async (text, signal) => {
const { classify } = await import('@localmode/core');
const { label } = await classify({ model, text, abortSignal: signal });
return label;
},
metric: accuracy,
});
return (
<div>
<button onClick={runEval} disabled={isLoading}>Evaluate</button>
{isLoading && <button onClick={cancel}>Cancel</button>}
{data && <p>Score: {data.score.toFixed(3)}</p>}
{error && <p>Error: {error.message}</p>}
</div>
);
}

Error Handling
All metric functions throw ValidationError on invalid inputs:
import { accuracy } from '@localmode/core';
// Empty arrays
accuracy([], []);
// => ValidationError: accuracy requires at least one prediction
// Mismatched lengths
accuracy(['a', 'b'], ['a']);
// => ValidationError: accuracy requires predictions and labels to have equal length
// Each error includes a hint:
// hint: "Ensure predictions and labels arrays have the same length."

Limitations
- Whitespace tokenization: bleuScore() and rougeScore() split on whitespace. No stemming, stopword removal, or language-specific tokenization.
- Simplified BLEU: Implements BLEU-4 with brevity penalty. Scores are directionally correct but may differ from NLTK BLEU.
- Simplified ROUGE: ROUGE-1, ROUGE-2, and ROUGE-L only. No stemming or stopword removal.
- Synchronous metrics: Metric functions are synchronous (pure math). For very large datasets (100K+ items), consider running in a Web Worker.