MediaPipe Tasks provider for LocalMode: hand, pose, and face landmark detection, gesture recognition, audio classification, language detection, and more via Google's on-device WASM runtime. Works in any modern browser with WebAssembly and WebGL; WebGPU is not required.

@localmode/mediapipe

Run Google's MediaPipe Tasks directly in the browser. @localmode/mediapipe wraps @mediapipe/tasks-vision, @mediapipe/tasks-audio, and @mediapipe/tasks-text as a single unified LocalMode provider — hand, pose, and face landmarks, gesture recognition, image and audio classification, image embeddings, language detection, and text embeddings, plus real-time streaming trackers for video.

Privacy-first by design

Every task runs entirely on-device via WebAssembly. The camera, microphone, and text never leave the browser. The only network requests are the one-time model and WASM downloads.

Features

13-model catalog — landmarks, gestures, classification, detection, segmentation, embeddings, and language detection, all curated and verified against Google's CDN
Real-time streaming — createHandTracker / createPoseTracker / createFaceTracker / createGestureTracker run over a <video> element at 30–60fps
Universal browser support — pure WebAssembly + WebGL; no WebGPU required
Tiny models — most are under 10MB; face detection and selfie segmentation are ~250KB
Unified LocalMode interface — landmark tasks call detectHands(), detectPose(), etc.; classification/detection/embedding tasks reuse the existing classifyImage(), detectObjects(), embed() functions
AbortSignal cancellation — every single-frame function supports cancellation
GPU or CPU delegate — choose the WebGL GPU delegate (default) or the CPU delegate per provider or per model

What is not included

Two MediaPipe capabilities are not covered. Audio embedding is unavailable because the @mediapipe/tasks-audio package ships only an AudioClassifier; the JS SDK has no audio embedder. The on-device LLM package, @mediapipe/tasks-genai, is intentionally not wrapped; use @localmode/litert or @localmode/wllama for local language model inference instead.

Installation

pnpm install @localmode/mediapipe @localmode/core

npm install @localmode/mediapipe @localmode/core

yarn add @localmode/mediapipe @localmode/core

bun add @localmode/mediapipe @localmode/core

The package depends on @mediapipe/tasks-vision, @mediapipe/tasks-audio, and @mediapipe/tasks-text — all installed automatically. The WASM runtime loads from the jsDelivr CDN by default; see WASM Runtime to self-host it.

Quick Start

Single-frame detection

Detect hand landmarks in a still image:

import { detectHands } from '@localmode/core';
import { mediapipe } from '@localmode/mediapipe';

const { hands } = await detectHands({
  model: mediapipe.handLandmarker(),
  image: imageBlob,
  numHands: 2,
});

for (const hand of hands) {
  console.log(`${hand.handedness} hand — ${hand.landmarks.length} landmarks`);
}
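
MediaPipe's own landmark results are normalized to the image size (x and y in the range 0 to 1). Assuming `hand.landmarks` keeps that convention, which this page does not confirm, a small helper can convert the points to pixel coordinates for drawing. The `NormalizedPoint` and `PixelPoint` types and both function names below are illustrative, not part of @localmode/core:

```typescript
// Assumed landmark shape: normalized coordinates, as in MediaPipe's NormalizedLandmark.
interface NormalizedPoint {
  x: number;
  y: number;
  z?: number;
}

interface PixelPoint {
  x: number;
  y: number;
}

// Scale normalized landmarks to the pixel space of the source image or canvas.
function toPixelCoordinates(
  landmarks: readonly NormalizedPoint[],
  width: number,
  height: number,
): PixelPoint[] {
  return landmarks.map(({ x, y }) => ({ x: x * width, y: y * height }));
}

// Axis-aligned bounding box around a set of pixel points, or null if there are none.
function boundingBox(
  points: readonly PixelPoint[],
): { x: number; y: number; width: number; height: number } | null {
  if (points.length === 0) return null;
  const xs = points.map((p) => p.x);
  const ys = points.map((p) => p.y);
  const minX = Math.min(...xs);
  const minY = Math.min(...ys);
  return {
    x: minX,
    y: minY,
    width: Math.max(...xs) - minX,
    height: Math.max(...ys) - minY,
  };
}
```

With a 640×480 image, `toPixelCoordinates(hand.landmarks, 640, 480)` would give points that can be passed straight to a canvas 2D context.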

Real-time streaming

Track hands live from a webcam <video> element:

import { mediapipe } from '@localmode/mediapipe';

const tracker = mediapipe.createHandTracker({
  video: videoElement,
  numHands: 2,
  onResults: (hands, timestampMs) => {
    // Called once per processed frame (up to ~60fps)
    drawHands(hands); // drawHands is your own rendering function
  },
});

await tracker.start();

// later
tracker.stop();
await tracker.close();
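
Because `onResults` receives a `timestampMs` for every processed frame, the effective tracking rate can be measured on the caller's side. The following is a minimal sketch of a sliding-window FPS meter; `createFpsMeter` is a hypothetical helper, not something the package exports, and it assumes `timestampMs` increases monotonically in milliseconds:

```typescript
// Sliding-window FPS meter fed with per-frame timestamps in milliseconds.
function createFpsMeter(windowSize = 30) {
  const timestamps: number[] = [];
  return {
    // Record one frame and return the average FPS over the current window.
    tick(timestampMs: number): number {
      timestamps.push(timestampMs);
      if (timestamps.length > windowSize) timestamps.shift();
      if (timestamps.length < 2) return 0; // Not enough frames to measure yet.
      const elapsedMs = timestamps[timestamps.length - 1] - timestamps[0];
      return elapsedMs > 0 ? ((timestamps.length - 1) * 1000) / elapsedMs : 0;
    },
  };
}
```

Inside the callback, something like `const fps = meter.tick(timestampMs)` would then give a value to show in an overlay or to decide when to lower the video resolution.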

Tasks

The provider exposes a factory method per task. Landmark and gesture tasks return MediaPipe-specific model interfaces used with new core functions; the remaining tasks return standard LocalMode interfaces and reuse existing core functions.

Landmarks & Gestures

Method	Interface	Core function	Docs
`mediapipe.handLandmarker()`	`HandLandmarkModel`	`detectHands()`	Guide
`mediapipe.poseLandmarker()`	`PoseLandmarkModel`	`detectPose()`	Guide
`mediapipe.faceLandmarker()`	`FaceLandmarkModel`	`detectFaceLandmarks()`	Guide
`mediapipe.faceDetector()`	`FaceDetectionModel`	`detectFace()`	Guide
`mediapipe.gestureRecognizer()`	`GestureRecognitionModel`	`recognizeGesture()`	Guide

Vision (standard core interfaces)

Method	Interface	Core function
`mediapipe.imageClassifier()`	`ImageClassificationModel`	`classifyImage()`
`mediapipe.objectDetector()`	`ObjectDetectionModel`	`detectObjects()`
`mediapipe.imageSegmenter()`	`SegmentationModel`	`segmentImage()`
`mediapipe.imageEmbedder()`	`ImageFeatureModel`	`extractImageFeatures()`

Audio & Text

Method	Interface	Core function	Docs
`mediapipe.audioClassifier()`	`AudioClassificationModel`	`classifyAudio()`	Guide
`mediapipe.textEmbedder()`	`EmbeddingModel`	`embed()` / `embedMany()`	Guide
`mediapipe.languageDetector()`	`LanguageDetectionModel`	`detectLanguage()`	Guide
`mediapipe.textClassifier(modelPath)`	`ClassificationModel`	`classify()`	Guide

textClassifier requires a custom model

MediaPipe ships no default text classifier — mediapipe.textClassifier() requires an explicit custom-trained .tflite model URL (built with MediaPipe Model Maker). Calling it without a path throws a ValidationError. See the Text guide.

Streaming trackers

Factory	Per-frame callback	Docs
`mediapipe.createHandTracker()`	`(hands: HandLandmarkResultItem[], timestampMs)`	Guide
`mediapipe.createPoseTracker()`	`(poses: PoseLandmarkResultItem[], timestampMs)`	Guide
`mediapipe.createFaceTracker()`	`(faces: FaceLandmarkResultItem[], timestampMs)`	Guide
`mediapipe.createGestureTracker()`	`(gestures: GestureResultItem[], timestampMs)`	Guide
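
Per-frame callbacks can flicker between results from one frame to the next, so applications often debounce them before acting. The sketch below reports a label only after it has been seen for several consecutive frames. It works on plain strings because the fields of `GestureResultItem` are not documented on this page; extracting a label from each item is left to the caller, and `createStableLabelFilter` is a hypothetical helper:

```typescript
// Reports a label as stable only after it has been seen for `minFrames` consecutive frames.
function createStableLabelFilter(minFrames = 5) {
  let candidate: string | null = null; // Label seen on the most recent frames.
  let streak = 0; // Number of consecutive frames showing `candidate`.
  let stable: string | null = null; // Last label that passed the threshold.

  return function update(label: string | null): { label: string | null; changed: boolean } {
    if (label === candidate) {
      streak += 1;
    } else {
      candidate = label;
      streak = 1;
    }
    if (streak >= minFrames && stable !== candidate) {
      stable = candidate;
      return { label: stable, changed: true };
    }
    return { label: stable, changed: false };
  };
}
```

Calling `update(...)` once per `onResults` invocation and reacting only when `changed` is true keeps a UI from toggling on single-frame misdetections.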

Model Catalog

MEDIAPIPE_MODELS ships 13 curated models. Every model is verified against Google's public CDN (storage.googleapis.com). Vision models use the .task bundle format; audio and text models are raw .tflite files.

import { MEDIAPIPE_MODELS } from '@localmode/mediapipe';

Catalog ID	Model	Domain	Size
`hand_landmarker`	Hand Landmarker	vision	7.8MB
`pose_landmarker`	Pose Landmarker (Lite)	vision	5.8MB
`pose_landmarker_full`	Pose Landmarker (Full)	vision	9.4MB
`face_landmarker`	Face Landmarker	vision	3.8MB
`face_detector`	Face Detector (BlazeFace)	vision	230KB
`gesture_recognizer`	Gesture Recognizer	vision	8.4MB
`image_classifier`	Image Classifier (EfficientNet-Lite0)	vision	18.6MB
`object_detector`	Object Detector (EfficientDet-Lite0)	vision	7.3MB
`image_segmenter`	Image Segmenter (Selfie)	vision	250KB
`image_embedder`	Image Embedder (MobileNet-V3 Small)	vision	4.1MB
`audio_classifier`	Audio Classifier (YAMNet)	audio	4.1MB
`language_detector`	Language Detector	text	315KB
`text_embedder`	Text Embedder (Universal Sentence Encoder)	text	6.1MB

Each factory method uses its catalog default. Pass a catalog ID or a direct model URL to override:

import { mediapipe } from '@localmode/mediapipe';

// Catalog default
const lite = mediapipe.poseLandmarker();

// Catalog ID — higher-accuracy pose model
const full = mediapipe.poseLandmarker('pose_landmarker_full');

// Direct URL to any compatible MediaPipe model file
const custom = mediapipe.handLandmarker('https://your-cdn.com/hand_landmarker.task');

You can also set an explicit modelPath in the per-model settings, which always takes precedence:

const model = mediapipe.handLandmarker('hand_landmarker', {
  modelPath: 'https://your-cdn.com/hand_landmarker.task',
});
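
The precedence described above (explicit `modelPath`, then the catalog ID or URL argument, then the factory's catalog default) can be illustrated with a small resolver. This is only a sketch of the documented behaviour, not the package's actual code; the catalog map, its placeholder example.com URLs, and `resolveModelUrl` are all hypothetical:

```typescript
// Hypothetical stand-in for the catalog; the URLs are placeholders, not real model locations.
const EXAMPLE_CATALOG = new Map<string, string>([
  ['pose_landmarker', 'https://example.com/models/pose_landmarker_lite.task'],
  ['pose_landmarker_full', 'https://example.com/models/pose_landmarker_full.task'],
]);

function resolveModelUrl(
  defaultId: string,
  idOrUrl?: string,
  settings?: { modelPath?: string },
): string {
  // 1. An explicit modelPath in the per-model settings always wins.
  if (settings?.modelPath) return settings.modelPath;

  // 2. Otherwise use the argument, falling back to the factory's catalog default.
  const requested = idOrUrl ?? defaultId;
  const fromCatalog = EXAMPLE_CATALOG.get(requested);
  if (fromCatalog !== undefined) return fromCatalog;

  // 3. Anything that looks like a URL or absolute path is treated as a direct model file.
  if (/^https?:\/\//.test(requested) || requested.startsWith('/')) return requested;

  throw new Error(`Unknown model ID or URL: ${requested}`);
}
```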

Provider Configuration

Custom provider

Use createMediaPipe() for a provider instance with custom settings, or import the default mediapipe singleton.

import { createMediaPipe } from '@localmode/mediapipe';

const myMediaPipe = createMediaPipe({
  delegate: 'CPU',
  wasmBasePath: '/wasm/mediapipe',
});

const model = myMediaPipe.handLandmarker();

Per-model settings

Every factory method accepts a second settings argument that overrides the provider defaults for that model only:

import { mediapipe } from '@localmode/mediapipe';

const model = mediapipe.handLandmarker('hand_landmarker', {
  modelPath: 'https://your-cdn.com/hand_landmarker.task',
  delegate: 'CPU',
  wasmBasePath: '/wasm/mediapipe/vision',
});

WASM Runtime

MediaPipe Tasks ships a WebAssembly runtime — one binary set per domain (vision, audio, text). By default, @localmode/mediapipe loads it from the jsDelivr CDN, so no extra setup is needed.

To run fully offline, self-host the runtime. Copy the wasm directory from each @mediapipe/tasks-* package into your public assets and point wasmBasePath at it:

import { createMediaPipe } from '@localmode/mediapipe';

const myMediaPipe = createMediaPipe({
  wasmBasePath: {
    vision: '/wasm/tasks-vision',
    audio: '/wasm/tasks-audio',
    text: '/wasm/tasks-text',
  },
});
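
The copy step can be scripted as part of a build. The sketch below computes the source and destination for each domain and delegates the copy to a function you pass in, so it can be dry-run or tested. It assumes a `public/` directory served at the site root, which is a common convention but not something this package requires; `copyWasmRuntimes` and its parameters are hypothetical:

```typescript
type Domain = 'vision' | 'audio' | 'text';

const DOMAINS: readonly Domain[] = ['vision', 'audio', 'text'];

// Copies each package's wasm directory and returns a matching wasmBasePath object.
function copyWasmRuntimes(
  copyDir: (from: string, to: string) => void,
  nodeModulesDir = 'node_modules',
  publicDir = 'public', // Assumed to be served at the site root.
  urlPrefix = '/wasm',
): Record<Domain, string> {
  const basePaths = {} as Record<Domain, string>;
  for (const domain of DOMAINS) {
    const from = `${nodeModulesDir}/@mediapipe/tasks-${domain}/wasm`;
    const to = `${publicDir}${urlPrefix}/tasks-${domain}`;
    copyDir(from, to);
    basePaths[domain] = `${urlPrefix}/tasks-${domain}`;
  }
  return basePaths;
}
```

In a Node build script, `copyDir` could be `(from, to) => cpSync(from, to, { recursive: true })` with `cpSync` imported from `node:fs`; the returned object has the same shape as the `wasmBasePath` value shown above.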

Isolate concurrent audio and vision tasks

The MediaPipe audio and vision WASM runtimes can conflict if run concurrently in the same thread (mediapipe#4737). If your app uses audio classification and a vision task at the same time, run one of them in a Web Worker so each runtime has its own thread. Sequential use (one finishing before the next starts) is unaffected.

Browser Compatibility

MediaPipe Tasks runs on pure WebAssembly with a WebGL GPU delegate — it does not require WebGPU.

Browser	WASM	WebGL (GPU delegate)	Notes
Chrome 80+	✅	✅	Full support
Edge 80+	✅	✅	Full support
Firefox 75+	✅	✅	Full support
Safari 14+	✅	✅	Full support; module workers 16.4+

If the WebGL GPU delegate is unavailable, set delegate: 'CPU' to fall back to CPU inference.
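
One way to choose the delegate up front is to probe for a WebGL context before creating the provider. The following is a rough sketch: `pickDelegate`, `hasWebGL`, and `CanvasLike` are hypothetical names, and the probe only checks that a context can be created, which does not guarantee the GPU delegate will initialize successfully on every device:

```typescript
type Delegate = 'GPU' | 'CPU';

// Minimal structural type so the probe can be exercised without a DOM.
interface CanvasLike {
  getContext(contextId: string): unknown;
}

// True if the canvas can produce a WebGL 2 or WebGL 1 context.
function hasWebGL(createCanvas: () => CanvasLike): boolean {
  try {
    const canvas = createCanvas();
    return Boolean(canvas.getContext('webgl2') ?? canvas.getContext('webgl'));
  } catch {
    return false; // Context creation can throw in locked-down environments.
  }
}

function pickDelegate(createCanvas: () => CanvasLike): Delegate {
  return hasWebGL(createCanvas) ? 'GPU' : 'CPU';
}
```

In a browser this would be called as `pickDelegate(() => document.createElement('canvas'))`, and the result passed as the `delegate` option to `createMediaPipe()`.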

Error Handling

import { detectHands, ModelLoadError, VisionError, ValidationError } from '@localmode/core';
import { mediapipe } from '@localmode/mediapipe';

try {
  const { hands } = await detectHands({
    model: mediapipe.handLandmarker(),
    image: imageBlob,
  });
} catch (error) {
  if (error instanceof ModelLoadError) {
    console.error('Failed to load model:', error.hint);
  } else if (error instanceof VisionError) {
    console.error('Detection failed:', error.hint);
  } else if (error instanceof ValidationError) {
    console.error('Invalid input:', error.hint);
  } else {
    throw error; // Unknown error: rethrow instead of silently swallowing it
  }
}

Single-frame functions retry transient failures up to maxRetries (default 2) before throwing. Streaming trackers report per-frame errors through the onError callback instead of throwing.

App	Description	Links
MediaPipe Studio	Live hand, pose, face, and gesture tracking from your webcam	Demo · Source
