The Transformers.js v4 preview dropped on NPM in early February 2026, and the headline features are worth sitting with before the full release lands. The build speed improvements are real and the new model support is welcome, but the most consequential decision in v4 is the complete rewrite of the WebGPU runtime in C++. That single choice signals something specific about where browser-side ML inference is heading, and it deserves more attention than it typically gets in the announcement cycle.
Why the WebGPU Backend Needed a Rewrite
Transformers.js v3 shipped WebGPU support, but that support ran on top of ONNX Runtime Web’s existing JavaScript and WASM infrastructure. Layering WebGPU onto an existing WASM stack is a reasonable first pass, but it leaves performance on the table: you’re still paying JavaScript’s calling overhead and context-switching costs whenever the GPU needs to be invoked, and the memory layout decisions made for WASM don’t necessarily translate well to GPU memory models.
The v4 approach is different. HuggingFace worked directly with the ONNX Runtime team to write a new WebGPU runtime from scratch in C++. The result compiles to WASM and ships with local caching for offline use, but the core execution path to the GPU is now native. The numbers bear this out: BERT-based embedding models see roughly a 4x speedup using the new com.microsoft.MultiHeadAttention operator, and large models like GPT-OSS 20B at q4f16 quantization run at around 60 tokens per second on an M4 Max. Those are not incremental improvements.
This matters beyond the benchmarks because it closes the architectural gap between what Transformers.js can do in a browser and what a native Python runtime can do on the same hardware. The gap will never fully close given browser sandbox constraints, but the v3-to-v4 jump is substantial enough to unlock use cases that were previously too slow to be practical.
Installing the Preview
The v4 preview is available on NPM under the next tag:
npm install @huggingface/transformers@next
The API surface is largely familiar if you have used v3. The pipeline abstraction works the same way at a high level, but the internals have been reorganized significantly. The codebase moved from a single 8,000-line models.js file to a proper monorepo structure using PNPM workspaces, and the build toolchain switched from Webpack to esbuild. Build times dropped from around 2 seconds to around 200ms. For a library of this complexity, that is a meaningful quality-of-life improvement during development.
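For orientation, the high-level pipeline API in the preview looks essentially as it did in v3. The sketch below is illustrative, not taken from the v4 announcement: the model id (Xenova/all-MiniLM-L6-v2) and the device and dtype option values are assumptions carried over from the v3-era API.

```javascript
// Sketch of the familiar pipeline API, assuming v3-style options.
// Model id and option values are illustrative assumptions.
async function loadEmbedder() {
  // Dynamic import keeps this sketch usable from CommonJS or ESM callers.
  const { pipeline } = await import("@huggingface/transformers");
  return pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2", {
    device: "webgpu", // target the new WebGPU runtime
    dtype: "fp16",
  });
}

async function embed(text) {
  const embedder = await loadEmbedder();
  const output = await embedder(text, { pooling: "mean", normalize: true });
  return output.data; // typed array of embedding values
}
```

The point is that application code written against the v3 pipeline abstraction should carry over with little change; the reorganization is internal.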
Bundle sizes also improved. The transformers.web.js bundle specifically dropped by 53%, and average bundle sizes across the package are down around 10%. That is largely a consequence of the monorepo split making tree-shaking more effective, though the esbuild migration helps too.
The Standalone Tokenizers Package
Alongside v4, HuggingFace extracted a standalone @huggingface/tokenizers package. It is 8.8kB gzipped, has zero dependencies, and ships with full TypeScript types. The API is straightforward:
import { Tokenizer } from "@huggingface/tokenizers";
const [tokenizerJson, tokenizerConfig] = await Promise.all([
  fetch("https://huggingface.co/HuggingFaceTB/SmolLM3-3B/resolve/main/tokenizer.json").then(res => res.json()),
  fetch("https://huggingface.co/HuggingFaceTB/SmolLM3-3B/resolve/main/tokenizer_config.json").then(res => res.json()),
]);
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
const encoded = tokenizer.encode("Hello World");
// { ids: [9906, 4435], tokens: ['Hello', 'ĠWorld'], ... }
This is worth considering separately from the v4 core. Tokenizers are the kind of utility that ends up getting re-implemented in application code constantly, because pulling in a full ML inference library just to tokenize a string is overkill. Having a standalone, dependency-free package fills a real gap in the JavaScript ecosystem. It lets you compute token counts client-side for UI feedback, pre-process text before sending to an API, or build prompt construction tooling, all without shipping inference infrastructure alongside it.
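As a concrete sketch of the token-count use case, a small helper can wrap anything exposing the encode() shape shown above. The helper name and the whitespace-splitting stub below are purely illustrative; in practice you would pass a real Tokenizer instance from @huggingface/tokenizers.

```javascript
// Hypothetical helper: check whether text fits a token budget, given
// any tokenizer whose encode() returns an object with an `ids` array.
function fitsTokenBudget(tokenizer, text, maxTokens) {
  const { ids } = tokenizer.encode(text);
  return { count: ids.length, fits: ids.length <= maxTokens };
}

// Usage with a stub tokenizer (whitespace split) purely for illustration.
const stub = { encode: (text) => ({ ids: text.split(/\s+/) }) };
console.log(fitsTokenBudget(stub, "Hello World", 8)); // { count: 2, fits: true }
```

Because the tokenizer package is dependency-free, a utility like this can ship in a frontend bundle for live character-counter-style UI without dragging in any inference code.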
The 8.8kB size is competitive. tiktoken is available as a WASM port for JavaScript, but it is OpenAI-specific. HuggingFace’s tokenizer format covers a much broader set of models, and having it as a first-party, typed JavaScript package lowers the friction considerably.
What the New Model Support Opens Up
V4 adds support for models above 8B parameters, along with new architectures including Mamba state-space models, Multi-head Latent Attention as used in DeepSeek-style models, and Mixture of Experts. The newly supported model families include GPT-OSS, LFM2-MoE, GraniteMoeHybrid, Olmo3, and FalconH1, among others.
The >8B parameter support deserves specific mention. V3 had practical limitations that made larger models difficult to run in browser contexts, both from memory management and quantization support perspectives. The v4 rewrite, combined with improved WebGPU memory handling, makes running 13B or 20B quantized models a realistic target. At 60 tokens per second for a 20B q4f16 model on M4 hardware, you are looking at response latency comparable to API calls for short prompts, which is the threshold where local inference starts to feel viable for interactive use cases.
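To make that concrete, loading a large quantized model in the preview would look roughly like the following. The model id is a placeholder, not an official artifact name, and the dtype/device values simply mirror the quantization and backend discussed above.

```javascript
// Sketch: text generation with a q4f16-quantized model on WebGPU.
// The model id is a placeholder; option values are assumptions.
async function loadGenerator() {
  const { pipeline } = await import("@huggingface/transformers");
  return pipeline("text-generation", "onnx-community/gpt-oss-20b", {
    dtype: "q4f16", // 4-bit weights, fp16 activations
    device: "webgpu",
  });
}

async function generate(prompt) {
  const generator = await loadGenerator();
  const [result] = await generator(prompt, { max_new_tokens: 128 });
  return result.generated_text;
}
```

The interesting part is what does not appear here: no server, no API key, and after the first load the weights are cached locally for offline use.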
How This Fits Against the Broader Landscape
Transformers.js is not the only option for browser-side ML inference. TensorFlow.js has been around longer and has a larger surface area, but its model ecosystem is separate from the HuggingFace hub and conversion pipelines are more involved. ONNX Runtime Web is what Transformers.js builds on top of, so it functions more as a dependency than a competitor. WebLLM from MLC is perhaps the closest competitor for LLM-specific inference; it also uses WebGPU and has strong performance numbers, but it is scoped to LLMs rather than the full transformer task spectrum.
What Transformers.js has that most alternatives lack is direct, first-party integration with the HuggingFace hub. Loading a model from the hub, running it locally in a browser, and falling back gracefully when the model is not cached, all of that works out of the box. The hub integration alone justifies the package for anyone already in the HuggingFace ecosystem, and v4 makes the runtime capable enough that the tooling story no longer feels like the weak link.
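The fallback behavior can also be made explicit in application code when you want control over it. This sketch assumes the device option accepts "webgpu" and "wasm" values, as in the v3-era API, and simply retries pipeline creation on failure.

```javascript
// Sketch: prefer the WebGPU backend, fall back to WASM if pipeline
// creation fails (e.g. no WebGPU support in the browser).
// Option values are assumptions consistent with the v3-era API.
async function loadWithFallback(task, model) {
  const { pipeline } = await import("@huggingface/transformers");
  try {
    return await pipeline(task, model, { device: "webgpu" });
  } catch (err) {
    console.warn("WebGPU unavailable, falling back to WASM:", err.message);
    return pipeline(task, model, { device: "wasm" });
  }
}
```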
What to Watch Before the Full Release
The v4 preview is stable enough to experiment with, but the HuggingFace team has indicated further changes before the final release. Two areas are worth watching. The first is migration guidance for projects currently on v3, since the monorepo restructuring and TypeScript improvements will likely surface some API changes. The second is expanded platform support documentation for Node.js, Bun, and Deno, all of which are listed as supported targets for the new WebGPU runtime.
The cross-runtime WebGPU support is particularly interesting for server-side use cases. Running transformer inference in Node.js or Deno with GPU acceleration, using the same model artifacts and the same JavaScript API as the browser version, is a meaningful convergence. It means the same application code can run inference locally in the browser during development and on a GPU server in production, without switching between Python inference servers and JavaScript frontends.
The full v4 release date has not been announced. The preview is available now via npm i @huggingface/transformers@next, and the examples repository has been separated into its own repo for v4. Both are worth following if you are building anything that involves running models outside the traditional Python-plus-API-server setup.