Shipping a Transformer Inside a Chrome Extension

Hugging Face recently published a walkthrough on building a Chrome extension with Transformers.js, demonstrating a small extension that runs a sentiment classifier on selected text. The tutorial itself is short, but the architecture it implies is more interesting than the example suggests. Running a transformer model entirely inside a browser extension touches on service workers, ONNX Runtime Web, WebGPU, content security policies, and the strange ergonomics of Manifest V3 (MV3). I want to dig into the parts the post glosses over.

Why an extension is a good fit for Transformers.js

Transformers.js is the JavaScript port of the Hugging Face transformers library, built on top of ONNX Runtime Web. It exposes a pipeline() API that mirrors the Python one closely enough that porting code is mostly a matter of adding await in the right places. Under the hood, models are pulled from the Hub as .onnx files (often quantized to int8 or fp16) and executed via WebAssembly (with SIMD where available) or WebGPU, depending on the backend.

The browser is an awkward host for ML in general because every page load is a cold start. A typical web app has to download the model on first visit, warm the runtime, and pay that cost again whenever the user clears their cache or navigates away. Extensions sidestep most of this. The model lives in the extension’s package or in the extension origin’s cache, the runtime stays resident for as long as the background context is alive, and inference happens with no network round trip. For something like a per-selection sentiment tag or an inline translator, this is the difference between a 3-second wait and a 50ms response.

Manifest V3 and the service worker problem

The one detail the Hugging Face post moves past quickly is that Manifest V3 replaced background pages with service workers. That matters here because an extension service worker is not a long-lived process. Chrome can terminate it after 30 seconds of inactivity, it has no DOM access at all, and ES module support is opt-in: the manifest has to declare the worker with "type": "module", a module worker cannot call importScripts(), and dynamic import() is unavailable in service workers either way.

Loading a transformer in a service worker means the model session has to survive across these terminations, or be cheap enough to reinitialize. Transformers.js caches downloaded model files through the browser’s Cache API, so a restarted worker does not have to fetch them again, but the in-memory session is gone and the first invocation after an idle period pays a reload cost. The fix most extensions land on is to either keep the worker alive with a periodic heartbeat (Chrome 110 removed the hard five-minute lifetime cap, so any extension event or API call resets the idle timer, but the 30-second idle timeout still bites) or to accept the warm-up and show a loading state.
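The "cheap enough to reinitialize" route can be sketched as a lazy loader that rebuilds the session on first use after a restart. This is a minimal sketch, not the tutorial’s code: createLazyClassifier is a hypothetical helper, and the types are structural stand-ins so the block carries no imports. In a real extension you would pass in the pipeline function exported by Transformers.js.

```typescript
// Structural stand-ins for the Transformers.js types (assumptions, kept
// minimal so the sketch has no imports).
type Classification = { label: string; score: number };
type Classifier = (text: string) => Promise<Classification[]>;
type PipelineFactory = (task: string, model: string) => Promise<Classifier>;

// Hypothetical helper: returns a classify() function that loads the model
// on first use and reuses the session afterwards.
function createLazyClassifier(
  loadPipeline: PipelineFactory,
  model = "Xenova/distilbert-base-uncased-finetuned-sst-2-english",
): Classifier {
  // This state lives in worker memory. When Chrome terminates the service
  // worker it is wiped, so the first call after a restart reloads the model.
  let pending: Promise<Classifier> | null = null;

  return async (text) => {
    if (pending === null) {
      // Cache the promise rather than the result, so concurrent callers
      // share a single load instead of starting one each.
      pending = loadPipeline("sentiment-analysis", model).catch((err) => {
        pending = null; // allow a retry after a failed load
        throw err;
      });
    }
    const classifier = await pending;
    return classifier(text);
  };
}
```

A message handler in the worker can then call the returned function for every request and only the first call after a cold start pays the load time, which is the point at which a loading state is worth showing.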

There is another option, and it is the one the tutorial uses: run inference in an offscreen document. Chrome’s chrome.offscreen API was added precisely because service workers cannot touch DOM APIs, and some ONNX backends have historically needed a document context. WebGPU is the notable case: it was not exposed to service workers until Chrome 124, so grabbing a GPUAdapter required a page. An offscreen document gives you a hidden page that can hold the model session, talk to the GPU, and message back to the worker. It is more plumbing, but it unlocks the faster backends on a wider range of Chrome versions.
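The plumbing mostly comes down to making sure exactly one offscreen document exists before messaging it. The following is a sketch under stated assumptions: ensureOffscreenDocument, the offscreen.html path, and the choice of the WORKERS reason are all illustrative, and the two interfaces are trimmed-down views of chrome.runtime and chrome.offscreen so the logic can be exercised outside a browser.

```typescript
// Minimal structural views of the Chrome APIs used below. In an extension,
// pass `chrome.runtime` and `chrome.offscreen` themselves.
type RuntimeLike = {
  getURL(path: string): string;
  getContexts(filter: { contextTypes: string[]; documentUrls: string[] }): Promise<unknown[]>;
};
type OffscreenLike = {
  createDocument(params: { url: string; reasons: string[]; justification: string }): Promise<void>;
};

// Tracks an in-flight createDocument() call so concurrent callers wait on it.
let creatingOffscreen: Promise<void> | null = null;

// Hypothetical helper: creates the offscreen document only if none exists.
async function ensureOffscreenDocument(
  runtime: RuntimeLike,
  offscreen: OffscreenLike,
  path = "offscreen.html", // assumed file name inside the extension package
): Promise<void> {
  const url = runtime.getURL(path);
  const existing = await runtime.getContexts({
    contextTypes: ["OFFSCREEN_DOCUMENT"],
    documentUrls: [url],
  });
  if (existing.length > 0) return; // already open, nothing to do

  if (creatingOffscreen === null) {
    creatingOffscreen = offscreen.createDocument({
      url: path,
      reasons: ["WORKERS"], // assumption: pick the reason that matches your use
      justification: "Run on-device model inference outside the service worker.",
    });
  }
  try {
    await creatingOffscreen;
  } finally {
    creatingOffscreen = null;
  }
}
```

Once the document exists, the worker can forward text with chrome.runtime.sendMessage and the offscreen page can reply with the classification. An extension can only have one offscreen document open at a time, which is why the existence check comes first.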

The backend choice

Transformers.js resolves its backend at runtime. In the browser the default is WASM, using SIMD and threads where the environment allows; as of v3, WebGPU is opt-in through the device option. The performance gap between the two can be large. For a model like Xenova/distilbert-base-uncased-finetuned-sst-2-english, WASM SIMD on a recent laptop runs in roughly the 30-80ms range per inference. WebGPU on the same hardware can drop that to single-digit milliseconds for the same model, and the gap widens dramatically for larger models. ONNX Runtime’s WebGPU benchmarks show 5x to 20x speedups depending on the operator mix.
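Because WebGPU availability depends on both the browser and the machine, a small probe before loading the model avoids hard failures. This is a sketch, assuming Transformers.js v3’s device option; pickDevice is a hypothetical helper and NavigatorLike is a trimmed-down view of the real navigator object.

```typescript
// Trimmed-down view of `navigator` for WebGPU detection. The real
// navigator.gpu.requestAdapter() resolves to a GPUAdapter or null.
type GpuLike = { requestAdapter(): Promise<object | null> };
type NavigatorLike = { gpu?: GpuLike };

// Hypothetical helper: prefer WebGPU when an adapter is actually available,
// otherwise fall back to the WASM backend.
async function pickDevice(nav: NavigatorLike): Promise<"webgpu" | "wasm"> {
  if (!nav.gpu) return "wasm"; // API not exposed in this context
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm"; // null means no compatible adapter
  } catch {
    return "wasm"; // treat any probing failure as "no GPU"
  }
}
```

The result can be passed straight through as the device option when creating the pipeline, for example { device: await pickDevice(navigator) }, so the same build works on machines with and without a usable GPU.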

The catch is browser support. WebGPU shipped in Chrome 113 in 2023, but Firefox only enabled it by default in 2025, starting on Windows, and Safari only turned it on by default with Safari 26 later that year. For an extension, this is less of a problem since you can target Chromium only and rely on the WebGPU API being present. The Hugging Face example does not force WebGPU, which is the right default; it keeps working on machines without a compatible adapter.

Quantization is the other lever. The default ONNX exports on the Hub for Transformers.js-compatible models usually ship in two flavors: a full fp32 version and a quantized int8 version. The quantized models are roughly 4x smaller and run noticeably faster on WASM backends, usually with a small accuracy drop. For an extension where the model is bundled in the package, the size difference matters: a 25MB extension feels different from a 100MB one on install.
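The 4x figure is just bytes per weight. As a back-of-the-envelope sketch, with a hypothetical 25-million-parameter model chosen only to match the package sizes mentioned above:

```typescript
// Storage cost per weight for the common export precisions.
const BYTES_PER_WEIGHT = { fp32: 4, fp16: 2, int8: 1 } as const;
type WeightDtype = keyof typeof BYTES_PER_WEIGHT;

// Rough size of the weights alone, in megabytes (10^6 bytes).
function estimateWeightsMB(parameterCount: number, dtype: WeightDtype): number {
  return (parameterCount * BYTES_PER_WEIGHT[dtype]) / 1e6;
}

// Hypothetical model with 25 million parameters.
const fp32SizeMB = estimateWeightsMB(25e6, "fp32"); // 100
const int8SizeMB = estimateWeightsMB(25e6, "int8"); // 25
```

Real exports come out somewhat larger than this estimate, since the file also holds the graph definition and quantized models typically keep some tensors at higher precision.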

CSP, bundling, and the long path to a working build

The practical hurdles in shipping this are not in the inference code but in the build pipeline. Transformers.js loads its WASM binaries from a CDN by default, which Manifest V3’s content security policy will reject. You have to host the ONNX Runtime WASM files inside the extension and point the library at them with env.backends.onnx.wasm.wasmPaths. The extension pages’ policy also needs 'wasm-unsafe-eval' in script-src before WebAssembly will compile at all. The Transformers.js docs cover the path settings in the “Use custom models” section, and it is the single most common stumbling block.
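Both halves of that fix are small. The sketch below is illustrative rather than canonical: the manifest fragment is written as an object a build script could merge into manifest.json, useBundledWasm is a hypothetical helper, the wasm/ directory name is an assumption, and OnnxEnvLike is a structural view of the library’s env object.

```typescript
// Hypothetical manifest fragment. 'wasm-unsafe-eval' allows extension pages
// to compile WebAssembly under Manifest V3.
const manifestCspFragment = {
  content_security_policy: {
    extension_pages: "script-src 'self' 'wasm-unsafe-eval'; object-src 'self'",
  },
};

// Structural view of the part of Transformers.js `env` that is touched here.
type OnnxEnvLike = { backends: { onnx: { wasm: { wasmPaths?: string } } } };

// Hypothetical helper: point ONNX Runtime Web at WASM files shipped in the
// extension package instead of the default CDN location.
function useBundledWasm(
  onnxEnv: OnnxEnvLike,
  getURL: (path: string) => string, // in an extension: chrome.runtime.getURL
  wasmDir = "wasm/", // assumed output directory; keep the trailing slash
): void {
  onnxEnv.backends.onnx.wasm.wasmPaths = getURL(wasmDir);
}
```

This has to run before the first pipeline is created, since the runtime resolves its binaries when the first session is initialized. The bundler also needs to copy the ONNX Runtime .wasm files (and their loader scripts, where the version ships them) into the same directory.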

Model loading raises the same question. By default Transformers.js fetches from huggingface.co. Inside an extension you can either keep that behavior, which typically means adding the host to host_permissions, or bundle the model files directly and set env.allowRemoteModels = false with env.localModelPath pointing at the local copy. Bundling is the right call for anything you want to work offline, but it means using something like Webpack or Vite with a copy plugin to pull the ONNX files into the build output.
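The bundled route can be sketched as follows. useBundledModels and bundledModelFiles are hypothetical helpers, ModelEnvLike is a structural view of the library’s env object, and the file list is an assumption based on the usual Hub layout for a quantized export, so check it against the model you actually ship.

```typescript
// Structural view of the Transformers.js `env` settings involved.
type ModelEnvLike = {
  allowRemoteModels: boolean;
  allowLocalModels: boolean;
  localModelPath: string;
};

// Hypothetical helper: serve models only from the extension package.
function useBundledModels(modelEnv: ModelEnvLike, modelsUrl: string): void {
  modelEnv.allowRemoteModels = false; // never fall back to huggingface.co
  modelEnv.allowLocalModels = true; // local lookup is off by default in browsers
  modelEnv.localModelPath = modelsUrl; // e.g. chrome.runtime.getURL("models/")
}

// Files the copy step would need to place under localModelPath for one model.
// Assumed layout; exports differ, so verify against the repository on the Hub.
function bundledModelFiles(modelId: string): string[] {
  return [
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "onnx/model_quantized.onnx",
  ].map((file) => `${modelId}/${file}`);
}
```

With these settings a missing file shows up as a load error at startup instead of a silent network fetch, which makes an incomplete copy step easy to spot during development.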

There is a reference template repository that gets these details right, which is worth starting from rather than reinventing the manifest fields.

Where this slots into the broader picture

It is worth comparing this to the alternatives. Chrome itself is pushing built-in AI APIs backed by an on-device Gemini Nano. Those are gated behind flags, available only to a small set of origins, and not yet usable in extensions in any portable way. The browser-vendor approach will eventually be faster and lighter, since the model is shared across all sites, but the timeline is years, not months, and the model selection is whatever Google ships.

Transformers.js is the path that works today, for any model already converted to ONNX, with no permission gating and no vendor lock-in. The cost is package size and the build complexity above. For an extension that needs to run a specific fine-tuned classifier or a translation model the browser does not provide, it is the only realistic option.

There is also a category of extension that has emerged around this stack: local-first AI tools that explicitly do not send your text to a server. WebLLM takes the same approach for larger LLMs using WebGPU compute shaders, and there are now extensions that run 1B-3B parameter models entirely in the browser for tasks like summarization. They are slow on consumer hardware and the model load times are punishing, but they exist and they work.

What to take away

The Hugging Face post is a useful starting point, but the interesting work is in the parts it hand-waves: keeping the service worker alive, choosing between WASM and WebGPU, configuring CSP for the WASM artifacts, and deciding whether to bundle the model or fetch it on first use. Once you have those settled, the inference code itself is six lines. If you have been holding off on adding ML features to an extension because the browser-as-a-runtime story felt unfinished, it is closer to ready than it was a year ago. The infrastructure around ONNX Runtime Web and WebGPU has matured to the point where small classifiers, embedding models, and even modest generative models are workable inside the MV3 sandbox.