Gemma 4 on Your Phone: The Inference Stack That Makes It Work

Google’s AI Edge Gallery app landing on the App Store with Gemma 4 support prompted a lively Hacker News thread, and the reaction was a mix of genuine excitement and the usual skepticism about on-device LLM demos. That reaction is fair. We have had on-device inference for a while now, between llama.cpp, MLX, and various iOS wrappers. But the story here is less about whether a phone can run a language model and more about the specific stack Google has assembled to make this feel like a real product rather than a benchmark.

The Gemma Family and Why Small Matters

Gemma began as Google’s answer to a simple question: what if the open-weight model was designed from the start to run efficiently, rather than being a shrunken version of something larger? Gemma 1 launched in February 2024 with 2B and 7B variants. Gemma 2 followed in June 2024 with a redesigned architecture, interleaved local and global attention, and a training regime that distilled knowledge into the smaller variants from larger teacher models. By Gemma 3, released in March 2025, the 4B and larger variants had 128K context windows (the 1B was limited to 32K) and the 4B was genuinely multimodal.

Gemma 4 continues this trajectory, with model sizes calibrated for mobile hardware rather than being afterthoughts. The key architectural decision across the Gemma line is that the smaller models are not just cut-down versions of bigger ones; they are trained with a different objective and benefit from knowledge distillation. When you run Gemma 4’s smallest variant on an iPhone, you are not running something that was originally designed to live on a data center GPU and got quantized down until it barely fit.

LiteRT: The Runtime Doing the Heavy Lifting

The execution environment matters as much as the model architecture. Google AI Edge Gallery runs on LiteRT, which Google rebranded from TensorFlow Lite in late 2024 as part of consolidating its on-device AI stack under the Google AI Edge umbrella. LiteRT is not just a lighter TensorFlow; it is a purpose-built runtime for constrained inference with a delegate architecture that lets it hand off computation to hardware accelerators.

On iPhone, LiteRT can use the Core ML delegate, which routes computation through Apple’s Core ML framework and, for supported operations, down to the Neural Engine. The Neural Engine in recent iPhones, from the A17 Pro onward, is well suited to the matrix multiplications that dominate transformer inference. Apple quotes 35 trillion operations per second for the A18’s Neural Engine. Whether that figure translates directly to token generation throughput depends heavily on model shape and quantization format, but it is a real hardware advantage.

Alternatively, LiteRT can use its GPU delegate backed by Metal, trading some peak efficiency for broader hardware compatibility. In practice, for Gemma 4’s smaller variants, the Neural Engine path on recent iPhones delivers noticeably better tokens-per-second with lower battery draw.

Quantization: Where the Numbers Actually Get Interesting

Running a multi-billion parameter model on a phone requires quantization, and the choices made here have more impact than almost anything else. LiteRT supports INT8 and INT4 weight quantization. Moving from FP16 to INT4 cuts memory footprint by 75%, which is the difference between a model fitting in 8GB of DRAM and not fitting at all.
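The arithmetic behind that 75% figure is easy to check. The sketch below assumes an illustrative 4B-parameter model and counts only the raw weight bits; real bundles come out somewhat larger because of quantization scales and any tensors kept at higher precision.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring scales and metadata."""
    return num_params * bits_per_weight / 8 / 1e9

PARAMS = 4e9  # illustrative 4B-parameter model (assumption)

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_footprint_gb(PARAMS, bits):.1f} GB")

# Relative saving when moving from 16-bit to 4-bit weights.
reduction = 1 - weight_footprint_gb(PARAMS, 4) / weight_footprint_gb(PARAMS, 16)
print(f"FP16 -> INT4 reduction: {reduction:.0%}")
```

At FP16 the weights alone would fill the entire 8GB of a current iPhone, leaving nothing for the OS, the app, or the KV cache.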

For Gemma 4’s architecture, Google uses a mixed-precision scheme: weights are stored at INT4, but activations remain at higher precision for the layers where numerical range matters most. This is similar to the GPTQ and AWQ approaches that the open-source community developed for running LLaMA models locally, but integrated into the LiteRT export pipeline rather than requiring a separate quantization step.
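To make the weight side of this concrete, here is a minimal sketch of symmetric, group-wise 4-bit quantization in plain Python. It is a generic illustration of the idea and not Google’s actual export scheme: the group size, the function names, and the sample weights are all assumptions chosen for readability.

```python
def quantize_int4(weights, group_size=4):
    """Symmetric per-group quantization to integer codes in [-8, 7].

    Returns a list of (scale, codes) pairs, one per group.
    """
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude to 7; use 1.0 for an all-zero group.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        codes = [max(-8, min(7, round(w / scale))) for w in group]
        groups.append((scale, codes))
    return groups

def dequantize_int4(groups):
    """Reconstruct floating-point weights from scales and integer codes."""
    return [scale * code for scale, codes in groups for code in codes]

weights = [0.12, -0.50, 0.33, 0.07, 1.20, -0.90, 0.05, 0.60]  # made-up values
groups = quantize_int4(weights)
restored = dequantize_int4(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {max_error:.4f}")
```

Each group carries its own scale, so a single large weight only degrades the precision of its neighbors rather than the whole tensor. The rounding error per weight is bounded by half the group’s scale, which is why smaller groups trade extra metadata for better accuracy.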

The MediaPipe LLM Inference API, which powers the chat interface in AI Edge Gallery, accepts models in .task format, which bundles the quantized weights, tokenizer, and inference graph together. This bundled format means the app can load a model without requiring the user to manage separate files, which matters for the product experience even if it means less flexibility for experimentation.

How This Compares to the Alternatives

llama.cpp remains the most capable open-source on-device inference engine. Its Metal backend on Apple Silicon is well-optimized, and the community around it has produced 4-bit quants for essentially every popular open-weight model. On an iPhone, llama.cpp-based apps like LLM Farm can run Llama 3.2 3B or Gemma 3 4B at usable speeds. The rough baseline on an iPhone 15 Pro is somewhere between 15 and 30 tokens per second for a 4B INT4 model, depending on context length and prompt structure.

Apple’s own MLX framework takes a different approach, targeting the unified memory architecture of Apple Silicon directly without going through Core ML. MLX is excellent on Mac but its iOS story is more limited; it does not get Neural Engine access the way Core ML does.

What Google’s stack offers that llama.cpp does not is a production-quality packaging story. The .task bundle format, the in-app model downloader, and the MediaPipe API surface are designed for app developers who want to ship an AI feature, not researchers who want to benchmark inference engines. This matters if you are building something on top of it.

The Memory Constraint Nobody Talks About

The iPhone 16 Pro ships with 8GB of RAM. The iPhone 16 base model has 8GB as well, up from 6GB in the iPhone 15. A Gemma 4 4B model at INT4 quantization requires roughly 2.5GB for the weights alone, before accounting for the KV cache. With a 128K context window, the KV cache can consume several additional gigabytes depending on how much context is actually loaded.
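A back-of-the-envelope estimate shows how quickly the KV cache grows. The model shape below (32 layers, 4 key/value heads, head dimension 128, FP16 cache) is a hypothetical stand-in, not Gemma 4’s published configuration, and the formula assumes every layer caches the full context; layers that use local sliding-window attention would need less.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(layers=32, kv_heads=4, head_dim=128)
GIB = 1024 ** 3

for tokens in (8_192, 32_768, 131_072):
    size = kv_cache_bytes(tokens, **SHAPE)
    print(f"{tokens:>7} tokens: {size / GIB:.2f} GiB")
```

Under these assumptions each cached token costs 64 KiB, so a full 128K-token context would need 8 GiB for the cache alone, more than the phone has in total.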

In practice, AI Edge Gallery manages this by limiting context length in the mobile configuration. You are not getting the full 128K window on a phone. The app sets a maximum context size that keeps peak memory usage within the bounds of what iOS will allow before triggering the jetsam memory pressure killer. On an iPhone 16, this typically means a context window in the range of 8K to 16K tokens for a 4B model, which is still genuinely useful for most tasks.
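The same reasoning can be run in reverse to see where a context cap of that size might come from. This is a rough sketch, not the app’s actual logic: the 4 GiB per-app budget, the 0.5 GiB runtime overhead, and the 64 KiB per-token cache cost are all assumed values, and the weight size is rounded to 2.5 GiB.

```python
GIB = 1024 ** 3

def max_context_tokens(budget_bytes, weight_bytes, overhead_bytes, kv_bytes_per_token):
    """Largest token count whose KV cache fits in what is left of the budget."""
    free = budget_bytes - weight_bytes - overhead_bytes
    if free <= 0:
        return 0
    return free // kv_bytes_per_token

tokens = max_context_tokens(
    budget_bytes=4 * GIB,         # assumed per-app limit before jetsam intervenes
    weight_bytes=int(2.5 * GIB),  # INT4 weights for a 4B model, rounded
    overhead_bytes=GIB // 2,      # assumed runtime and activation overhead
    kv_bytes_per_token=64 * 1024, # assumed cache cost per token
)
print(tokens)
```

With these inputs the budget leaves 1 GiB for the cache, which works out to 16,384 tokens. The actual limit iOS enforces varies by device and system state, so a shipping app would pick a conservative cap rather than computing one this tightly.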

The situation is better on iPad Pro models with the M-series chips, which can have 16GB or more of unified memory. On an M4 iPad Pro with 16GB, a Gemma 4 4B model can run with a much larger effective context window, and the inference speed is substantially higher.

Why Running On-Device Still Matters

The easy dismissal of on-device LLMs is that cloud inference is faster, cheaper per token, and gives access to larger models. That is true for any given request. The reason on-device remains worth pursuing comes down to three things: latency, privacy, and offline capability.

For latency, a round trip to a cloud API adds 200 to 800 milliseconds of network overhead before the first token arrives, even on a good connection. On-device generation can start in under 100ms. For interactive applications like autocomplete, real-time translation, or UI assistance, this difference is perceptible.
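A simple model makes the trade-off visible: total response time is the delay before the first token plus the generation time. The numbers below are illustrative assumptions. The delays sit inside the ranges quoted above, the on-device rate is in line with typical phone speeds for a 4B INT4 model, and the cloud rate of 80 tokens per second is a made-up figure.

```python
def response_time_s(first_token_delay_s, tokens, tokens_per_s):
    """Total time to receive `tokens` tokens: startup delay plus generation time."""
    return first_token_delay_s + tokens / tokens_per_s

# Illustrative assumptions, not measurements.
DEVICE = dict(first_token_delay_s=0.1, tokens_per_s=20)
CLOUD = dict(first_token_delay_s=0.5, tokens_per_s=80)

for tokens in (5, 10, 50, 200):
    on_device = response_time_s(tokens=tokens, **DEVICE)
    cloud = response_time_s(tokens=tokens, **CLOUD)
    winner = "on-device" if on_device < cloud else "cloud"
    print(f"{tokens:>4} tokens: on-device {on_device:.2f}s, cloud {cloud:.2f}s -> {winner}")
```

Under these assumptions the on-device path finishes first only for outputs of about ten tokens or fewer, which is exactly the regime of autocomplete and short UI responses. For long generations the cloud’s higher throughput wins on total time, even though the first token still appears later.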

For privacy, a model running entirely on-device means the input never leaves the hardware. This matters for medical note-taking apps, private journaling, legal document review, and any context where sending data to a third-party server is a compliance or trust problem.

For offline capability, cloud inference is simply unavailable without a connection. On-device models work on a plane, in a basement, or in any connectivity-constrained environment.

None of these arguments are new. What is new is that the models are now good enough that the on-device compromise is acceptable for a broader range of tasks. Running a model small enough to fit on a phone two years ago would have meant accepting noticeably worse reasoning than the cloud alternatives. The gap has narrowed considerably.

The Bigger Picture

Google shipping an official iOS app that runs its open-weight models on-device is not just a technical demo. It is a signal about where Google thinks the on-device AI market is going, and it puts a stake in the ground against Apple Intelligence, which is built on Apple’s proprietary on-device models and tightly integrated into the OS rather than exposed as a developer API.

The AI Edge Gallery is not that integration. It is a showcase. But the MediaPipe LLM Inference API underneath it is a documented, stable API that third-party iOS developers can build on. Google has made Gemma weights openly available under the Gemma Terms of Use, which permit commercial use subject to Google’s prohibited-use policy. Combining open weights with a production-ready mobile inference SDK creates a path for developers who want on-device AI capabilities without signing up for Apple’s developer program restrictions or waiting for Apple to expose Neural Engine access through a general-purpose API.

There is a real ecosystem forming here. The models are good enough, the hardware is capable enough, and the tooling has matured enough that on-device AI on mobile is moving from “possible in a demo” to “shippable in a product.” Gemma 4 on iPhone is one more data point in that direction.