
What It Actually Takes to Run Gemma 4 Offline on an iPhone

Source: hackernews

Google recently demonstrated Gemma 4 running natively on iPhone with full offline inference, meaning no server calls, no network round-trips, and no API keys. The headline is real and the capability is genuinely useful. The more interesting story, though, is what the implementation actually looks like under the hood, what tradeoffs Google made to ship this, and how it fits into a landscape where several competing stacks are already fighting for the same hardware.

The Gemma 4 Model and Why Size Matters Here

Gemma 4 follows Google’s pattern from Gemma 3, which shipped variants at 1B, 4B, 12B, and 27B parameters. The smaller sizes, particularly 1B and 4B, were explicitly designed with edge deployment in mind. On a phone, the upper bound is set by available RAM. The iPhone 15 Pro and iPhone 16 series carry 8GB of unified memory, shared between the CPU, GPU, and Neural Engine. A 4B parameter model in FP16 would consume roughly 8GB just for weights, leaving no headroom for the operating system, the KV cache, or anything else. INT4 quantization is what makes this workable: a 4B model at INT4 is approximately 2GB, which leaves enough room to actually run inference without the device paging aggressively.

The Gemma 4 1B variant is the practical target for most iPhones. At INT4, its weights take roughly 0.5GB, so it fits comfortably on devices with 6GB of RAM, which covers a much larger share of the installed iPhone base than the 8GB models alone.
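The arithmetic in this section is just parameter count times bits per weight. A minimal sketch, assuming decimal gigabytes and ignoring quantization scales, higher-precision embeddings, and file overhead (the function name is illustrative, not part of any SDK):

```swift
import Foundation

/// Approximate size of a model's weights in decimal gigabytes.
/// Assumption: every weight is stored at the same bit width.
func weightFootprintGB(parameters: Double, bitsPerWeight: Double) -> Double {
    let bytes = parameters * bitsPerWeight / 8.0
    return bytes / 1_000_000_000.0
}

let fourB = 4_000_000_000.0
let oneB = 1_000_000_000.0

print(weightFootprintGB(parameters: fourB, bitsPerWeight: 16)) // 4B at FP16 → 8.0
print(weightFootprintGB(parameters: fourB, bitsPerWeight: 4))  // 4B at INT4 → 2.0
print(weightFootprintGB(parameters: oneB, bitsPerWeight: 4))   // 1B at INT4 → 0.5
```

Real bundles come out somewhat larger than these figures, since the weights are not the only thing in the file.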

How Google Actually Ships This: MediaPipe’s LLM Inference API

Google’s path to running Gemma on iOS goes through MediaPipe’s LLM Inference task, not llama.cpp, not Core ML, and not Apple’s own frameworks. MediaPipe packages models into a .task bundle, a Flatbuffer-based container that includes the quantized weights and the SentencePiece tokenizer in a single file. The inference engine itself is Google’s own runtime, layered on top of Metal for GPU access on iOS.

The Swift integration is straightforward:

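import MediaPipeTasksGenAI

// modelPath is assumed to be the on-disk path of a downloaded .task bundle.
let options = LlmInference.Options(modelPath: modelPath)
options.maxTokens = 1024   // prompt plus generated tokens
options.topk = 40
options.temperature = 0.8

let llm = try LlmInference(options: options)

// Streaming output: the first closure receives partial results,
// the second runs once generation has finished.
try llm.generateResponseAsync(inputText: prompt) { partialResult, error in
    if let error = error {
        print("Generation failed: \(error)")
        return
    }
    guard let token = partialResult else { return }
    print(token, terminator: "")
} completion: {
    print()
}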

The maxTokens parameter sets the KV cache size at initialization, and it covers the prompt and the generated output combined. This is a design decision that trades flexibility for memory predictability: you declare your maximum sequence length upfront, and the runtime allocates accordingly. This matters on a constrained device where dynamic allocation can cause visible latency spikes.
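To get a feel for what that upfront allocation costs, here is a rough sketch of the standard KV cache size formula. The model shape below is hypothetical and is not Gemma's published configuration; the sketch also assumes an FP16 cache with no sliding-window or cache-quantization tricks, which MediaPipe's runtime may or may not use.

```swift
import Foundation

/// Hypothetical transformer shape, for illustration only.
struct ModelShape {
    let layers: Int
    let kvHeads: Int
    let headDim: Int
}

/// Bytes needed for a KV cache preallocated for `maxTokens` positions.
/// The factor of 2 accounts for one key tensor and one value tensor per layer.
func kvCacheBytes(shape: ModelShape, maxTokens: Int, bytesPerElement: Int) -> Int {
    return 2 * shape.layers * shape.kvHeads * shape.headDim * maxTokens * bytesPerElement
}

let shape = ModelShape(layers: 26, kvHeads: 4, headDim: 256) // assumed values
let bytes = kvCacheBytes(shape: shape, maxTokens: 1024, bytesPerElement: 2)
print(Double(bytes) / 1_048_576.0) // size in MiB → 104.0
```

The cost grows linearly with maxTokens, so doubling the declared sequence length doubles the reservation whether or not a conversation ever uses it.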

Models are distributed through Kaggle’s model hub, and Google maintains sample iOS apps in the mediapipe-samples repository under examples/llm_inference/ios.

The ANE Gap

Here is where the implementation diverges from what Apple does internally with Apple Intelligence. The Apple Neural Engine is a dedicated matrix multiply accelerator, and it is the fastest path for transformer inference on Apple Silicon. But Apple does not expose the ANE through any public API that third-party developers can use directly for arbitrary models. Core ML is the official public interface, and Core ML does route operations to the ANE, but with significant constraints on quantization schemes, operator support, and scheduling priority.
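For context, the public control Core ML offers here is a compute-unit preference on the model configuration. The sketch below assumes you already have a compiled .mlmodelc at some URL (the function name and parameter are hypothetical). Setting the preference does not guarantee ANE execution: Core ML still decides per operation where the work runs.

```swift
import Foundation
#if canImport(CoreML)
import CoreML

/// Loads a compiled Core ML model while asking Core ML to prefer the
/// CPU and Neural Engine over the GPU. This is a request, not a guarantee.
@available(iOS 16.0, macOS 13.0, tvOS 16.0, watchOS 9.0, *)
func loadPreferringNeuralEngine(compiledModelURL: URL) throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine
    return try MLModel(contentsOf: compiledModelURL, configuration: config)
}
#endif
```

Operations the ANE cannot handle fall back to the CPU, which is one reason quantization scheme and operator coverage matter so much on this path.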

Apple Intelligence uses a private, tightly integrated stack that can schedule directly against the ANE hardware. Third-party frameworks, including MediaPipe, cannot do this. MediaPipe on iOS runs on the GPU via Metal, which is fast, but not as fast as ANE-optimized inference. Rough benchmarks from mid-2025 for a 2B INT4 model on iPhone 15 Pro suggest:

  • MediaPipe (Metal GPU): approximately 15 to 25 tokens per second
  • llama.cpp (Metal backend): approximately 20 to 30 tokens per second
  • Apple Intelligence (private ANE stack): estimated 30 to 50 tokens per second for Apple’s own 3B model

MLC LLM, which uses Apache TVM to compile models ahead of time into Metal kernels, is another competitor in this space. The MLC LLM project supports Gemma and achieves similar Metal GPU throughput to llama.cpp, sometimes faster depending on the model architecture and the quality of the compiled kernels.

The Existing iOS LLM Ecosystem

Gemma 4 on iPhone is a milestone, but it is not the first local LLM on the platform. The llama.cpp project has supported iOS compilation for years, and apps like LLM Farm have been running GGUF models locally on iPhones since 2023. The GGUF format is notably more flexible than MediaPipe’s .task bundle: any model whose architecture llama.cpp supports can be converted and run, which covers Llama 3, Mistral, Phi, Gemma, and dozens of fine-tunes.

The main thing MediaPipe offers that llama.cpp does not is Google’s official distribution channel and Google’s official support story. For a developer building a product, “Google ships this, here is the SDK” is a meaningfully different situation from “compile llama.cpp for arm64, handle your own model distribution, and debug Metal shader compilation failures yourself.” The ergonomics matter.

Meta’s ExecuTorch is another emerging option. It offers a Core ML backend and can therefore reach the ANE through the public API, which gives it a potential performance edge over Metal-only runtimes for models that fit within Core ML’s supported operator set. The toolchain is still early, but the architectural advantage of ANE access is real.

What Offline Actually Means in Practice

The privacy argument for on-device inference is stronger than it often gets credit for. When inference runs locally, your prompts never leave the device. For consumer applications, that is a comfort; for enterprise use cases involving sensitive data, it can be a compliance requirement. The offline capability also has practical value in low-connectivity situations: airplane mode, spotty rural coverage, corporate networks that block external AI API endpoints.

The latency story is mixed. A local 1B model at 30 to 40 tokens per second feels responsive for conversational use. Compared to a cloud-hosted 70B model with 200ms network latency and server-side processing overhead, the local model is often faster to first token. It is not competitive on quality for complex reasoning tasks, but it does not need to be for the use cases where it fits.
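A toy model makes the tradeoff concrete. The 30 tokens per second and the 200ms network latency come from the paragraph above; the local prefill delay, the extra server-side overhead, and the cloud decode rate are assumptions picked only for illustration.

```swift
import Foundation

/// Simple two-phase latency model: a fixed delay before the first token,
/// then steady-state decoding at a constant rate.
struct LatencyProfile {
    let firstTokenSeconds: Double
    let tokensPerSecond: Double

    /// Seconds until a reply of `tokens` tokens has fully arrived.
    func fullReplySeconds(tokens: Int) -> Double {
        return firstTokenSeconds + Double(tokens) / tokensPerSecond
    }
}

// Local: 30 tok/s is from the text; the 0.25 s prefill delay is an assumption.
let local = LatencyProfile(firstTokenSeconds: 0.25, tokensPerSecond: 30)
// Cloud: 0.2 s network latency is from the text; the additional 0.3 s of
// server-side overhead and the 60 tok/s decode rate are assumptions.
let cloud = LatencyProfile(firstTokenSeconds: 0.2 + 0.3, tokensPerSecond: 60)

for tokens in [5, 150] {
    print(tokens,
          local.fullReplySeconds(tokens: tokens),
          cloud.fullReplySeconds(tokens: tokens))
}
```

Under these assumed numbers the local model finishes short replies first and the cloud model finishes long ones first, which is the sense in which the latency story is mixed.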

The Competitive Dynamics

Google pushing Gemma 4 onto Apple’s hardware through its own inference stack is an interesting strategic move. It keeps Google’s model family relevant on the platform that Apple Intelligence otherwise dominates by default. Developers who want locally running AI on iPhone do not have to reach for a third-party GGUF runtime; they can use a Google-provided SDK with Google’s models.

Apple’s response, if there is one beyond what Apple Intelligence already does, will likely involve better public Core ML support for INT4 quantized transformer models and potentially expanded ANE access for specific operator patterns. The coremltools library already supports quantization via coremltools.optimize, but the coverage of LLM-specific operators and the tooling around sequence generation is less mature than MediaPipe or llama.cpp.

For now, Google has a working, documented, SDK-backed path for running Gemma 4 offline on iPhone. It runs on the GPU rather than the Neural Engine, which leaves some performance on the table. The model quality at the 1B to 4B scale is genuinely useful for summarization, code completion, and conversational tasks that do not require deep reasoning. The distribution story through Kaggle and the MediaPipe SDK is cleaner than rolling your own llama.cpp pipeline. For developers targeting on-device AI on iOS, the ecosystem just got a more credible option.
