Gemma 4 on iPhone: The Quantization and Runtime Stack Behind Offline Inference

Source: hackernews

The news that Gemma 4 runs natively on iPhone with full offline inference will read as a headline to most people and as an engineering question to anyone who has tried to shove a transformer into a mobile process. The question worth spending time on is not whether the demo works but what it takes to make the demo work, and what changes once the inference is on-device rather than over a network call.

Which Model, Exactly

Gemma 4, released in April 2026, is a multimodal model family from Google DeepMind spanning four weight classes: 1B, 4B, 12B, and 27B parameters. The multimodal variants add vision capability alongside text generation. When any source says a model “runs on iPhone,” the first clarifying question is which size.

A 4B parameter model in bfloat16 occupies roughly 8GB of memory. The iPhone 16 Pro has 8GB of unified memory total, which leaves no room for the OS, the app, the KV cache, or the model’s activation buffers. The 12B and 27B variants are not in contention for a phone at all in full precision. So the iPhone story is specifically about the 1B and 4B models, and it depends entirely on quantization.

INT4 quantization compresses each weight from 16 bits to 4 bits, cutting the raw parameter storage to roughly one quarter. A 4B model in INT4 lands around 2.0 to 2.5GB, which is comfortably runnable alongside a live app on an iPhone 15 Pro or later. The 1B model in INT4 sits under 1GB, which opens the door to older devices with less RAM.
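The arithmetic behind those figures is simple enough to sketch. The function below is an illustration, not part of any runtime; the 4.5 bits-per-weight figure is an assumed allowance for per-block scale metadata, and real checkpoints usually carry slightly more than their nominal parameter count.

```swift
import Foundation

/// Approximate storage for the weights alone, in gigabytes (10^9 bytes).
/// Ignores the KV cache, activations, and runtime overhead.
func weightFootprintGB(parameters: Double, bitsPerWeight: Double) -> Double {
    let bytes = parameters * bitsPerWeight / 8.0
    return bytes / 1e9
}

let fourB = 4e9  // nominal parameter counts, used here as round numbers
let oneB = 1e9

// 16 bits per weight: the bfloat16 baseline.
print(weightFootprintGB(parameters: fourB, bitsPerWeight: 16))   // 8.0
// 4 bits per weight: the raw INT4 payload, before any scale metadata.
print(weightFootprintGB(parameters: fourB, bitsPerWeight: 4))    // 2.0
// 4.5 bits per weight: an assumed figure that includes per-block scales.
print(weightFootprintGB(parameters: fourB, bitsPerWeight: 4.5))  // 2.25
print(weightFootprintGB(parameters: oneB, bitsPerWeight: 4.5))   // 0.5625
```

The gap between the 2.0GB raw payload and the 2.0 to 2.5GB range quoted above is the room taken by scale factors and by any tensors kept at higher precision.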

What INT4 Quantization Actually Does

The compression is real but not free. Quantizing to INT4 introduces rounding error: a 16-bit format such as bfloat16 has 65,536 possible bit patterns, while INT4 has 16. How the original values are mapped onto those 16 levels determines how much task quality degrades.

Naive round-to-nearest quantization degrades quality noticeably at INT4, which is why modern quantization methods use calibration. GPTQ (Generative Pre-Trained Transformer Quantization) minimizes the layer-wise reconstruction error using a small calibration dataset, adjusting weights so that the quantized outputs approximate the original outputs rather than just quantizing values independently. AWQ (Activation-aware Weight Quantization) takes a related approach, scaling weights according to the activation magnitudes that each channel will encounter during inference, preserving precision where it matters most.
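To see why the naive approach struggles, here is a minimal sketch of symmetric round-to-nearest quantization with a single scale for the whole vector. It is the baseline that GPTQ and AWQ improve on, not an implementation of either; the function names and the sample weights are made up for illustration.

```swift
import Foundation

/// Symmetric round-to-nearest quantization of a weight vector to 4-bit integers.
/// Naive baseline: one scale for the whole vector and no calibration data.
func quantizeInt4(_ weights: [Float]) -> (codes: [Int8], scale: Float) {
    let maxAbs = weights.map { abs($0) }.max() ?? 0
    // Map the largest magnitude to 7 so codes fit in the signed range -8...7.
    let scale: Float = maxAbs > 0 ? maxAbs / 7 : 1
    let codes = weights.map { w -> Int8 in
        let q = (w / scale).rounded()
        return Int8(max(-8, min(7, q)))
    }
    return (codes, scale)
}

/// Reconstructs approximate weights from the 4-bit codes.
func dequantize(_ codes: [Int8], scale: Float) -> [Float] {
    codes.map { Float($0) * scale }
}

/// Mean squared error between the original and reconstructed weights.
func meanSquaredError(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).map { ($0 - $1) * ($0 - $1) }.reduce(0, +) / Float(a.count)
}

// Five small weights and one outlier (hypothetical values).
let weights: [Float] = [0.02, -0.01, 0.03, -0.04, 0.01, 0.7]
let (codes, scale) = quantizeInt4(weights)
let restored = dequantize(codes, scale: scale)
print(codes)
print(meanSquaredError(weights, restored))
```

Because the outlier sets the scale, every small weight rounds to code 0 and only the outlier survives. Calibrated methods exist to avoid exactly this kind of loss.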

For a 4B-class model such as Gemma 4's 4B variant, the quality gap between 16-bit weights and well-calibrated INT4 is measurable on benchmarks but generally not noticeable for practical tasks: summarization, instruction following, short-context generation. For a 1B model, the quantization penalty is more significant because there is less redundancy to absorb the precision loss, but the 1B class was never a reasoning powerhouse in the first place.

The serialization format matters alongside the quantization scheme. Google’s deployment path through LiteRT uses a FlatBuffer-based format with block-wise quantization. The GGUF format popularized by llama.cpp offers multiple INT4 variants (Q4_K_M, Q4_K_S, Q4_0) that differ in how they handle block scaling factors, trading file size against quality. Both approaches end up in the same weight class at inference time, but the runtime that consumes the format varies.
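Block scaling is easier to reason about with a small example. The sketch below quantizes each block of weights with its own scale and computes the storage cost that the extra scales add. The block sizes and the 16-bit scale are assumptions chosen for illustration; a 32-weight block with one 16-bit scale is the layout llama.cpp's Q4_0 uses, while the K-quant variants and LiteRT's format are organized differently.

```swift
import Foundation

/// Block-wise symmetric INT4: each block of `blockSize` weights gets its own scale.
/// Returns the dequantized weights so the error can be compared against the input.
func blockwiseRoundTrip(_ weights: [Float], blockSize: Int) -> [Float] {
    var out: [Float] = []
    out.reserveCapacity(weights.count)
    for start in stride(from: 0, to: weights.count, by: blockSize) {
        let block = weights[start..<min(start + blockSize, weights.count)]
        let maxAbs = block.map { abs($0) }.max() ?? 0
        let scale: Float = maxAbs > 0 ? maxAbs / 7 : 1
        for w in block {
            // Round to the nearest 4-bit level, then map back to a float.
            let q = max(-8, min(7, (w / scale).rounded()))
            out.append(q * scale)
        }
    }
    return out
}

/// Effective storage cost when every block carries one scale of `scaleBits` bits.
func bitsPerWeight(blockSize: Int, scaleBits: Int = 16) -> Double {
    4.0 + Double(scaleBits) / Double(blockSize)
}

print(bitsPerWeight(blockSize: 32))   // 4.5
print(bitsPerWeight(blockSize: 128))  // 4.125

// Same hypothetical weights, quantized as one block and as two blocks of three.
let sample: [Float] = [0.02, -0.01, 0.03, -0.04, 0.01, 0.7]
print(blockwiseRoundTrip(sample, blockSize: 6))
print(blockwiseRoundTrip(sample, blockSize: 3))
```

With one block, the outlier flattens the small weights to zero. With two blocks, the first three weights get a scale of their own and survive the round trip. Smaller blocks buy accuracy at the price of more scale metadata, which is the trade the format variants are making.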

The Runtime: Google AI Edge and LiteRT

Google has been building toward on-device ML deployment under the AI Edge umbrella for several years. TensorFlow Lite was rebranded to LiteRT (Lite Runtime) and given a more framework-agnostic interface that accepts models converted from PyTorch via AI Edge Torch, from JAX, and from TensorFlow SavedModel. The rebranding signals intent: this is not TensorFlow’s mobile experiment, it is a standalone runtime meant to outlive the TensorFlow ecosystem.

For LLM inference specifically, the MediaPipe LLM Inference API sits on top of LiteRT and handles the scaffolding specific to autoregressive generation: KV-cache allocation and reuse, top-k and top-p sampling, streaming token output via callbacks, and session management for multi-turn conversations. The API is available on both Android and iOS, which is what makes the iPhone story clean. You are not reverse-engineering Apple’s toolchain; you are running Google’s inference stack on top of it.
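For readers who have not implemented sampling, the following sketch shows what top-k and temperature do to a set of logits. It is a generic illustration, not MediaPipe's implementation; the function names are hypothetical and top-p filtering is left out for brevity.

```swift
import Foundation

/// Turns raw logits into a sampling distribution restricted to the `k` most likely tokens.
/// Returns (index, probability) pairs sorted by descending probability.
func topKDistribution(logits: [Double], k: Int, temperature: Double) -> [(index: Int, probability: Double)] {
    precondition(!logits.isEmpty && k > 0 && temperature > 0)
    // Keep the k largest logits.
    let top = logits.enumerated()
        .sorted { $0.element > $1.element }
        .prefix(k)
    // Softmax over the survivors, with temperature scaling and max-subtraction for stability.
    let maxLogit = top.first?.element ?? 0
    let weights = top.map { exp(($0.element - maxLogit) / temperature) }
    let total = weights.reduce(0, +)
    return zip(top, weights).map { (index: $0.offset, probability: $1 / total) }
}

/// Draws one token index from the distribution using a uniform random number in [0, 1).
func sample(_ dist: [(index: Int, probability: Double)], uniform u: Double) -> Int {
    var cumulative = 0.0
    for entry in dist {
        cumulative += entry.probability
        if u < cumulative { return entry.index }
    }
    return dist.last!.index  // guards against rounding leaving u just above the final sum
}

let dist = topKDistribution(logits: [2.0, 1.0, 0.5, -1.0], k: 2, temperature: 1.0)
let next = sample(dist, uniform: Double.random(in: 0..<1))
print(dist, next)
```

Lower temperatures sharpen the distribution toward the top token; a larger k lets less likely tokens back in.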

A minimal integration in Swift looks roughly like this:

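import MediaPipeTasksGenAI

// Option and method names vary between MediaPipe Tasks GenAI releases;
// check them against the version you have pinned.
guard let modelPath = Bundle.main.path(
    forResource: "gemma-4-4b-it-int4",
    ofType: "bin"
) else {
    fatalError("Model file is missing from the app bundle")
}

let options = LlmInferenceOptions()
options.modelPath = modelPath
options.maxTokens = 1024   // token budget for the session
options.topk = 40          // sample only from the 40 most likely tokens
options.temperature = 0.8  // below 1.0 sharpens the distribution

let inference = try LlmInference(options: options)

let prompt = "Summarize the following note in two sentences: ..."
try inference.generateResponseAsync(
    inputText: prompt
) { partialResult, error in
    if let error = error {
        print("Generation failed: \(error)")
        return
    }
    guard let token = partialResult else { return }
    print(token, terminator: "")  // stand-in for streaming the token to the UI
}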

The KV-cache size grows with context length, so the maxTokens parameter also controls how much memory the session holds beyond the model weights themselves.
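A back-of-the-envelope estimate makes that growth concrete. The configuration below is a placeholder, not Gemma 4's published architecture: the layer count, KV-head count, and head dimension are assumptions chosen only to show how the formula scales.

```swift
import Foundation

/// Hypothetical attention configuration. These numbers are placeholders for illustration,
/// not Gemma 4's published architecture.
struct AttentionConfig {
    var layers: Int
    var kvHeads: Int
    var headDim: Int
    var bytesPerElement: Int  // 2 for a 16-bit cache
}

/// Bytes needed to hold keys and values for `tokens` positions across all layers.
func kvCacheBytes(_ c: AttentionConfig, tokens: Int) -> Int {
    // The factor of 2 covers one key tensor and one value tensor per layer.
    2 * c.layers * c.kvHeads * c.headDim * c.bytesPerElement * tokens
}

let config = AttentionConfig(layers: 32, kvHeads: 8, headDim: 128, bytesPerElement: 2)
for tokens in [1024, 4096, 8192] {
    let mib = Double(kvCacheBytes(config, tokens: tokens)) / 1_048_576
    print("\(tokens) tokens: \(mib) MiB")
}
```

Under these assumed dimensions the cache costs 128 KiB per token, so it takes 128 MiB at 1024 tokens and a full 1 GiB at 8192. Architectures that use sliding-window layers or a quantized cache need less, but the linear dependence on context length is the same.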

Apple Silicon and Unified Memory

The A16, A17 Pro, and A18 chips that span iPhone 14 Pro through iPhone 16 all include a dedicated Neural Engine with substantial matrix multiplication throughput. The Neural Engine is optimized for low-precision operations, which aligns well with INT4 and INT8 inference on transformer layers.

The architectural advantage worth noting is unified memory. On conventional desktop hardware, the GPU has discrete VRAM and the CPU has system RAM; loading a model for GPU inference means copying weights across a bus. On Apple Silicon, the CPU, GPU, and Apple Neural Engine (ANE) all address the same physical memory pool. A 2.5GB model loaded by the CPU runtime can be the same physical allocation that the ANE and Metal shaders read during inference, with no cross-bus copy and no double-buffering of weight tensors. This reduces peak memory pressure and eliminates a category of latency that desktop GPU inference has to budget for.

How well LiteRT leverages the ANE versus the Metal GPU versus the CPU depends on op-by-op delegation. Attention layers and large matrix multiplications are the natural candidates for the ANE or Metal; operations with unsupported data layouts may fall back to the CPU. The overall tokens-per-second throughput you see in practice is a function of how efficiently the runtime partitions the model graph across these backends.
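The idea of delegation with fallback can be shown with a toy model. Everything here is hypothetical: the op names, the support sets, and the greedy policy are invented for illustration and do not describe how LiteRT or any Apple framework actually partitions a graph.

```swift
import Foundation

enum Backend: String { case ane, gpu, cpu }

/// A hypothetical op description: a name plus the backends that can execute it.
struct Op {
    let name: String
    let supported: Set<Backend>
}

/// Assigns each op to the first backend in `preference` that supports it, falling back to CPU.
func partition(_ ops: [Op], preference: [Backend]) -> [(op: String, backend: Backend)] {
    ops.map { op in
        let choice = preference.first { op.supported.contains($0) } ?? .cpu
        return (op: op.name, backend: choice)
    }
}

/// Counts how often consecutive ops land on different backends; each switch implies a handoff.
func backendSwitches(_ plan: [(op: String, backend: Backend)]) -> Int {
    zip(plan, plan.dropFirst()).filter { $0.backend != $1.backend }.count
}

let graph = [
    Op(name: "matmul_qkv", supported: [.ane, .gpu, .cpu]),
    Op(name: "attention", supported: [.gpu, .cpu]),
    Op(name: "gather", supported: [.cpu]),
    Op(name: "matmul_out", supported: [.ane, .gpu, .cpu]),
]
let plan = partition(graph, preference: [.ane, .gpu])
for step in plan { print(step.op, "->", step.backend.rawValue) }
print("switches:", backendSwitches(plan))
```

The switch count is the interesting number. A plan that bounces between backends on every op can lose more to synchronization than it gains from the faster hardware, which is why throughput depends on the partitioning and not only on peak compute.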

Other Routes to the Same Destination

Google AI Edge is not the only path to on-device iPhone inference. Three others are worth comparing.

llama.cpp supports Metal acceleration on iOS and macOS via its ggml-metal.m backend. You can run GGUF-format Gemma models through llama.cpp on iPhone, but the developer story requires more integration work. There is no polished iOS SDK; you either wrap the C API yourself or use a third-party Swift wrapper.

MLC LLM from the CMU team uses Apache TVM to compile model graphs into device-specific kernels, including optimized Metal shaders for Apple hardware. Their approach generates code at compile time rather than relying on a generic runtime, which can yield better per-device throughput at the cost of a larger build pipeline. For 4B-class models on iPhone 15 Pro, published MLC benchmarks have shown 20 to 40 tokens per second, which is fast enough for interactive use.

Apple Intelligence, announced in 2024 and expanded since, runs Apple's own on-device models, and the Foundation Models framework now exposes the on-device language model to third-party apps. What it does not offer is a choice of model: the weights are Apple's, tuned for Apple's use cases (writing tools, summarization, system integrations), wrapped in Apple's guardrails, and available only on devices and OS versions that support Apple Intelligence. Gemma 4 via Google AI Edge fills the space that leaves open: a general-purpose instruction-following model you control, in your own app, with open weights you can swap, fine-tune, or ship unchanged on Android.

Developer Implications

The practical friction for shipping a Gemma 4 app on iOS is not the inference code; it is the model file. A 4B INT4 model weighs around 2.5GB. The App Store asks users to confirm cellular downloads above 200MB, and bundling a multi-gigabyte file inflates every install, so in practice a model this size is delivered separately: as an on-demand resource via NSBundleResourceRequest or through a first-launch download flow. The 1B INT4 variant at under 1GB is more tractable for distribution, at the cost of capability.

The offline-first property is more practically valuable than it might appear at first. An app that calls a cloud inference API fails in airplane mode, in low-coverage areas, and during API outages. Building inference locally removes network availability from the failure surface entirely. For applications where AI features are core rather than supplemental, that reliability difference matters.

The storage and distribution overhead is the remaining engineering problem. Once a model is cached on device, the inference latency for a 1B or 4B INT4 model on modern Apple Silicon is competitive with or better than a cloud API round-trip for many use cases, particularly short prompts where network latency dominates total response time.
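A two-term latency model shows where the crossover sits. All of the numbers below are assumptions picked for illustration: a local startup cost of 0.1 seconds at 30 tokens per second, against a cloud startup cost of 0.8 seconds (connection, queueing, time to first token) at 80 tokens per second.

```swift
import Foundation

/// Seconds until the full response is available, under a simple two-term model:
/// a fixed startup cost (network round trip or local prefill) plus steady-state decoding.
func responseSeconds(startup: Double, outputTokens: Int, tokensPerSecond: Double) -> Double {
    startup + Double(outputTokens) / tokensPerSecond
}

// Assumed figures, not measurements.
func onDevice(_ n: Int) -> Double {
    responseSeconds(startup: 0.1, outputTokens: n, tokensPerSecond: 30)
}
func cloud(_ n: Int) -> Double {
    responseSeconds(startup: 0.8, outputTokens: n, tokensPerSecond: 80)
}

for n in [20, 100, 400] {
    print("\(n) tokens: on-device \(onDevice(n)) s, cloud \(cloud(n)) s")
}
```

With these inputs the on-device path wins for a 20-token reply and loses for a 400-token one, with the break-even at roughly 34 output tokens. The exact crossover moves with the assumptions, but the shape of the trade does not: fixed network cost dominates short responses, decode speed dominates long ones.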

What Changes When Models Fit on Phones

Gemma 4 running on iPhone is not a breakthrough in isolation. It is the current position of a trend that has been moving for two years: the smallest usable general-purpose instruction-following models keep shrinking in memory footprint while maintaining enough quality for real tasks. Each step in that progression expands the set of applications that can run inference without a server.

For Google, supporting on-device iPhone deployment is also a distribution play. Gemma running inside iOS apps puts the model in contexts that Apple Intelligence does not reach and that OpenAI's API can serve only with a network connection. Open weights plus a maintained mobile runtime is a different kind of ecosystem presence than an API endpoint.