Google’s Gemma 4 can now run entirely on an iPhone without a network connection, according to a report making the rounds on Hacker News. This is genuinely impressive, and the 294 upvotes and 180 comments the story collected suggest the community agrees it is worth paying attention to. But the framing of “native iPhone AI” glosses over a hardware access story that explains why open-weight models consistently trail Apple Intelligence on identical silicon, regardless of how good the model weights actually are.
The Gemma lineage and what Gemma 4 represents
To understand why this matters, it helps to trace where Gemma 4 comes from. Google released Gemma 1 in February 2024 with 2B and 7B parameter variants, positioning them as lightweight open-weight models built from the same research and technology as Gemini. Gemma 2 arrived in June 2024 in 9B and 27B sizes, with a 2B variant following shortly after. It introduced interleaved local and global attention layers and used knowledge distillation from larger models to lift the smaller variants’ quality beyond what raw parameter count would suggest.
Gemma 3 launched in early 2025 with a 1B model explicitly designed for on-device deployment, alongside 4B, 12B, and 27B variants. The 4B and larger sizes supported multimodal inputs and a 128K context window, which was ambitious for edge hardware; the 1B model was text-only with a 32K window. Gemma 4 continues this trajectory. Where earlier iterations focused on establishing the architecture and quantization pipeline, Gemma 4 appears to push the edge-deployment story further, with the iPhone now receiving first-class support alongside Android.
The on-device inference work across all of these releases runs through the same foundation: Google’s MediaPipe LLM Inference API, now backed by LiteRT (the rebranded version of TensorFlow Lite). This is the engineering infrastructure that actually makes “runs on iPhone” possible, and it is where the performance trade-offs live.
The runtime: LiteRT and Metal, not Core ML
When Google says Gemma 4 runs natively on iPhone, it means the model executes locally using Google’s own runtime stack. On iOS, MediaPipe’s LLM Inference task routes computation to the GPU via Metal. The Swift integration looks roughly like this:
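```swift
import MediaPipeTasksGenAI

// modelPath is the on-disk location of the bundled .task file.
let options = LlmInference.Options(modelPath: modelPath)
options.maxTokens = 1024
options.topK = 40
options.temperature = 0.8

let llmInference = try LlmInference(options: options)

// Streaming generation: the progress closure receives partial text as it is
// decoded; the completion closure runs once generation has finished.
try llmInference.generateResponseAsync(inputText: prompt, progress: { partial, error in
    if let error = error { print("Generation failed: \(error)"); return }
    if let text = partial { print(text, terminator: "") }
}, completion: {
    print()
})
```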
The CocoaPods dependencies are MediaPipeTasksGenAI and MediaPipeTasksGenAIC. Swift Package Manager support is available through Google’s official package index. The model itself is distributed as a .task file, a FlatBuffer bundle containing quantized weights, tokenizer vocabulary, and inference metadata in a single artifact. For a model in the 1B to 4B parameter range at 4-bit quantization, this bundle is typically 600 MB to 2 GB.
Critically, this runtime does not use Core ML or Apple’s Neural Engine. Google’s stack routes work to Metal, which means it runs on the GPU portion of the A-series chip. The Apple Neural Engine, present in every iPhone since the A11 Bionic and heavily expanded in each subsequent generation, sits idle during Gemma inference.
Why the Neural Engine matters for token throughput
The Apple Neural Engine (ANE) in the A17 Pro delivers approximately 35 trillion operations per second. More importantly, it is specifically designed for the matrix-multiply and activation patterns that dominate transformer inference. The on-device models behind Apple Intelligence route through Core ML and can saturate the ANE, which is why Apple’s system model achieves the throughput it does on A17 and later devices.
LLM inference at decode time is memory-bandwidth-bound, not compute-bound: generating each token requires streaming essentially all of the model’s weights from memory. The A17 Pro’s unified memory provides roughly 51 GB/s of bandwidth, which the GPU shares with the rest of the chip. The ANE has dedicated, higher-efficiency pathways to unified memory that are not exposed to third-party developers. Apple has not opened an API for third-party code to dispatch workloads directly to the ANE at the level of granularity that would let an LLM inference runtime take full advantage of it. Core ML can route there, but Core ML imposes its own model format requirements and doesn’t accommodate the kind of dynamic, speculative execution patterns that modern LLM runtimes use.
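The bandwidth-bound claim can be made concrete with a back-of-the-envelope estimate: if each decoded token streams roughly the full set of weights from memory once, then bandwidth divided by weight bytes gives an upper bound on tokens per second. The sketch below is a rough model, not a benchmark. The function name and the 50 GB/s input are illustrative assumptions, and the model ignores KV-cache reads and activation traffic.

```swift
/// Upper bound on decode speed when every token must stream all weights
/// from memory once. Ignores KV-cache and activation traffic, so real
/// throughput will be lower than this ceiling.
func decodeCeilingTokensPerSecond(
    parameterCount: Double,
    bitsPerWeight: Double,
    bandwidthBytesPerSecond: Double
) -> Double {
    let weightBytes = parameterCount * bitsPerWeight / 8.0
    return bandwidthBytesPerSecond / weightBytes
}

// Illustrative inputs (assumptions, not measurements):
// a 4B-parameter model at 4 bits per weight and 50 GB/s of usable bandwidth.
let ceiling = decodeCeilingTokensPerSecond(
    parameterCount: 4e9,
    bitsPerWeight: 4,
    bandwidthBytesPerSecond: 50e9
)
print(ceiling) // 25.0
```

The same function shows why weight precision matters so much on mobile: halving the bits per weight doubles the ceiling, while extra compute does nothing for it.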
Some community developers have converted Gemma 2 weights to Core ML format to get partial ANE access, and the speed improvement is measurable, but the conversion path loses some of the optimization work that Google’s LiteRT pipeline applies, and it is not an officially supported route. The MLC-LLM project takes yet another angle, using Apple Metal directly but with a more aggressive compilation strategy that can extract better GPU utilization than MediaPipe’s current implementation. MLC-LLM’s Gemma benchmarks on A17 Pro show decode speeds roughly 20 to 30 percent higher than MediaPipe for equivalent quantization levels, which suggests the Metal path still has headroom that the official tooling hasn’t fully claimed.
Quantization: what INT4 is buying and what it costs
Gemma 4 on iPhone uses 4-bit integer quantization. The weights are stored as INT4 values, while activations remain in FP16 during computation. This reduces the model’s storage footprint by approximately 75 percent compared to full FP16 weights and reduces the memory bandwidth required during inference by a similar factor.
For a 2B parameter model, INT4 quantization brings the weight storage to roughly 1 GB, fitting comfortably within the 6 to 8 GB of unified memory on current iPhones while leaving room for the OS, app, and the key-value cache needed for longer contexts. A 4B model at INT4 lands around 2 GB.
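The storage figures above follow from simple arithmetic on parameter count and bits per weight. A minimal sketch, assuming decimal gigabytes and counting the weights alone (the helper name is hypothetical; per-group scale factors, the tokenizer, and the KV cache all add to the real footprint):

```swift
/// Approximate storage for the weights alone, in gigabytes (10^9 bytes).
/// Excludes per-group scale factors, the tokenizer, and the KV cache.
func weightFootprintGB(parameterCount: Double, bitsPerWeight: Double) -> Double {
    return parameterCount * bitsPerWeight / 8.0 / 1e9
}

let models: [(name: String, parameters: Double)] = [("2B", 2e9), ("4B", 4e9)]
for model in models {
    let fp16 = weightFootprintGB(parameterCount: model.parameters, bitsPerWeight: 16)
    let int4 = weightFootprintGB(parameterCount: model.parameters, bitsPerWeight: 4)
    let savingPercent = (1.0 - int4 / fp16) * 100.0
    print("\(model.name): FP16 \(fp16) GB, INT4 \(int4) GB, saving \(savingPercent)%")
}
```

For the 2B case this gives 4 GB at FP16 against 1 GB at INT4, which is the 75 percent reduction quoted earlier.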
The quality cost of INT4 quantization at this scale is real but bounded. Google uses a group quantization scheme where weights are quantized in small blocks (typically 32 to 128 weights per group), with each group having its own scale factor. This substantially reduces the quantization error compared to naive per-tensor quantization. The result is that a well-quantized 2B INT4 model is competitive with a full-precision model of slightly smaller effective capacity, though not identical to the FP16 original.
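To illustrate why per-group scales help, here is a minimal sketch of symmetric group-wise 4-bit quantization. It is a generic textbook scheme, not Google’s actual LiteRT pipeline; the type and function names are hypothetical, and the codes are stored one per `Int8` for readability rather than packed two per byte.

```swift
/// One quantized group: signed 4-bit codes (held in Int8 for clarity)
/// plus a single scale shared by every weight in the group.
struct QuantizedGroup {
    let codes: [Int8]   // each value lies in -8...7
    let scale: Float
}

/// Symmetric group-wise quantization: split the weights into fixed-size
/// groups and give each group its own scale factor.
func quantizeGroupwise(_ weights: [Float], groupSize: Int) -> [QuantizedGroup] {
    precondition(groupSize > 0, "groupSize must be positive")
    var groups: [QuantizedGroup] = []
    var start = 0
    while start < weights.count {
        let end = min(start + groupSize, weights.count)
        let slice = weights[start..<end]
        // Choose the scale so the largest magnitude in the group maps to code 7.
        let maxMagnitude = slice.map { abs($0) }.max() ?? 0
        let scale: Float = maxMagnitude > 0 ? maxMagnitude / 7 : 1
        let codes = slice.map { w -> Int8 in
            let q = (w / scale).rounded()
            return Int8(max(-8, min(7, q)))
        }
        groups.append(QuantizedGroup(codes: codes, scale: scale))
        start = end
    }
    return groups
}

/// Reconstruct approximate weights from codes and per-group scales.
func dequantize(_ groups: [QuantizedGroup]) -> [Float] {
    return groups.flatMap { g in g.codes.map { Float($0) * g.scale } }
}

/// Largest absolute difference between original and reconstructed weights.
func maxAbsError(_ original: [Float], _ restored: [Float]) -> Float {
    return zip(original, restored).map { abs($0 - $1) }.max() ?? 0
}

// Four small weights followed by four large ones.
let weights: [Float] = [0.875, 0.25, -0.5, 0.125, 7.0, -3.0, 2.0, 0.0]
let perGroup = quantizeGroupwise(weights, groupSize: 4)
let perTensor = quantizeGroupwise(weights, groupSize: weights.count)
print("per-group max error: ", maxAbsError(weights, dequantize(perGroup)))
print("per-tensor max error:", maxAbsError(weights, dequantize(perTensor)))
```

With a single scale for the whole tensor, the large weights dictate the step size and the small weights are rounded coarsely (a maximum error of 0.5 in this toy input). With one scale per group of four, the small weights get their own finer step and this particular input is reconstructed exactly. Real weight distributions will not be this tidy, but the effect is the same in kind.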
At 1B parameters, INT4 models on mobile are genuinely constrained. Simple question answering, summarization, and short text generation work well. Multi-step reasoning, complex instruction following, and tasks requiring broad factual recall expose the limitations of both the model size and the precision loss more readily. The 4B variant navigates this more gracefully and is likely the more practical target for most real-world iOS applications.
The performance picture on current hardware
For context, here is a rough comparison of on-device LLM throughput on current iPhone hardware at INT4 quantization, drawing from published benchmarks and community measurements:
| Model | Size | Runtime | Decode speed (iPhone 16 Pro) |
|---|---|---|---|
| Apple Intelligence | ~3B | Core ML + ANE | ~35-40 tok/s |
| Gemma 4 (4B) | 4B | MediaPipe / Metal | ~12-18 tok/s |
| Gemma 4 (1B) | 1B | MediaPipe / Metal | ~25-35 tok/s |
| Llama 3.2 3B | 3B | llama.cpp / Metal | ~15-20 tok/s |
| Phi-3.5 Mini | 3.8B | llama.cpp / Metal | ~12-16 tok/s |
Apple Intelligence is roughly two to three times faster at decode than a comparable open-weight model on the same device, and the primary explanation is not model quality or architecture. It is the ANE access that third-party developers cannot get.
Thermal throttling compounds this. Sustained inference pushes the GPU hard and generates heat, and A-series chips throttle within five to ten minutes of continuous load. Apple Intelligence, running partially on the ANE’s more efficient compute path, throttles less aggressively under sustained use.
What this means for developers building on-device AI features
For developers considering on-device LLM integration, Gemma 4 on iPhone is a meaningful option that did not exist in this form two years ago. A fully offline model that ships with the app, requires no API keys, and handles user data without a network hop is architecturally appealing for a wide range of features: local summarization, writing assistance, offline classification, private conversational interfaces.
The practical ceiling is the decode throughput. At 12 to 18 tokens per second for a 4B model, responses arrive visibly word by word but without the snappiness users expect from cloud API calls that can exceed 100 tokens per second. Streaming output with a typing indicator mitigates this perceptually, and first-token latency matters more than sustained throughput for most conversational use cases.
Prefill performance, the time to process the input prompt before generation starts, scales roughly with sequence length and is fast for short prompts but degrades noticeably for long context inputs. Asking Gemma 4 to summarize a document by pasting its full text into the prompt will have a perceptible pause before the first word appears.
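The split between prefill and decode can be turned into a rough latency budget. The sketch below assumes prefill time grows linearly with prompt length, which is a simplification, and uses a hypothetical prefill rate of 200 tokens per second since the benchmarks above only cover decode. The 15 tokens per second decode rate is the midpoint of the 4B range quoted earlier; all names are illustrative.

```swift
/// Rough latency model for one response: prefill processes the whole prompt
/// before the first token appears, then decode emits tokens one at a time.
struct LatencyEstimate {
    let timeToFirstToken: Double   // seconds
    let totalTime: Double          // seconds
}

func estimateLatency(
    promptTokens: Int,
    outputTokens: Int,
    prefillTokensPerSecond: Double,
    decodeTokensPerSecond: Double
) -> LatencyEstimate {
    let prefill = Double(promptTokens) / prefillTokensPerSecond
    let decode = Double(outputTokens) / decodeTokensPerSecond
    return LatencyEstimate(timeToFirstToken: prefill, totalTime: prefill + decode)
}

// Assumed rates: 200 tok/s prefill (hypothetical), 15 tok/s decode.
let shortPrompt = estimateLatency(
    promptTokens: 50, outputTokens: 150,
    prefillTokensPerSecond: 200, decodeTokensPerSecond: 15
)
let longPrompt = estimateLatency(
    promptTokens: 4000, outputTokens: 150,
    prefillTokensPerSecond: 200, decodeTokensPerSecond: 15
)
print("short prompt: first token after \(shortPrompt.timeToFirstToken) s, done after \(shortPrompt.totalTime) s")
print("long prompt:  first token after \(longPrompt.timeToFirstToken) s, done after \(longPrompt.totalTime) s")
```

Under these assumptions a 50-token prompt starts responding in a quarter of a second, while a 4,000-token document waits 20 seconds before the first word. Measuring the real prefill rate on target devices is worth doing before committing to long-context features.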
The right mental model is not “this is like a cloud LLM that happens to run locally.” It is a capable but constrained inference engine that works well for use cases scoped to fit the hardware. That scoping is a real constraint, but the privacy and latency benefits of full offline inference are correspondingly real as well.
The broader open-device AI question
The interesting systemic tension here is between Google’s increasingly strong open-weight model strategy and Apple’s control over the highest-performance inference paths on its own hardware. Google benefits when capable open models run well on iPhones, because this positions Android’s more open ecosystem as a natural companion: the same models run on Android with more flexible hardware access, Google’s own counterparts to Core ML (which it can shape freely, since it controls the OS), and emerging support for dedicated on-device NPU paths.
Apple’s restriction of ANE access to Core ML is not irrational. Opening direct ANE dispatch to arbitrary code would create a large attack surface and complicate Apple’s ability to manage thermal, memory, and battery behavior. But it does mean that third-party LLMs are permanently disadvantaged relative to Apple’s own models on Apple’s own hardware, regardless of how good those models become.
For now, Gemma 4’s offline inference on iPhone is a real capability worth paying attention to. It’s fast enough for practical use, it fits in device memory, and the MediaPipe integration is production-quality. The gap to Apple Intelligence isn’t about model quality. It’s about which parts of the chip Google is allowed to use.