Gemma 4's Per-Layer Embeddings and What They Mean for On-Device Multimodal AI
Source: huggingface
Google’s Gemma 4 landed on Hugging Face this week with four model variants, full multimodal support across image, video, and audio, and an architectural pattern in its smallest models that deserves more attention than a standard release post can give it.
The headline benchmark numbers are strong. The 31B dense model scores 89.2% on AIME 2026 and 80.0% on LiveCodeBench v6, reaching a Codeforces ELO of 2150 and an LMArena score of roughly 1452. The 26B A4B mixture-of-experts variant matches it closely at 88.3% AIME and 77.1% LiveCodeBench while activating only 4B parameters per forward pass. These are competitive figures at their respective weight classes. But the technically interesting story in this release is the E-series: E2B with 2.3B effective parameters and E4B with 4.5B, both targeting on-device deployment with a transformer modification called Per-Layer Embeddings.
What Per-Layer Embeddings Actually Does
In a standard transformer, the embedding table is consulted once at the input layer. A token ID maps to a dense vector, that vector enters the residual stream, and all subsequent layers process context-dependent representations derived from it. Token identity is baked in at the start; by the time you reach deeper layers, the residual stream has accumulated so much contextual information that the original token signal is thoroughly diluted.
Per-Layer Embeddings (PLE) adds a parallel conditioning pathway. Each decoder layer receives its own dedicated lower-dimensional embedding vector for the current token, computed from a combination of an embedding lookup (the token-identity component) and a learned projection of the main embedding (the context-aware component). This vector modulates the hidden state via a lightweight residual block applied after attention and the feed-forward network.
The effect is a kind of dual-channel conditioning. The main residual stream carries what has happened so far in the sequence, while the per-layer embedding says, at every layer, here is what this token is without regard to context. Deeper layers in a standard transformer lose direct access to token identity; PLE restores it.
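The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not Google's implementation: the sizes, the `tanh` gate, the `1/sqrt(2)` scaling, and every name (`ple_table`, `ple_proj`, `w_gate`, `w_up`) are hypothetical, and the attention and feed-forward steps are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; none of these are Gemma 4's real dimensions.
VOCAB, D_MODEL, D_PLE, N_LAYERS = 1000, 64, 16, 4

# Main embedding table, consulted once at the input.
main_embed = rng.normal(size=(VOCAB, D_MODEL)) * 0.02
# Per-layer table: one small vector per (token, layer) pair.
ple_table = rng.normal(size=(VOCAB, N_LAYERS, D_PLE)) * 0.02
# Learned projection of the main embedding into every layer's PLE space.
ple_proj = rng.normal(size=(D_MODEL, N_LAYERS * D_PLE)) * 0.02
# Weights of the small residual block in each layer (assumed shape).
w_gate = rng.normal(size=(N_LAYERS, D_MODEL, D_PLE)) * 0.02
w_up = rng.normal(size=(N_LAYERS, D_PLE, D_MODEL)) * 0.02


def per_layer_inputs(token_ids):
    """Combine the table lookup with the projected main embedding."""
    lookup = ple_table[token_ids]  # (seq, layers, d_ple)
    projected = (main_embed[token_ids] @ ple_proj).reshape(
        len(token_ids), N_LAYERS, D_PLE
    )
    return (lookup + projected) / np.sqrt(2.0)


def ple_residual(hidden, ple, layer):
    """Gate hidden down to d_ple, modulate by the PLE vector, project up, add."""
    gated = np.tanh(hidden @ w_gate[layer])  # (seq, d_ple)
    return hidden + (gated * ple[:, layer, :]) @ w_up[layer]


token_ids = np.array([5, 42, 7])
hidden = main_embed[token_ids]
ple = per_layer_inputs(token_ids)
for layer in range(N_LAYERS):
    # A real decoder layer would run attention and the FFN here first.
    hidden = ple_residual(hidden, ple, layer)

print(ple.shape, hidden.shape)
```

The point of the sketch is the data flow: `ple` is computed once from the token IDs and then indexed by layer, so every layer sees a token-specific signal that does not pass through the residual stream.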
For multimodal inputs, positions occupied by image or audio tokens receive neutral per-layer signals via the pad token ID. The multimodal soft tokens merge into the embedding sequence before PLE is applied, so the architecture treats vision and audio positions as carrying no per-layer token-identity signal. This is a pragmatic choice: the relevant information for those positions comes from the vision or audio encoder outputs, not from the text vocabulary.
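One way to picture the pad-token substitution is as a masking step on the IDs used for the per-layer lookup. The sketch below assumes hypothetical values for `PAD_ID` and `IMAGE_SOFT_TOKEN_ID` and a hypothetical helper name; the real IDs and code path come from the model's tokenizer and implementation.

```python
import numpy as np

# Hypothetical IDs, chosen only for this example.
PAD_ID = 0
IMAGE_SOFT_TOKEN_ID = 999


def ids_for_per_layer_lookup(input_ids, soft_token_ids):
    """Replace multimodal placeholder positions with the pad ID."""
    input_ids = np.asarray(input_ids)
    is_multimodal = np.isin(input_ids, list(soft_token_ids))
    return np.where(is_multimodal, PAD_ID, input_ids)


sequence = [17, 999, 999, 999, 23, 8]
ple_ids = ids_for_per_layer_lookup(sequence, {IMAGE_SOFT_TOKEN_ID})
print(ple_ids.tolist())  # [17, 0, 0, 0, 23, 8]
```

Text positions keep their own IDs, while the three image positions all map to the same neutral row of the per-layer table.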
The parameter cost scales as vocabulary size times the per-layer embedding dimension times the number of decoder layers, since every layer gets its own vector for every token. For a model targeting 2.3B effective parameters, this is a deliberate allocation: spend parameters on per-layer specialization rather than on wider FFNs or additional attention heads.
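To get a feel for that scaling, the arithmetic can be worked through with placeholder numbers. The vocabulary size, per-layer dimension, and layer count below are assumptions picked for illustration, not published Gemma 4 values.

```python
def ple_table_params(vocab_size, ple_dim, num_layers):
    """Table entries: one ple_dim vector per token per layer."""
    return vocab_size * ple_dim * num_layers


def values_read_per_token(ple_dim, num_layers):
    """A lookup touches one row per layer, whatever the vocabulary size."""
    return ple_dim * num_layers


# Hypothetical configuration, for illustration only.
VOCAB_SIZE, PLE_DIM, NUM_LAYERS = 262_144, 256, 30

total = ple_table_params(VOCAB_SIZE, PLE_DIM, NUM_LAYERS)
touched = values_read_per_token(PLE_DIM, NUM_LAYERS)
print(f"table entries:         {total:,}")    # 2,013,265,920
print(f"values read per token: {touched:,}")  # 7,680
```

Under these assumed numbers the table is large, but each token reads only a tiny slice of it, which is the usual difference between lookup parameters and the matrix weights that every token multiplies through.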
Compounding Efficiency: KV Sharing and Dual RoPE
PLE is not the only efficiency mechanism in the E-series. Gemma 4’s smaller models also share the KV cache across the final num_kv_shared_layers layers: those layers reuse the key and value tensors computed by the last non-shared layer. For long-context inference, the KV cache is often the binding memory constraint on consumer hardware. Sharing it across layers reduces that footprint without reducing model depth or context length. Both E2B and E4B support 128k-token contexts, and KV sharing is a large part of what makes that practical on edge hardware.
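A back-of-the-envelope estimate shows why sharing matters at long context. The layer count, head configuration, and number of shared layers below are assumptions for illustration only, and the estimate ignores the further savings from sliding-window layers, which only need to cache their window.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_shared_layers,
                   num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size when the last num_kv_shared_layers layers store nothing."""
    layers_with_own_cache = num_layers - num_kv_shared_layers
    # Factor 2 covers keys plus values.
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_value
    return seq_len * layers_with_own_cache * per_token_per_layer


# Hypothetical configuration: 30 layers, 2 KV heads of size 256, bf16 cache.
CONFIG = dict(seq_len=131_072, num_layers=30, num_kv_heads=2, head_dim=256)

baseline = kv_cache_bytes(num_kv_shared_layers=0, **CONFIG)
shared = kv_cache_bytes(num_kv_shared_layers=10, **CONFIG)
print(f"no sharing:       {baseline / 2**30:.2f} GiB")  # 7.50 GiB
print(f"10 shared layers: {shared / 2**30:.2f} GiB")    # 5.00 GiB
```

With these placeholder numbers, sharing the cache across a third of the layers removes a third of the cache, which on a phone or laptop can decide whether a 128k context fits at all.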
Attention alternates between local sliding-window layers (512-token windows on small models) and global full-context layers, a pattern established in Longformer and carried forward through Gemma 2. Gemma 4 adds a Dual RoPE configuration on top of this: standard RoPE for sliding layers, proportional RoPE for global layers. Proportional RoPE scales position encodings relative to the full context length, which is the mechanism that allows generalization to the 128k window without training on every possible length at equal density.
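The difference between the two layer types comes down to the attention mask. The sketch below builds both masks with NumPy; it assumes the window counts the current token, which is a convention choice, and it does not attempt to model the RoPE variants.

```python
import numpy as np


def causal_mask(seq_len):
    """Global layer: each position attends to itself and everything before it."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def sliding_window_mask(seq_len, window):
    """Local layer: causal, limited to the most recent `window` positions."""
    idx = np.arange(seq_len)
    distance = idx[:, None] - idx[None, :]  # query index minus key index
    return (distance >= 0) & (distance < window)


local = sliding_window_mask(6, window=3)
full = causal_mask(6)
print(local.astype(int))
print("attended pairs, local vs global:", int(local.sum()), int(full.sum()))  # 15 21
```

In the local mask the number of attended keys per query stops growing once the window is full, so the cost of those layers is linear in sequence length, while the global layers keep the full quadratic pattern.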
The combination of PLE, shared KV, alternating attention, and Dual RoPE is coherent: each piece addresses a different constraint (parameter utilization, memory bandwidth, compute per step, context generalization), and they compose without obvious conflicts.
Audio Only in the Small Models
Gemma 4’s audio support appears only in the E-series variants, not in the 31B or 26B A4B models. The audio encoder follows a USM-style conformer architecture, the same base used in Gemma 3n, and handles speech transcription, audio question answering, and multimodal function calls that include audio input. The E4B scores 35.54 on CoVoST translation; the E2B reaches 33.47 with acknowledged hallucination issues.
The omission of audio from the large models stands out. One plausible explanation is training data scope: the audio encoder was trained on speech, and the models are not expected to handle music or other non-speech sounds at all. Pushing that encoder into a 31B model may not have met the quality bar Google set for the release. Alternatively, the large models are optimized for text and vision benchmarks where audio adds no signal, and the product decision was to keep the scope clean.
For anyone building voice-enabled applications on commodity hardware, the E4B variant is the most practical option in this family. For applications that require high-quality text reasoning with vision, the large variants offer considerably more headroom.
Comparing the Field
The small multimodal space has become competitive. Microsoft’s Phi-4-multimodal covers text, image, and speech/audio input in a 5.6B model using a mixture-of-LoRA approach for cross-modal specialization. Qwen2.5-VL covers image and video with strong benchmark results but no audio path. Meta’s Llama 3.2 vision series comes in 11B and 90B sizes but stops at image; audio support requires separate models.
Gemma 4 E4B’s 69.4% on MMLU Pro and 52.0% on LiveCodeBench v6 are respectable at 4.5B effective parameters, though Phi-4-multimodal and Qwen2.5-VL-7B score higher on several vision benchmarks at similar or larger parameter counts. The differentiation for Gemma 4 is the combination of factors: full audio support in the same model, Apache 2.0 licensing with no usage restrictions, day-zero support across llama.cpp, MLX, transformers.js, Mistral.rs, and ONNX, and configurable image token budgets that let developers trade quality against context space explicitly.
The Apache 2.0 license is not incidental. Meta’s Llama license, despite being labeled open, carries usage restrictions at scale. Qwen licenses vary by model version. For commercial deployments that need to move fast without legal review cycles, Apache 2.0 simplifies the calculus considerably.
The MoE Model’s Position
The 26B A4B variant applies mixture-of-experts to a different problem than the E-series. Rather than fitting capability into edge hardware, it provides near-31B quality at lower inference compute on server hardware. On MMLU Pro, the gap is 82.6% versus 85.2% for the 31B dense model; on AIME 2026, it is 88.3% versus 89.2%. The LMArena scores are 1441 and 1452, a difference that is close to noise in human evaluation.
For deployments where total model weight fits in memory but compute throughput is constrained, activating 4B parameters per forward pass versus 31B is a meaningful efficiency gain. MoE at this scale is established practice, demonstrated by Mixtral 8x7B and DeepSeek’s series, but the Gemma 4 numbers suggest Google has the routing calibrated well. The 256k context window (twice the E-series) also makes the large models more relevant for document-heavy workloads.
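The trade-off in that paragraph can be made concrete with rough arithmetic. The sketch below takes the nominal parameter counts from the model names (31B dense, 26B total with 4B active), assumes bf16 weights, and uses the common rule of thumb of about 2 FLOPs per active parameter per token, which ignores attention over the context.

```python
GIB = 2**30


def weight_gib(total_params, bytes_per_param=2):
    """Memory needed to hold all weights, bf16 by default."""
    return total_params * bytes_per_param / GIB


def flops_per_token(active_params):
    """Rule of thumb: roughly 2 FLOPs per active parameter per token."""
    return 2 * active_params


# Nominal sizes taken from the model names; real counts will differ slightly.
DENSE_TOTAL = DENSE_ACTIVE = 31e9
MOE_TOTAL, MOE_ACTIVE = 26e9, 4e9

compute_ratio = flops_per_token(DENSE_ACTIVE) / flops_per_token(MOE_ACTIVE)
memory_ratio = weight_gib(MOE_TOTAL) / weight_gib(DENSE_TOTAL)
print(f"dense / MoE compute per token: {compute_ratio:.2f}x")  # 7.75x
print(f"MoE / dense weight memory:     {memory_ratio:.2f}x")   # 0.84x
```

Under these assumptions the MoE variant needs almost as much memory as the dense model but close to an eighth of the per-token compute, which is why it suits servers where weights fit but throughput is the bottleneck.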
Using the E2B Model
The any-to-any pipeline in transformers provides the simplest entry point for the E-series:
    from transformers import pipeline

    pipe = pipeline("any-to-any", model="google/gemma-4-e2b-it")

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "path/to/image.jpg"},
                {"type": "text", "text": "What is in this image?"},
            ],
        }
    ]

    output = pipe(messages, max_new_tokens=200, return_full_text=False)
    print(output[0]["generated_text"])
For Apple Silicon, MLX support is available via mlx-vlm with TurboQuant at KV bits 3.5, which Google claims yields roughly a 4x speedup with accuracy close to full precision. The llama.cpp GGUF support was available on release day, and the ONNX checkpoints allow deployment through transformers.js for browser inference via WebGPU.
One deployment knob worth noting: image token budgets are configurable at 70, 140, 280, 560, or 1120 tokens per image. At 70 tokens against a 128k context window, you can fit over 1800 images in a single context; at 1120, the number drops to roughly 114 (117 if 128k is read as 131,072 tokens). These are upper bounds, since any text in the prompt also consumes context. For applications that need to compare many images, or for RAG systems that embed visual content, this budget is a real architectural control rather than just a quality dial.
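The image counts follow from integer division of the context window by the per-image budget. The helper below is a hypothetical illustration: `max_images` and `reserved_tokens` are names introduced here, and the two context sizes cover both common readings of 128k.

```python
BUDGETS = (70, 140, 280, 560, 1120)


def max_images(context_tokens, tokens_per_image, reserved_tokens=0):
    """Upper bound on images that fit after reserving room for text."""
    return (context_tokens - reserved_tokens) // tokens_per_image


# Both common readings of "128k".
table = {b: (max_images(128_000, b), max_images(131_072, b)) for b in BUDGETS}
for budget, (decimal_k, binary_k) in table.items():
    print(f"{budget:>5} tokens/image: {decimal_k:>5} or {binary_k:>5} images")
```

Passing a nonzero `reserved_tokens` for the system prompt and the question gives a more realistic ceiling than the raw division.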
The Bigger Picture
Gemma 4’s E-series architecture reads as a considered response to a specific engineering problem: how to pack genuine multimodal capability into 2-5B effective parameters without degrading to a model that is nominally multimodal but practically text-only. The Per-Layer Embeddings mechanism is the unusual piece, and it suggests that Google’s research team thinks direct per-layer token-identity conditioning is worth the parameter cost at small scales in a way that it isn’t at 31B.
Whether PLE’s benefits generalize or turn out to be a narrow win for the specific size regime where Gemma 4 E-series operates is something that will become clearer as the community evaluates these models on tasks beyond the standard benchmarks. The architecture is open, the weights are Apache 2.0, and the tooling support was coordinated for day-zero availability. The conditions for serious evaluation are in place.