· 7 min read ·

Autonomous Tool Use at the Edge: What the Gemma 4 VLA Demo Actually Shows

Source: huggingface

There is a specific moment in the NVIDIA Gemma 4 VLA demo worth paying attention to: the model decides on its own whether to photograph the scene before answering. There is no keyword matching, no explicit “look” command, no hardcoded branch in the code. The model reasons about the query, determines that visual context would be useful, and calls a tool. That is a different thing from a chatbot that happens to have a camera button.

This post is about what enables that pattern on a device that costs $249.

The Hardware Floor

The Jetson Orin Nano Super is NVIDIA’s entry-level edge AI module, released in early 2025. The 8GB variant delivers 67 TOPS of INT8 AI performance using an Ampere GPU with CUDA Compute Capability 8.7 (SM87). That SM87 designation matters for the build: llama.cpp needs to be compiled with -DCMAKE_CUDA_ARCHITECTURES="87" to generate the right PTX.

The module sits at the low end of the Orin family, but the “Super” refresh pushed it meaningfully above the previous Orin Nano. For running a quantized 5B-parameter multimodal model, 8GB of shared memory is tight. The demo works around this with an 8GB swap file and Q4_K_M quantization, which keeps the model weights in roughly 3.5GB while leaving room for the vision projector and KV cache.

Gemma 4 E2B: Architecture Choices That Matter Here

The model in question is Gemma 4 E2B, Google’s 5.1B-parameter multimodal model. The “E2B” stands for “Efficient 2B” — the effective active parameter count is 2.3B due to Per-Layer Embeddings (PLE), which amortizes parameter cost across depth rather than width. Total parameter count is 5.1B when you include the 150M vision encoder and 300M audio encoder.

A few architectural decisions make this viable at the edge:

Hybrid attention. Gemma 4 uses a combination of sliding window attention (512-token window) and global attention layers. Most tokens only attend locally, which reduces the memory bandwidth requirement for long contexts without abandoning the ability to reason across a full 128K-token window when needed. For an 8GB device running with a 2048-token context (as configured in this demo), this is mostly academic, but it means the model was designed with constrained deployment in mind from the start.

Proportional RoPE (p-RoPE). This is Google’s approach to handling variable-resolution images without blowing up the sequence length. Rather than naively tokenizing a high-resolution image into thousands of tokens, p-RoPE encodes spatial position proportionally, letting the model handle different resolutions with a fixed token budget.

Fixed image token budget. The llama.cpp server in this demo is launched with --image-min-tokens 70 --image-max-tokens 70, pinning the vision representation to exactly 70 tokens regardless of input resolution. This is conservative (the model supports budgets up to 1120 tokens) but it keeps memory usage predictable and inference latency manageable on constrained hardware. At 70 tokens, the webcam frame is being compressed aggressively; this is enough for object recognition and scene description but not for reading small text or fine-grained spatial reasoning.

The Tool-Calling Stack

The autonomy in this demo comes from llama.cpp’s --jinja flag, which enables native function-calling via Jinja template rendering. The model is given a single tool definition:

{
  "name": "look_and_answer",
  "description": "Take a photo with the webcam and analyze what is visible."
}

When the model generates a response, it can emit a structured tool call that the Python wrapper intercepts. The wrapper captures a frame from the webcam using OpenCV, base64-encodes it, and includes it in a follow-up request to the llama.cpp server’s OpenAI-compatible /v1/chat/completions endpoint. The model then generates its final response with visual context.

The full server launch looks like this:

~/llama.cpp/build/bin/llama-server \
  -m ~/models/gemma-4-E2B-it-Q4_K_M.gguf \
  --mmproj ~/models/mmproj-gemma4-e2b-f16.gguf \
  -c 2048 \
  --image-min-tokens 70 --image-max-tokens 70 \
  --ubatch-size 512 --batch-size 512 \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 --flash-attn on \
  --no-mmproj-offload --jinja -np 1

Some flags deserve explanation. -ngl 99 offloads all model layers to the GPU (99 is effectively “as many as exist”). --flash-attn on enables Flash Attention, which reduces the memory footprint of attention computation by avoiding materializing the full attention matrix. --no-mmproj-offload keeps the vision projector weights on the GPU rather than streaming them, which avoids a transfer bottleneck when processing images. The vision projector is the mmproj-gemma4-e2b-f16.gguf file, stored in full FP16 precision since it is a relatively small component (around 300MB) and quantizing it tends to degrade visual quality noticeably.

The --jinja flag is what threads everything together. Without it, the model can still generate text that looks like a function call, but the server will not parse it into a structured tool response. With Jinja enabled, the server handles the chat template natively and returns tool calls in the format the OpenAI client libraries expect.

Voice as the Interface Layer

The demo wraps the language model in a voice pipeline: NVIDIA Parakeet for speech-to-text and Kokoro for text-to-speech. Parakeet is NVIDIA’s ASR model, optimized for transcription quality on short utterances. Kokoro is a compact TTS model (82M parameters) that produces reasonably natural speech without requiring a server-class GPU.

The interaction model is push-to-talk: SPACE to start recording, SPACE again to stop and process. The first run downloads both models and pre-generates the Kokoro voice prompts, which adds about a minute of initialization. Subsequent runs start immediately.

This voice layer is important because it shifts the demo from a developer tool into something that resembles an actual edge AI assistant. The question asked aloud, the model reasoning about whether it needs to see the environment, the webcam activating, the spoken response — that flow is qualitatively different from a text interface, even if the underlying inference is identical.

What “VLA” Means Here Versus Robotics

Vision-Language-Action models in the robotics literature usually refer to models that output motor commands or manipulation policies — systems like RT-2 or OpenVLA, where the model controls a physical robot. That is not what this demo does. The “action” here is tool invocation: the model decides to call a function that captures a webcam frame.

That distinction is worth being clear about. This is a vision-language model with tool-calling capability running at the edge, not a manipulation policy. The VLA framing is aspirational rather than precise. What the demo actually demonstrates is that autonomous multimodal reasoning, where the model decides which modalities to use rather than receiving them all unconditionally, is now achievable on a $249 embedded board.

The gap between this and full robotics VLA is non-trivial. A manipulation policy needs to output continuous motor commands at high frequency with low latency, which requires either a much faster inference path or a separate low-level controller. The Jetson Orin Nano Super could plausibly run the language/vision reasoning as a high-level planner while a separate microcontroller handles real-time actuation, which is actually how most production robotics systems are architected anyway.

The Quantization Trade-off

Q4_K_M is a 4-bit quantization scheme from llama.cpp that uses K-quants: different layers get different bit widths based on their sensitivity to quantization error. The “M” variant applies medium-precision quantization to attention and feed-forward weights, with higher-precision treatment for the most sensitive layers. This typically costs 1-3% on benchmarks compared to BF16 but cuts memory by roughly 75%.

For a model with MMLU Pro at 60.0% and GPQA Diamond at 43.4% in BF16, the quantized version is still a capable reasoner. The vision benchmark (MMMU Pro at 44.2%) is what matters most for the VLA use case, and 70 image tokens is the more significant limitation there than quantization precision.

The demo notes Q3_K_M as a fallback if 8GB proves insufficient. Q3 is noticeably more degraded than Q4 on reasoning tasks, so it is genuinely a last resort rather than a recommended alternative.

Building It

The llama.cpp compilation for SM87 requires a CUDA-aware build:

cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="87" \
  -DGGML_NATIVE=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j4

The -j4 is appropriate for the Orin Nano’s 6-core Arm Cortex-A78AE CPU; building with more parallelism will exhaust memory. The GGUF model files come from Hugging Face, and there are over 131 quantized variants available.

The Python demo itself is a single file, Gemma4_vla.py, available at github.com/asierarranz/Google_Gemma. The dependencies are sounddevice for audio I/O, OpenCV for webcam capture, and the standard OpenAI Python client for llama.cpp communication.

Why This Matters for Edge Deployment

The interesting shift here is not that a 5B model runs on a Jetson. That has been possible with llama.cpp for a while. The shift is that a 5B model can now run natively multimodal inference, make autonomous decisions about which tools to invoke, and do this with a deployment stack that consists of a compiled llama.cpp binary, a GGUF file, and a 200-line Python script.

The alternative approach, calling a cloud vision API when the user says “look”, would have lower latency on good hardware but fails in offline or latency-sensitive environments. For industrial inspection, assistive robotics, or embedded monitoring systems, the ability to run the full reasoning stack locally, without a network dependency, changes what is feasible.

Gemma 4’s Apache 2.0 license makes it commercially deployable without royalties, which is the other half of the equation. The model, the runtime, and the tools are all open. The hardware is commodity. The remaining constraint is the engineering work to integrate this pattern into a real application, and that is a constraint that tends to shrink over time.

Was this interesting?