Why AMD's Lemonade Chose ONNX Over GGUF for Local LLM Serving

The local LLM server space has been dominated by llama.cpp and Ollama for long enough that the GGUF format started to feel like gravity: universal, assumed, inescapable. AMD’s Lemonade is betting against that assumption. It is an open source, OpenAI-compatible inference server built around pre-compiled ONNX artifacts and a hardware-aware backend selection engine. The goal is to extract real-world throughput from AMD’s Ryzen AI NPUs while falling back gracefully to iGPU, discrete GPU, or CPU when needed.

The project grew out of TurnkeyML, AMD’s model analysis and compilation benchmarking framework. Where TurnkeyML focused on the build and measurement pipeline, Lemonade adds the persistent server layer and the OpenAI-compatible REST API surface that integrations like Continue.dev, Open WebUI, and Cursor already speak natively. That lineage matters because it explains Lemonade’s core design choice: rather than shipping a runtime that compiles models on the user’s machine, AMD runs the compilation offline using their Quark quantization toolkit, uploads the resulting ONNX artifacts to their Hugging Face organization, and has Lemonade pull pre-built binaries at download time. You never wait through a multi-hour first-run compile. You also never bring your own GGUF.

What the Architecture Actually Looks Like

Lemonade’s backend stack has three layers. For NPU inference on systems with a supported Ryzen AI processor, the path is: ONNX Runtime plus onnxruntime-genai (Microsoft’s ONNX GenAI package) driving AMD’s AIE execution provider. For integrated GPU inference on Windows, it uses the DirectML execution provider over the same ONNX Runtime stack. For everything else, it falls back to llama.cpp with GGUF models. The backend selection happens at server startup based on hardware probing, not per-request.

The OpenAI-compatible surface is standard: /v1/chat/completions, /v1/models, /v1/completions, all on port 8000 by default, with streaming via SSE. Starting the server is:

pip install lemonade-server
lemonade serve --model phi-3.5-mini-instruct

Model management goes through AMD’s HuggingFace organization. The pre-quantized artifacts live under identifiers like amd/Phi-3.5-mini-instruct-awq-g128-int4-bf16. Quark applies post-training quantization, including AWQ, GPTQ, and SmoothQuant variants, then packages the result as ONNX files compatible with the specific AIE dataflow layout on AMD’s NPU tiles.

The NPU and What 50 TOPS Actually Constrains

AMD’s Ryzen AI NPU uses their XDNA and XDNA 2 architectures: a spatial dataflow design with AI Engine array tiles connected by programmable interconnect. The Ryzen AI 300 series (Strix Point) ships with XDNA 2 at 50 declared TOPS, which AMD prominently features in marketing. That number describes peak INT8 throughput under ideal conditions, but for LLM decode, the binding constraint is memory bandwidth, not TOPS.

The NPU has no dedicated DRAM. Model weights live in system RAM (DDR5 on Strix Point, running at roughly 100 to 150 GB/s of total bandwidth), and the NPU consumes them through the SoC memory bus while using only a small on-chip SRAM scratchpad for intermediate activations. For context, NVIDIA’s H100 has around 3.35 TB/s of HBM bandwidth. The NPU advantage over CPU is real, but it is bounded by that shared system memory bandwidth ceiling.

In practice, a Ryzen AI 9 HX 370 (Strix Point, XDNA 2) running Phi-3.5-mini at INT4 on the NPU produces roughly 25 to 35 tokens per second for decode. The same model on the integrated GPU (Radeon 890M, 16 CUs) gets around 15 to 20 tokens per second. CPU inference with AVX512 falls further behind. Those numbers put Lemonade on NPU broadly in the same territory as a mid-tier GPU running llama.cpp, which makes the performance story feel underwhelming until you look at the power draw.

The NPU running at those throughputs draws around 5 to 8 watts. The CPU draws 25 to 45 watts doing the same work. On a laptop running a local coding assistant in the background for hours, that gap is the difference between a full workday of battery life and stopping at lunch. That is the actual use case Lemonade is optimized for, and it is a legitimate one.

Model Size and the NPU Sweet Spot

The pre-compiled model catalog reflects this constraint clearly. The models AMD has published ONNX artifacts for skew small: Phi-3.5-mini (3.8B), Llama-3.2-3B and 1B, Qwen2.5 at 0.5B, 1.5B, and 3B. Anything larger than about 4 billion parameters exceeds what the NPU can serve efficiently given the bandwidth ceiling, and models are routed to the integrated GPU or CPU instead.

This is the sharpest friction point for users coming from Ollama or llama.cpp. Ollama’s model library contains hundreds of models across sizes from sub-1B to 70B, pulled as GGUF files from Ollama’s registry and served with whatever backend your hardware supports. With Lemonade, you are constrained to models AMD has specifically compiled and uploaded. If you want a 7B model, you are using the llama.cpp fallback path, not the NPU path, and at that point you are probably better served by just running Ollama.

Advanced users can run Quark themselves on arbitrary Hugging Face models and produce compatible ONNX artifacts, but that is not a beginner workflow. The Quark documentation covers the quantization pipeline, and the output drops into Lemonade’s model directory the same way the pre-built artifacts do. That escape hatch exists, but the onboarding story assumes you use AMD’s catalog.

Where Strix Halo Changes the Calculus

The Ryzen AI Max series, based on AMD’s Strix Halo SoC, represents a more interesting platform. It ships with up to 128 GB of unified memory shared between CPU, NPU, and an integrated GPU with up to 40 compute units of RDNA 3.5. The theoretical memory bandwidth from that iGPU configuration reaches around 500 GB/s, which is competitive with mid-range discrete GPUs and meaningfully higher than the DDR5 ceiling on standard Strix Point.

On Strix Halo, a 7B model running on the integrated GPU through Lemonade’s DirectML path becomes a genuinely fast option. The unified memory eliminates the PCIe transfer bottleneck that hurts discrete GPU configurations. A Ryzen AI Max 395 laptop can plausibly run Llama-3.2-7B at 20 to 30 tokens per second on the iGPU while drawing a fraction of what a discrete GPU system would consume. That combination of large unified memory and decent GPU bandwidth may be more significant for Lemonade’s practical utility than the NPU story.

Comparison with the Existing Ecosystem

Ollama remains the more capable general-purpose local server for most users. It has NPU support on exactly zero hardware configurations, but it has Metal acceleration on Apple Silicon, ROCm support for AMD discrete GPUs on Linux, CUDA for NVIDIA, and a model library several orders of magnitude larger than Lemonade’s current catalog. Its community has built integrations across hundreds of tools. If you are on a Mac or Linux system with a discrete GPU, there is no compelling reason to look at Lemonade.

For Windows users on Ryzen AI hardware, the calculus is different. Ollama on Windows does not touch the NPU at all. The power efficiency argument is specific to that combination of hardware and use case. If you are running a local coding assistant on a Ryzen AI laptop and you care about battery life more than model selection, Lemonade offers something Ollama cannot.

LM Studio and Jan both provide GUI-first experiences with broader hardware support, but neither targets the NPU either. vLLM is a different category entirely, aimed at multi-GPU production serving with PagedAttention, and irrelevant for laptop deployment.

The closest architectural parallel is actually Apple’s CoreML pipeline for the Apple Neural Engine: offline compilation of platform-specific model artifacts, tight hardware coupling, and constraints on which models are available. Apple’s ANE ecosystem benefits from years of mature tooling and a much larger device footprint, which sets a realistic ceiling for what AMD can achieve if they execute consistently.

What This Is Building Toward

Lemonade is early, and the model catalog limitations are a real obstacle to broad adoption. The architecture AMD has built, specifically the hardware-aware backend selection, the pre-compilation pipeline through Quark, and the clean OpenAI-compatible API surface, is sound. The TurnkeyML heritage gives it a credible foundation for benchmarking and optimizing across hardware targets systematically.

The missing piece is model breadth. AMD needs to automate the Quark pipeline enough to cover the top 50 models from the Ollama library in the 1B to 7B range, publish those artifacts to HuggingFace on a regular cadence, and make the custom quantization path accessible enough for technically motivated users to extend the catalog themselves. Until that happens, Lemonade is a compelling proof of concept for the NPU efficiency story, but an incomplete tool for everyday use.

For developers on AMD Ryzen AI hardware who run local LLM tooling continuously, the power efficiency argument is worth taking seriously. For everyone else, it is worth watching. AMD has the hardware infrastructure and the open source commitment to make this competitive. Whether they ship fast enough to build community before Ollama adds NPU support is an open question.