When malloc Is Not an Option: Allocation Strategies for Constrained Systems
Source: lobsters
Bill Hall’s memory allocation strategies series is often read as a deep-cut performance post, something to reach for when profiling reveals that malloc is a bottleneck. That framing misses half the audience. For embedded and safety-critical programmers, this series is required reading not because it makes allocation faster but because general-purpose allocators are not on the table at all.
MISRA C:2012 Rule 21.3 prohibits the use of malloc, calloc, realloc, and free in safety-critical C code. The CERT C coding standard's memory-management recommendations and DO-178C guidance for avionics software carry similar restrictions. The rationale is consistent across all of them: dynamic allocation introduces unbounded latency, fragmentation that accumulates over time, and failure modes that are difficult to reason about deterministically. When the software controls an aircraft flight management system or an automotive brake controller, “the heap ran out after two weeks of uptime” is not an acceptable failure mode.
The consequence is that anyone writing firmware, RTOS-based embedded software, or safety-critical C must understand what each allocation strategy actually does, because they will be implementing one themselves.
FreeRTOS as a Case Study
FreeRTOS ships five heap implementations, heap_1 through heap_5. Leaving aside heap_3, which simply wraps the C library's malloc and free with the scheduler suspended, they map almost exactly onto the strategies the gingerbill series covers, in order of increasing complexity.
heap_1.c is a pure bump allocator. It allocates by advancing a pointer into a statically declared array. vPortFree() is a stub that does nothing. Memory is never reclaimed. This is suitable for systems where tasks and queues are created at startup and then run forever: space is consumed once at initialization and the system runs indefinitely without further allocation.
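The heap_1 idea fits in a dozen lines. The sketch below is not the FreeRTOS source, just the same mechanism with illustrative names and a fixed 8-byte alignment:

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal sketch of the heap_1 idea: a statically sized arena with a
   monotonically advancing offset. Names are illustrative, not FreeRTOS's. */
typedef struct {
    uint8_t *buf;
    size_t   len;
    size_t   offset;
} Bump;

void bump_init(Bump *b, void *buf, size_t len) {
    b->buf = buf;
    b->len = len;
    b->offset = 0;
}

void *bump_alloc(Bump *b, size_t size) {
    /* Round the offset up to 8-byte alignment before handing out memory. */
    size_t aligned = (b->offset + 7u) & ~(size_t)7u;
    if (aligned + size > b->len) return NULL;  /* out of arena space */
    b->offset = aligned + size;
    return b->buf + aligned;
}

/* There is deliberately no bump_free(): memory is reclaimed only by
   resetting the whole arena, mirroring heap_1's stub vPortFree(). */
```

Because nothing is ever freed, exhaustion is a startup-time event: if initialization completes, the system cannot run out of heap later.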
heap_2.c adds a free list without coalescing. Freed blocks return to a singly linked list sorted by block size. Allocation walks the list for a best-fit block. This enables dynamic task creation and deletion but accumulates fragmentation over time since freed blocks of different sizes are never merged. FreeRTOS’s own documentation explicitly warns that heap_2 should not be used if allocation and deallocation sizes vary widely.
heap_4.c adds coalescing. When a block is freed and its neighbors are also free, they merge into a single larger block. This is the free list allocator from the gingerbill series’ fifth article, and it bounds fragmentation considerably better than heap_2 at the cost of a slightly more complex free operation.
heap_5.c extends heap_4 to span multiple non-contiguous memory regions, which matters when a microcontroller exposes separate SRAM banks at different addresses.
None of these are novel; they are textbook strategies. The FreeRTOS implementations are each under 300 lines of C. What matters is that the right choice depends entirely on the allocation pattern the application produces, and you cannot make that choice without understanding what each strategy costs.
The Pool Allocator Is the Workhorse
For most embedded applications, pool allocators do the heavy lifting. A pool allocator manages a fixed-size backing buffer divided into equal-sized chunks:
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *buf;
    size_t chunk_size;
    size_t capacity;   /* number of chunks */
    void *free_head;   /* intrusive free list */
} Pool;

void pool_init(Pool *p, void *buf, size_t buf_len, size_t chunk_size) {
    p->buf = buf;
    p->chunk_size = chunk_size;
    p->capacity = buf_len / chunk_size;
    p->free_head = NULL;
    /* link all chunks into the free list */
    for (size_t i = 0; i < p->capacity; i++) {
        void **chunk = (void **)(p->buf + i * chunk_size);
        *chunk = p->free_head;
        p->free_head = chunk;
    }
}

void *pool_alloc(Pool *p) {
    if (!p->free_head) return NULL;
    void *chunk = p->free_head;
    p->free_head = *(void **)chunk;  /* pop the head of the free list */
    return chunk;
}

void pool_free(Pool *p, void *chunk) {
    *(void **)chunk = p->free_head;  /* push the chunk back onto the list */
    p->free_head = chunk;
}
The trick at the center of this is that the free list pointer lives inside the unallocated chunk itself. You need chunk_size >= sizeof(void *) for this to work, but that is almost always satisfied. There is zero metadata overhead per allocation, no fragmentation (every free chunk can satisfy any request), and both operations are O(1).
A packet buffer allocator for a network driver is a canonical use case. Ethernet frames have a maximum size of 1518 bytes; you know upfront how many frames the system needs to buffer concurrently; a static pool of fixed-size packet buffers eliminates both fragmentation and latency variance on the receive path. The lwIP network stack uses exactly this approach through its pbuf pool system.
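A receive-path setup along those lines might look like the following sketch. The Pool operations from above are restated so the fragment stands alone; the 1536-byte chunk size (1518 rounded up for alignment) and the buffer count are assumptions, not lwIP's values:

```c
#include <stddef.h>
#include <stdint.h>

/* The Pool type and operations from the article, restated verbatim so this
   fragment compiles on its own. */
typedef struct {
    uint8_t *buf;
    size_t chunk_size;
    size_t capacity;
    void *free_head;
} Pool;

void pool_init(Pool *p, void *buf, size_t buf_len, size_t chunk_size) {
    p->buf = buf;
    p->chunk_size = chunk_size;
    p->capacity = buf_len / chunk_size;
    p->free_head = NULL;
    for (size_t i = 0; i < p->capacity; i++) {
        void **chunk = (void **)(p->buf + i * chunk_size);
        *chunk = p->free_head;
        p->free_head = chunk;
    }
}

void *pool_alloc(Pool *p) {
    if (!p->free_head) return NULL;
    void *chunk = p->free_head;
    p->free_head = *(void **)chunk;
    return chunk;
}

void pool_free(Pool *p, void *chunk) {
    *(void **)chunk = p->free_head;
    p->free_head = chunk;
}

/* Hypothetical receive path: the chunk size and count are assumptions. */
#define FRAME_CHUNK 1536u
#define FRAME_COUNT 8u

static uint8_t frame_storage[FRAME_CHUNK * FRAME_COUNT];
static Pool frame_pool;

void rx_buffers_init(void) {
    pool_init(&frame_pool, frame_storage, sizeof frame_storage, FRAME_CHUNK);
}
```

With the pool sized at design time, a dropped frame on exhaustion is a counted event rather than a heap failure, which is exactly the failure mode a driver can reason about.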
Alignment Is Not Optional
Every custom allocator has to solve alignment, and alignment is the most common source of subtle bugs in hand-rolled implementations. Most modern architectures do not fault on unaligned loads of ordinary scalars, but they pay a performance penalty, and some SIMD intrinsics require 16- or 32-byte alignment. And the faults are not entirely gone: an unaligned double access on ARMv6 still produces a bus fault, because 64-bit loads are excluded from that architecture's unaligned-access support.
The canonical solution is the power-of-two round-up:
static uintptr_t align_forward(uintptr_t ptr, size_t align) {
    /* align must be a power of two */
    uintptr_t a = (uintptr_t)align;
    uintptr_t mod = ptr & (a - 1);  /* ptr % align, since align is a power of two */
    if (mod != 0) ptr += a - mod;
    return ptr;
}
The & (align - 1) mask works because powers of two in binary are a 1 followed by zeros; subtracting 1 produces a mask of all the lower bits. This is two instructions on any architecture. The linear allocator in the gingerbill series applies this on every allocation and tracks both the current offset and the previous offset, enabling one-shot resize of the most recent allocation without invalidating the pointer.
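Applied to a linear allocator, that offset bookkeeping looks roughly like this. The struct layout and the fixed 8-byte alignment are illustrative, not the series' exact code:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a linear allocator that remembers the previous offset so the
   most recent allocation can be grown in place. Field names are illustrative. */
typedef struct {
    uint8_t *buf;
    size_t len;
    size_t prev_offset;  /* start of the most recent allocation */
    size_t curr_offset;
} Arena;

void *arena_alloc(Arena *a, size_t size) {
    /* Fixed 8-byte alignment for brevity; align_forward generalizes this. */
    size_t aligned = (a->curr_offset + 7u) & ~(size_t)7u;
    if (aligned + size > a->len) return NULL;
    a->prev_offset = aligned;
    a->curr_offset = aligned + size;
    return a->buf + aligned;
}

void *arena_resize_last(Arena *a, void *ptr, size_t new_size) {
    /* Only the most recent allocation can be resized without moving. */
    if ((uint8_t *)ptr != a->buf + a->prev_offset) return NULL;
    if (a->prev_offset + new_size > a->len) return NULL;
    a->curr_offset = a->prev_offset + new_size;
    return ptr;  /* same pointer, larger (or smaller) extent */
}
```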
For pool allocators, alignment only needs to be handled during initialization since all chunks share the same alignment. For stack allocators, it must be applied per-allocation since the stack pointer after each allocation depends on the size of the previous one.
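A minimal stack allocator makes the per-allocation case concrete. This is a sketch, restating align_forward so it stands alone; a production version would store a small header per allocation to validate LIFO ordering:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *buf;
    size_t len;
    size_t offset;
} Stack;

static uintptr_t align_forward(uintptr_t ptr, size_t align) {
    uintptr_t a = (uintptr_t)align;  /* align must be a power of two */
    uintptr_t mod = ptr & (a - 1);
    if (mod != 0) ptr += a - mod;
    return ptr;
}

void stack_init(Stack *s, void *buf, size_t len) {
    s->buf = buf;
    s->len = len;
    s->offset = 0;
}

void *stack_alloc(Stack *s, size_t size, size_t align) {
    uintptr_t base = (uintptr_t)s->buf;
    /* Alignment must be applied here, per allocation: where this allocation
       starts depends on the sizes of all the ones before it. */
    uintptr_t cur = align_forward(base + s->offset, align);
    size_t new_offset = (size_t)(cur - base) + size;
    if (new_offset > s->len) return NULL;
    s->offset = new_offset;
    return (void *)cur;
}

/* LIFO free: roll the offset back to the start of the popped allocation.
   Freeing out of order silently corrupts the stack in this bare sketch. */
void stack_free(Stack *s, void *ptr) {
    s->offset = (size_t)((uintptr_t)ptr - (uintptr_t)s->buf);
}
```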
The Buddy System and Bounded Fragmentation
The strategies above are all-or-nothing on fragmentation. Arena and pool allocators have no fragmentation at all by construction, but they sacrifice generality. Free list allocators support arbitrary allocation patterns but accumulate external fragmentation. The buddy system allocator sits between them.
Buddy allocation divides memory into power-of-two-sized blocks. Each block of size $2^k$ has a partner (the “buddy”) at a fixed offset: the partner of the block starting at offset p from the base of the managed region, with size s, sits at offset p XOR s. To allocate n bytes: round n up to the next power of two, find a free block of that size, and if none exists, split a larger block recursively. To free: check whether the buddy is free; if so, merge them and repeat upward.
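The two address computations at the heart of this are tiny. The helpers below are illustrative, with offsets measured from the base of the managed region:

```c
#include <stddef.h>

/* Buddy of the block at `offset` with power-of-two `size` (offset must be
   a multiple of size): flip the single bit that distinguishes the pair. */
static size_t buddy_of(size_t offset, size_t size) {
    return offset ^ size;
}

/* Round a request up to the next power of two; the slack is the bounded
   internal fragmentation the text describes. */
static size_t next_pow2(size_t n) {
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}
```

Because buddy_of is its own inverse, the free path can find and test the partner of any block in constant time, which is what makes the coalescing deterministic.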
The key property is that fragmentation is bounded. External fragmentation cannot exceed a factor of two, because any unused space adjacent to an allocation is its buddy, and buddies coalesce. Internal fragmentation (rounding up to the next power of two) can approach 50% but is bounded. The Linux kernel has used a buddy allocator for physical page allocation since at least the 2.0 era, and the implementation in mm/page_alloc.c has been refined for decades. The kernel also maintains per-order free lists (one list per power-of-two block size), so allocation time is proportional to the number of orders, which is logarithmic in the size of the managed region, rather than to the number of free blocks.
For embedded systems that need to manage a medium-sized RAM region with unpredictable but bounded allocation sizes, the buddy system is often the right choice over a free list: the coalescing behavior is deterministic, the maximum fragmentation is known at design time, and the worst-case allocation time is predictable.
Static Allocation as the Terminal Case
Pushed to the extreme, embedded systems often avoid dynamic allocation entirely by declaring all storage statically:
#define MAX_CONNECTIONS 32
static Connection connection_pool[MAX_CONNECTIONS];
static bool connection_used[MAX_CONNECTIONS];
This is just a pool allocator with the backing array in BSS and the free list replaced by a bitmap. It trades flexibility for complete determinism: you know at compile time how much memory the system uses, and that number does not change at runtime. Stack usage can be analyzed statically; tools like StackAnalyzer from AbsInt or GCC's -fstack-usage output (also supported by Clang) make this tractable. For DO-178C Level A software, static analysis of worst-case memory use is a certification requirement, and dynamic allocation undermines exactly that kind of analysis.
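The alloc and free operations over such arrays are a linear scan and a flag clear. This sketch restates the arrays with a placeholder Connection type so it compiles on its own:

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CONNECTIONS 32

typedef struct { int fd; } Connection;  /* placeholder fields for illustration */

static Connection connection_pool[MAX_CONNECTIONS];
static bool connection_used[MAX_CONNECTIONS];

Connection *connection_alloc(void) {
    for (size_t i = 0; i < MAX_CONNECTIONS; i++) {
        if (!connection_used[i]) {
            connection_used[i] = true;
            return &connection_pool[i];
        }
    }
    return NULL;  /* pool exhausted: a bounded, testable failure mode */
}

void connection_release(Connection *c) {
    connection_used[c - connection_pool] = false;
}
```

The scan is O(MAX_CONNECTIONS) in the worst case, which is a compile-time constant, so even the slow path has a provable bound.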
The MISRA prohibition is not conservatism for its own sake. Heap allocators introduce state that is genuinely difficult to analyze statically. Once you remove them, the entire memory layout of the system becomes deterministic, and verification tools can make stronger guarantees.
What the Series Is Actually Teaching
The gingerbill series presents these five strategies as a pedagogical progression, but the real lesson is that each strategy makes a different trade between generality and analyzability. The bump allocator is the most analyzable: given the buffer size and the allocation pattern, you can prove statically that it will not exhaust memory. The free list allocator is the least analyzable: worst-case fragmentation depends on the runtime sequence of allocations and frees in ways that require dynamic analysis.
When you are forced off malloc by a safety standard, a resource-constrained microcontroller, or a real-time latency requirement, this trade-off becomes concrete. The question is not “which allocator is fastest” but “which allocator’s failure modes can I reason about statically given my application’s allocation pattern.”
For most embedded workloads, the answer is a combination: a bump allocator for system initialization, pool allocators sized for each object type in the application, and no general-purpose allocator at all. The gingerbill series is a clean, complete reference for implementing each of these from scratch in C, with correct alignment handling and the in-place free list trick. In contexts where you cannot delegate allocation to a library, that knowledge is not optional.