The history of culling in real-time rendering is a history of moving work to where the data already lives. For most of the 3D graphics era, culling was a CPU job: walk a scene graph, test bounding boxes against the view frustum, skip draw calls for invisible objects. That worked fine when scenes had hundreds of meshes and the CPU was the only thing smart enough to make visibility decisions. Neither of those conditions holds anymore.
This is a topic the krupitskas.com post on modern culling techniques covers in solid technical detail, so rather than repeat that ground, I want to focus on the architectural shift underneath all of it: why culling had to move to the GPU, and what that changes about how you think about geometry budgets and pipeline design.
Why CPU Culling Broke Down
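CPU-side frustum culling works by testing each object’s axis-aligned bounding box (AABB) or bounding sphere against the six planes of the view frustum. The test itself is cheap: for a sphere with center c and radius r, you check the signed distance from c to each plane and reject if any distance is less than -r. For AABBs you can test each corner against each plane, or use the half-extents trick, which projects the box’s half-extents onto the absolute value of each plane normal to get an effective radius and then treats the box like a sphere for that plane. The version below assumes planes stored as (normal, d) with normals pointing into the frustum: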
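// Planes are vec4 (nx, ny, nz, d) with normals pointing into the frustum.
bool frustumTestAABB(const Frustum& f, glm::vec3 center, glm::vec3 extents) {
    for (int i = 0; i < 6; i++) {
        // Projected radius of the box along this plane's normal
        float r = glm::dot(glm::abs(glm::vec3(f.planes[i])), extents);
        // Signed distance from the box center to the plane
        float d = glm::dot(glm::vec3(f.planes[i]), center) + f.planes[i].w;
        // Entirely on the outside of one plane: cull
        if (d + r < 0.0f) return false;
    }
    return true;
}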
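The sphere variant described above is even shorter. The following is a minimal, dependency-free sketch rather than code from any particular engine: the `Plane` and `Sphere` structs and the `unitBoxFrustum` helper are hypothetical names introduced only for illustration, and the planes are assumed to have unit-length normals pointing into the frustum.

```cpp
#include <array>

// Plane stored as (a, b, c, d) with a unit-length normal (a, b, c) pointing
// into the frustum, so points inside satisfy a*x + b*y + c*z + d >= 0.
struct Plane { float a, b, c, d; };
struct Sphere { float x, y, z, radius; };

// Returns false only when the sphere lies entirely outside at least one plane.
bool frustumTestSphere(const std::array<Plane, 6>& planes, const Sphere& s) {
    for (const Plane& p : planes) {
        float signedDistance = p.a * s.x + p.b * s.y + p.c * s.z + p.d;
        if (signedDistance < -s.radius) return false;
    }
    return true;
}

// Hypothetical stand-in for a real frustum: a box spanning [-1, 1] on each
// axis, which keeps the example easy to reason about by hand.
std::array<Plane, 6> unitBoxFrustum() {
    return {{
        { 1.0f,  0.0f,  0.0f, 1.0f}, {-1.0f,  0.0f,  0.0f, 1.0f},
        { 0.0f,  1.0f,  0.0f, 1.0f}, { 0.0f, -1.0f,  0.0f, 1.0f},
        { 0.0f,  0.0f,  1.0f, 1.0f}, { 0.0f,  0.0f, -1.0f, 1.0f},
    }};
}
```

Like the AABB version, this test is conservative: a sphere near a frustum corner can pass all six plane tests while still being outside, which costs a little extra drawing but never drops visible geometry.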
The problem is not the per-object cost. It is the total count. A modern open-world scene might have millions of unique mesh instances, dense vegetation with per-blade geometry, or procedurally placed rocks across kilometers of terrain. The CPU cannot iterate that list at 60 Hz without dedicating multiple cores entirely to visibility, and even then the overhead of building and issuing draw calls for surviving objects saturates the graphics API driver before the GPU ever gets busy.
The deeper issue is that CPU culling operates at the wrong granularity. You cull whole objects, but objects are not the atomic unit of GPU work anymore. A single mesh with 500,000 triangles might have 90% of its surface behind the camera or occluded by something closer. CPU culling passes it through as visible because its bounding box intersects the frustum. The GPU then runs vertex processing for all 500,000 triangles and throws most of them away during clipping and back-face culling, just ahead of rasterization. That is wasted work, and geometry throughput is not free.
Meshlets Change the Unit of Culling
The solution is to break meshes into small clusters of triangles with a bounded size, called meshlets, then cull at the meshlet level rather than the object level. A typical meshlet contains 64 to 128 vertices and up to 128 primitives. The specific limits matter because they are chosen to fit GPU wavefront or warp sizes and mesh shader output limits, letting one thread group process one meshlet with little padding waste.
With meshlets you can run two culling passes that were not possible before. The first is frustum culling per meshlet using the meshlet’s own tight bounding sphere. The second, more interesting pass is cone culling for back-face rejection.
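Each meshlet precomputes a cone that encloses all the surface normals of its triangles. If the normalized vector from the camera to the meshlet makes a small enough angle with the cone axis, every triangle in that meshlet is back-facing relative to the camera. For a normal cone with half-angle α, the condition is that the dot product of the cone axis with that view vector is at least sin α (the cosine of 90° - α), and that value is what gets stored as the cone cutoff. Adding the bounding-sphere radius keeps the test conservative for triangles that sit away from the meshlet center. You can then reject the entire meshlet without rasterizing a single triangle: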
bool coneCull(vec3 center, float radius, vec3 coneAxis, float coneCutoff, vec3 cameraPos) {
    return dot(center - cameraPos, coneAxis) >=
           coneCutoff * length(center - cameraPos) + radius;
}
The cone is precomputed offline using tools like meshoptimizer, which also handles the meshlet partitioning. Meshoptimizer’s meshopt_buildMeshlets function constructs the clusters and meshopt_computeMeshletBounds generates the bounding sphere and cone data. The quality of the clustering matters: cone culling only works when the normals inside a meshlet point in similar directions, so a partition that groups similarly oriented, spatially adjacent triangles produces narrower cones that reject more often.
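To make the cone data concrete, here is a simplified sketch of how a normal cone and its cutoff could be derived for one meshlet. It is not meshoptimizer's algorithm: the axis is just the normalized sum of the triangle normals, and `Vec3`, `NormalCone`, and `computeNormalCone` are hypothetical names. The cutoff is stored as the sine of the cone half-angle so it can be plugged straight into the cone test shown earlier.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static float length(Vec3 v) { return std::sqrt(dot(v, v)); }

struct NormalCone {
    Vec3 axis;     // unit vector, here the normalized sum of triangle normals
    float cutoff;  // sin(half-angle); 2.0 marks a cone that can never cull
};

// Simplified fit (an assumption of this sketch): average the unit normals to
// get the axis, then widen the cone until it contains every normal.
NormalCone computeNormalCone(const std::vector<Vec3>& unitNormals) {
    Vec3 sum{0.0f, 0.0f, 0.0f};
    for (Vec3 n : unitNormals) { sum.x += n.x; sum.y += n.y; sum.z += n.z; }
    float len = length(sum);
    if (len < 1e-6f) return {{0.0f, 0.0f, 1.0f}, 2.0f};  // normals cancel out
    Vec3 axis{sum.x / len, sum.y / len, sum.z / len};

    float minDot = 1.0f;  // cosine of the widest angle between axis and a normal
    for (Vec3 n : unitNormals) minDot = std::min(minDot, dot(axis, n));
    if (minDot <= 0.0f) return {axis, 2.0f};  // spans a hemisphere or more

    // All normals lie within angle a of the axis, where cos(a) = minDot.
    // Every triangle is back-facing once dot(view, axis) >= sin(a).
    return {axis, std::sqrt(1.0f - minDot * minDot)};
}

// Same test as the shader version: true means the whole meshlet is back-facing.
bool coneCull(Vec3 center, float radius, const NormalCone& cone, Vec3 cameraPos) {
    Vec3 view = sub(center, cameraPos);
    return dot(view, cone.axis) >= cone.cutoff * length(view) + radius;
}
```

A patch whose normals all lean toward +z is rejected when the camera sits well behind it on the -z side and kept when the camera is in front or off to the side, where at least one triangle may still face the viewer.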
On modern hardware this whole pipeline runs through mesh shaders. Task shaders (amplification shaders in DirectX 12 parlance) receive one thread per meshlet and decide whether to emit that meshlet for rasterization. Surviving meshlets get passed to the mesh shader stage, which outputs the actual triangle data. The GPU processes the culling entirely on-chip, with no CPU involvement and no readback.
Hierarchical Z Occlusion
Frustum and cone culling handle visibility against the view volume and surface orientation, but neither addresses occlusion: objects that are within the frustum and front-facing but hidden behind something closer. Occlusion culling has historically been the hard problem because it requires knowing what has already been drawn.
The standard GPU approach is a hierarchical Z-buffer, usually called Hi-Z. You downsample the depth buffer from the previous frame into a full mipmap chain, where each mip level stores the maximum depth value within each 2x2 block of the level below. This assumes a conventional depth range where larger values are farther away; with reversed-Z you store the minimum instead. The last, coarsest mip covers the entire screen as a single texel containing the overall maximum depth in the scene.
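On the GPU this reduction is usually a compute or fragment pass per mip, but the logic is easy to see in a CPU sketch. The code below is illustrative only: `DepthLevel` and `buildHiZ` are hypothetical names, depth is assumed to grow with distance, and mip sizes are rounded up so that odd-sized levels never lose their last row or column.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct DepthLevel {
    int width = 0, height = 0;
    std::vector<float> texels;  // row-major, larger value = farther away

    // Clamped read so odd-sized levels can be sampled one texel past the edge.
    float at(int x, int y) const {
        x = std::min(x, width - 1);
        y = std::min(y, height - 1);
        return texels[static_cast<std::size_t>(y) * width + x];
    }
};

// Builds the full chain: mips[0] is the input, the last entry is 1x1.
std::vector<DepthLevel> buildHiZ(DepthLevel base) {
    std::vector<DepthLevel> mips;
    mips.push_back(std::move(base));
    while (mips.back().width > 1 || mips.back().height > 1) {
        const DepthLevel& src = mips.back();
        DepthLevel dst;
        dst.width = std::max(1, (src.width + 1) / 2);
        dst.height = std::max(1, (src.height + 1) / 2);
        dst.texels.resize(static_cast<std::size_t>(dst.width) * dst.height);
        for (int y = 0; y < dst.height; ++y) {
            for (int x = 0; x < dst.width; ++x) {
                // Keep the farthest depth of the 2x2 block: the conservative choice.
                float top = std::max(src.at(2 * x, 2 * y), src.at(2 * x + 1, 2 * y));
                float bottom = std::max(src.at(2 * x, 2 * y + 1), src.at(2 * x + 1, 2 * y + 1));
                dst.texels[static_cast<std::size_t>(y) * dst.width + x] = std::max(top, bottom);
            }
        }
        mips.push_back(std::move(dst));  // src is not used after this point
    }
    return mips;
}
```

Taking the maximum is what makes the pyramid safe to cull against: a coarse texel never claims the scene is closer than it really is anywhere inside the block it covers.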
To test a candidate object, you project its bounding sphere or AABB into screen space, find the mip level whose texel size matches the projected footprint, and sample the maximum depth at that level. If the object’s near depth is greater than the sampled maximum, every pixel it would cover is already occupied by something closer, and you can safely reject it.
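A CPU sketch of that query might look like the following. Everything here is an assumption made for illustration: the `HiZLevel` and `ScreenRect` types, the convention that the rectangle is given in full-resolution pixels, and the rule of picking the level at which the footprint spans at most about two texels per axis. Depth is again assumed to grow with distance.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct HiZLevel {
    int width, height;
    std::vector<float> maxDepth;  // row-major, farthest depth per texel
};

// Projected bounds in full-resolution pixels plus the object's nearest depth.
struct ScreenRect { float minX, minY, maxX, maxY, nearestDepth; };

bool isOccluded(const std::vector<HiZLevel>& mips, const ScreenRect& r) {
    // Pick the level whose texel size is at least the footprint's longer side.
    float sizePx = std::max(r.maxX - r.minX, r.maxY - r.minY);
    int level = sizePx <= 1.0f ? 0 : static_cast<int>(std::ceil(std::log2(sizePx)));
    level = std::min(level, static_cast<int>(mips.size()) - 1);

    const HiZLevel& mip = mips[level];
    float scale = 1.0f / static_cast<float>(1 << level);
    auto toTexel = [scale](float px, int limit) {
        int t = static_cast<int>(std::floor(px * scale));
        return std::max(0, std::min(t, limit - 1));
    };
    int x0 = toTexel(r.minX, mip.width), x1 = toTexel(r.maxX, mip.width);
    int y0 = toTexel(r.minY, mip.height), y1 = toTexel(r.maxY, mip.height);

    // Farthest depth already on screen anywhere under the footprint.
    float farthest = 0.0f;
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            farthest = std::max(farthest, mip.maxDepth[static_cast<std::size_t>(y) * mip.width + x]);

    // Occluded only if even the object's nearest point is behind all of it.
    return r.nearestDepth > farthest;
}
```

With this level choice the loop touches a handful of texels regardless of how large the object is on screen, which is the point of the hierarchy.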
The hierarchical structure is critical for performance. Without it you would need to read every full-resolution depth texel under each object’s footprint, which is both a large number of fetches for big objects and terrible for cache behavior as the samples scatter across a 4K texture. The mip chain lets objects with a large screen footprint query a coarse level, where a few texels cover the whole footprint, while small objects use finer levels for precision. The number of samples per object stays roughly constant, and cache coherence follows naturally.
Using the previous frame’s depth buffer introduces a one-frame lag, which means fast-moving newly visible objects can be dropped for a single frame. In practice this is rarely noticeable because objects do not teleport; the error corrects itself on the next frame, once the depth buffer reflects the occluder’s new position.
Two-Phase Occlusion Culling
The most sophisticated GPU-driven approach combines Hi-Z with a two-phase render loop to handle the chicken-and-egg problem: you need a depth buffer to do occlusion culling, but you need to have rendered something to have a depth buffer.
The original formulation of this technique comes from work by Ulrich Haar and Sebastian Aaltonen at Ubisoft, presented at SIGGRAPH 2015. The structure is:
Phase 1: Start with the depth buffer from the previous frame. Run GPU-side frustum and Hi-Z occlusion culling. Draw all meshlets that survive. This produces a depth buffer that is mostly correct for the current frame.
Phase 2: Run a second culling pass using the depth buffer just written. This pass only processes objects that were rejected by the occlusion test in phase 1; objects that failed the frustum test stay culled. Some of the retested objects will now be visible because the phase 1 depth buffer reflects the current frame rather than the previous one. Draw those survivors.
The end result is that virtually no visible geometry gets dropped. Objects that were incorrectly rejected in phase 1 due to the stale previous-frame depth buffer get a second chance in phase 2. The total GPU time is dominated by phase 1 since most geometry is consistent frame to frame; phase 2 only picks up stragglers.
In pseudocode the loop looks like this:
// Build Hi-Z from previous frame depth buffer
prevHiZ = buildHierarchicalZ(prevDepth);

// Phase 1: cull against previous frame's Hi-Z
for each meshlet in scene:
    if !frustumTest(meshlet):
        continue                        // outside the view: never retested
    if occluded(meshlet, prevHiZ):
        append meshlet to retestList    // occlusion reject: gets a second chance
    else:
        emit meshlet for draw
drawPhase1();

// Build Hi-Z from current phase 1 depth
currentHiZ = buildHierarchicalZ(currentDepth);

// Phase 2: re-test phase 1 occlusion rejects against current Hi-Z
for each meshlet in retestList:
    if !occluded(meshlet, currentHiZ):
        emit meshlet for draw
drawPhase2();
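To check the logic of the loop, it helps to run it on a toy model. The sketch below is a CPU-only simulation under heavy simplifying assumptions: the screen is a single row of pixels, every object is a constant-depth `Span`, the frustum test is omitted because everything is in view, and the occlusion test reads the depth row directly instead of a Hi-Z pyramid. All names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// An object covering pixels [begin, end) at a constant depth (smaller = closer).
struct Span { int begin, end; float depth; };
using DepthBuffer = std::vector<float>;  // 1.0 = far plane

// True when every covered pixel already holds something strictly closer.
static bool occluded(const Span& s, const DepthBuffer& depth) {
    for (int x = s.begin; x < s.end; ++x)
        if (depth[static_cast<std::size_t>(x)] >= s.depth) return false;
    return true;
}

static void draw(const Span& s, DepthBuffer& depth) {
    for (int x = s.begin; x < s.end; ++x) {
        float& d = depth[static_cast<std::size_t>(x)];
        d = std::min(d, s.depth);
    }
}

// Returns the indices of the spans drawn this frame, in draw order, and
// leaves the finished depth buffer in currentDepth.
std::vector<int> renderTwoPhase(const std::vector<Span>& scene,
                                const DepthBuffer& prevDepth,
                                DepthBuffer& currentDepth) {
    currentDepth.assign(prevDepth.size(), 1.0f);
    std::vector<int> drawn, retest;

    // Phase 1: test against last frame's depth, draw the survivors.
    for (int i = 0; i < static_cast<int>(scene.size()); ++i) {
        if (occluded(scene[i], prevDepth)) {
            retest.push_back(i);
        } else {
            draw(scene[i], currentDepth);
            drawn.push_back(i);
        }
    }

    // Phase 2: retest the rejects against a snapshot of the phase 1 depth,
    // mirroring a Hi-Z pyramid that is rebuilt once between the phases.
    const DepthBuffer phase1Depth = currentDepth;
    for (int i : retest) {
        if (!occluded(scene[i], phase1Depth)) {
            draw(scene[i], currentDepth);
            drawn.push_back(i);
        }
    }
    return drawn;
}
```

If a wall slides sideways between frames, an object it used to hide is rejected in phase 1 and recovered in phase 2, while an object still behind the wall stays culled in both phases.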
In a real renderer, all of this runs on the GPU. The CPU submits a handful of indirect dispatch commands at the start of the frame and reads no results back. The GPU’s indirect draw infrastructure, available through vkCmdDrawIndexedIndirect in Vulkan (or vkCmdDrawIndexedIndirectCount when the number of draws is also produced on the GPU) or ExecuteIndirect in DirectX 12, lets compute shaders write draw arguments directly into GPU buffers that the rasterization pipeline then consumes.
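The compaction step that fills those argument buffers can be modeled on the CPU. In the sketch below the command struct uses the same field order as Vulkan's VkDrawIndexedIndirectCommand, while `MeshletDraw`, `compactVisibleDraws`, and the choice to pass the meshlet index through `firstInstance` are assumptions made for illustration. The loop stands in for a compute dispatch in which each thread handles one meshlet and claims an output slot with an atomic add.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Same field order and sizes as Vulkan's VkDrawIndexedIndirectCommand.
struct DrawIndexedIndirectCommand {
    std::uint32_t indexCount;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  vertexOffset;
    std::uint32_t firstInstance;
};
static_assert(sizeof(DrawIndexedIndirectCommand) == 20, "must match the GPU layout");

// Hypothetical per-meshlet record; `visible` is the culling pass result.
struct MeshletDraw {
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  vertexOffset;
    bool visible;
};

// Writes one command per visible meshlet into a densely packed array and
// returns the number of commands, which is what a count buffer would hold.
std::uint32_t compactVisibleDraws(const std::vector<MeshletDraw>& meshlets,
                                  std::vector<DrawIndexedIndirectCommand>& outCommands) {
    outCommands.resize(meshlets.size());
    std::atomic<std::uint32_t> drawCount{0};
    for (std::uint32_t i = 0; i < meshlets.size(); ++i) {
        const MeshletDraw& m = meshlets[i];
        if (!m.visible) continue;
        // On the GPU many threads would race here; the atomic add hands each a unique slot.
        std::uint32_t slot = drawCount.fetch_add(1u, std::memory_order_relaxed);
        // firstInstance carries the meshlet index so a shader can look up its data.
        outCommands[slot] = {m.indexCount, 1u, m.firstIndex, m.vertexOffset, i};
    }
    return drawCount.load(std::memory_order_relaxed);
}
```

The CPU never needs to know how many meshlets survived: it records one indirect draw that points at the command buffer and the count, and the GPU fills in both.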
The Nanite Precedent
Unreal Engine 5’s Nanite system is the most prominent production implementation of this general architecture. Nanite extends it with a LOD system built around a cluster hierarchy: meshes are stored as a tree of meshlet clusters at different levels of detail, and the culling pass also decides which LOD level to use for each visible cluster based on projected screen size.
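The screen-size criterion can be illustrated with a much simpler model than a cluster hierarchy. The sketch below assumes a flat list of LOD levels per object, each with a known world-space geometric error, and picks the coarsest level whose error projects to at most one pixel. This is a generic formulation, not Nanite's selection code, and `LodLevel`, `projectedErrorPixels`, and `selectLod` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// World-space deviation from the full-detail mesh. Levels are assumed to be
// sorted from finest (smallest error) to coarsest (largest error).
struct LodLevel { float geometricError; };

// Size in pixels that a world-space length covers at the given distance,
// for a perspective projection with vertical field of view fovY (radians).
float projectedErrorPixels(float worldError, float distance, float fovY, float screenHeightPx) {
    float pixelsPerUnitAtUnitDistance = screenHeightPx / (2.0f * std::tan(fovY * 0.5f));
    return worldError * pixelsPerUnitAtUnitDistance / std::max(distance, 1e-4f);
}

// Returns the index of the coarsest level whose error stays under the threshold.
int selectLod(const std::vector<LodLevel>& levels, float distance, float fovY,
              float screenHeightPx, float maxErrorPx = 1.0f) {
    int chosen = 0;  // fall back to the finest level
    for (int i = 0; i < static_cast<int>(levels.size()); ++i) {
        float px = projectedErrorPixels(levels[i].geometricError, distance, fovY, screenHeightPx);
        if (px <= maxErrorPx) chosen = i;
    }
    return chosen;
}
```

A per-cluster scheme makes the same kind of decision at a much finer grain, so different parts of one mesh can sit at different levels of detail in the same frame.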
Nanite’s visibility buffer approach is also worth noting. Rather than rendering geometry into a traditional G-buffer, the first pass writes only a 64-bit visibility value per pixel. Depth sits in the high bits, so a single 64-bit atomic max doubles as the depth test, and the remaining bits identify the visible cluster and the triangle within that cluster. Material evaluation happens in a second screen-space pass that reads this buffer and runs material shaders only for pixels that actually appear on screen. This completely eliminates overdraw cost for material evaluation, which matters most for scenes with complex surface shaders.
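A small sketch shows why packing depth above the IDs lets one atomic operation act as both depth test and write. The bit split used here (32 bits of depth, 25 bits of cluster index, 7 bits of triangle index) is chosen for readability and is not claimed to be Nanite's actual layout; the function names are hypothetical, and depth is assumed to be reversed-Z in [0, 1] so that larger means closer.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

constexpr unsigned kTriangleBits = 7;  // 2^7 = 128 triangles per cluster
constexpr std::uint32_t kTriangleMask = (1u << kTriangleBits) - 1u;

// Packs reversed-Z depth (larger = closer) above the cluster and triangle IDs.
std::uint64_t packVisibility(float reversedDepth, std::uint32_t clusterId, std::uint32_t triangleId) {
    std::uint32_t depthBits;
    // Non-negative floats order the same way as their raw bit patterns.
    std::memcpy(&depthBits, &reversedDepth, sizeof depthBits);
    std::uint32_t payload = (clusterId << kTriangleBits) | (triangleId & kTriangleMask);
    return (static_cast<std::uint64_t>(depthBits) << 32) | payload;
}

// Atomic max via a compare-exchange loop: the closest fragment wins the pixel.
void writePixel(std::atomic<std::uint64_t>& pixel, std::uint64_t value) {
    std::uint64_t current = pixel.load(std::memory_order_relaxed);
    while (current < value &&
           !pixel.compare_exchange_weak(current, value, std::memory_order_relaxed)) {
        // current now holds the latest stored value; the loop condition retries
    }
}

struct VisibleTriangle { std::uint32_t clusterId, triangleId; };

// Recovers the IDs that the material pass would use to fetch triangle data.
VisibleTriangle unpackVisibility(std::uint64_t value) {
    std::uint32_t payload = static_cast<std::uint32_t>(value & 0xFFFFFFFFu);
    return {payload >> kTriangleBits, payload & kTriangleMask};
}
```

Because depth occupies the most significant bits, comparing two packed values compares depth first, and the IDs simply ride along with whichever fragment wins.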
The combination of fine-grained GPU-driven culling and deferred material evaluation is what lets Nanite handle scenes with hundreds of millions of triangles at real-time frame rates on current console and PC hardware.
Where This Leaves Traditional Approaches
None of this means that old-school CPU culling is useless. For smaller projects, scenes with modest triangle budgets, or targets without mesh shader support (which still includes most mobile hardware and older consoles), the overhead of setting up a full GPU-driven pipeline outweighs the benefit. A scene with 2,000 objects does not need indirect draws and compute-side Hi-Z; a BVH traversal on the CPU works fine.
Mesh shaders require DirectX 12 Ultimate or Vulkan with VK_EXT_mesh_shader, which has been broadly supported on desktop since around 2022 but is still absent on many embedded and mobile targets. Even on desktop, support does not guarantee speed: NVIDIA’s Turing generation was the first to expose mesh shaders, and throughput differs enough between vendors and hardware generations that the mesh shader path should be profiled on the oldest GPUs you intend to support rather than assumed to be faster.
The practical answer for most engines is a tiered system: GPU-driven culling with meshlets for high-end desktop targets, traditional CPU frustum culling with per-object Hi-Z for mid-range targets, and conservative CPU-only frustum culling for mobile. The code paths diverge in the scene submission layer; the rest of the engine sees the same abstract draw list either way.
What Actually Matters in Practice
Having spent time looking at rendering code across several projects, I have found that the culling technique is often not the bottleneck people expect. Hi-Z occlusion culling gives the largest wins in scenes with dense occlusion: urban environments, caves, interior spaces. Open terrain or space scenes have low occlusion depth and gain little from Hi-Z; frustum culling and LOD selection do most of the work there.
Meshlet cone culling typically saves 30-50% of triangles for organic meshes like characters and vegetation, where large portions of the surface are consistently back-facing from typical view angles. Hard-surface architecture meshes often have normals distributed in all directions and benefit less from cone culling, though frustum culling at the meshlet level still helps.
The real architectural gain from the GPU-driven approach is not culling efficiency in isolation. It is that the CPU is no longer in the hot path for scene submission. Removing the CPU bottleneck on draw call throughput unlocks the ability to fill the GPU completely, which is where the actual frame time comes from. Better culling is almost a side effect of a pipeline designed primarily to saturate GPU execution units without CPU intervention.