
Porting a 4B Parameter 3D Model to Apple Silicon: What Had to Change and Why

Source: hackernews

The interesting thing about this port of Microsoft’s TRELLIS.2 to Apple Silicon is not that it works, though it does, generating meshes of roughly 400,000 vertices from a single photograph in about 3.5 minutes on an M4 Pro with 24GB of unified memory. The interesting thing is the accounting: a few hundred lines of changes across nine files, and a 4-billion parameter state-of-the-art image-to-3D model runs offline on consumer hardware with no cloud dependency. That ratio deserves some examination.

What TRELLIS.2 Actually Does

Microsoft’s TRELLIS is a generative model for 3D assets. Given a single image, it produces a structured latent representation that combines geometry and appearance, which then gets decoded into a usable mesh with texture. The key design choice in TRELLIS is that it works in a sparse 3D space rather than a dense voxel grid, which makes the representation memory-efficient but also drives nearly every CUDA dependency in the codebase.

The model uses sparse 3D transformers and sparse 3D convolutions to process these representations. In 3D, most of the space is empty; a chair occupies maybe 5% of its bounding volume. Sparse representations track only the occupied regions, but this requires coordinate-indexed operations that are fundamentally different from the dense tensor arithmetic that standard deep learning frameworks are built around. Libraries like MinkowskiEngine and spconv exist specifically to provide GPU-accelerated sparse convolutions, and their GPU acceleration is built on CUDA.
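To put a number on the sparsity argument, here is a rough storage comparison between a dense voxel grid and a coordinates-plus-features layout. The grid resolution, channel count, and byte sizes are illustrative assumptions, not values taken from TRELLIS; only the 5% occupancy figure comes from the text.

```python
# Rough storage comparison: dense voxel grid versus sparse (coords + features).
# Resolution, channels, and byte sizes are illustrative assumptions.

def dense_bytes(resolution: int, channels: int, bytes_per_value: int = 2) -> int:
    """Every voxel stores a feature vector, whether it is occupied or not."""
    return resolution ** 3 * channels * bytes_per_value


def sparse_bytes(resolution: int, channels: int, occupancy: float,
                 bytes_per_value: int = 2, bytes_per_coord: int = 4) -> int:
    """Only occupied voxels store features, plus three integer coordinates each."""
    active = int(resolution ** 3 * occupancy)
    return active * (channels * bytes_per_value + 3 * bytes_per_coord)


dense = dense_bytes(resolution=64, channels=8)
sparse = sparse_bytes(resolution=64, channels=8, occupancy=0.05)
ratio = dense / sparse
print(f"dense: {dense} bytes, sparse: {sparse} bytes, ratio: {ratio:.1f}x")
```

The saving is smaller than the raw 20x that 5% occupancy suggests, because each active voxel also has to carry its own coordinates. That bookkeeping is what makes the operations coordinate-indexed.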

The final step of turning the sparse latent into a mesh involves extracting surfaces from the volumetric representation, which TRELLIS does with CUDA hash map operations for efficiency.

Then there is attention. TRELLIS uses FlashAttention (flash_attn), the memory-efficient attention algorithm by Tri Dao and colleagues that avoids materializing the full attention matrix by tiling the computation so that each block fits in on-chip SRAM. The flash_attn package is implemented as custom GPU kernels written for CUDA, and it has no Metal backend. It has become the default for large model training and inference precisely because the naive attention implementation is memory-bound in a way that becomes untenable at scale, but the solution is tightly coupled to CUDA’s programming model.
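The tiling idea can be shown without any GPU code. The sketch below is a plain-Python illustration of the running-max ("online") softmax that lets attention be computed one block of keys at a time; it is not FlashAttention's kernel, and the function names and block size are made up for the example.

```python
import math


def naive_attention(q, keys, values):
    """Materialize every score, apply softmax, then take the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, block=2):
    """Visit keys block by block, keeping only a running max, sum, and output."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same vector up to floating-point rounding, but the tiled version never holds more than one block of scores at a time. The real kernels apply the same recurrence to tiles of queries and keys held in on-chip memory.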

Finally, the mesh pipeline depends on nvdiffrast, NVIDIA’s differentiable rasterizer, which provides GPU-accelerated rasterization, texture filtering, and antialiasing through custom CUDA code.

So the original TRELLIS.2 requires: CUDA for sparse convolutions, CUDA for FlashAttention, CUDA for mesh extraction, and CUDA for differentiable rasterization. None of those run on a Mac. This is not a packaging problem or a minor compatibility gap; these are fundamentally CUDA-specific implementations.

The Three Substitutions

The port addresses each dependency in turn.

Sparse convolutions get replaced with a gather-scatter implementation in pure PyTorch. The basic idea is to express the sparse operation in terms of standard tensor indexing: gather the features at active spatial positions, apply the convolution weight matrix, and scatter the results back to their output positions. This is mathematically equivalent to a custom sparse kernel but makes no assumptions about the underlying hardware. It works on CPU, MPS, or any future PyTorch backend. The tradeoff is performance: a custom CUDA sparse convolution kernel can exploit the memory hierarchy and warp-level parallelism in ways that a generic gather-scatter cannot. But “slower than optimal” is much better than “does not run.”
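A minimal sketch of the gather-scatter pattern follows. It is not the port's code: the function name, the 3x3x3 kernel, the weight layout, and the "outputs only at active sites" (submanifold) convention are all assumptions chosen to keep the example short, and the neighbor search uses a plain Python dictionary.

```python
import itertools

import torch


def sparse_conv3d_gather_scatter(coords, feats, weight):
    """3x3x3 sparse convolution built from indexing and matmul only.

    coords: (N, 3) integer voxel coordinates of the active sites
    feats:  (N, C_in) features at those sites
    weight: (27, C_in, C_out), one matrix per kernel offset (assumed layout)
    Outputs are produced only at the active input sites.
    """
    lookup = {tuple(c): i for i, c in enumerate(coords.tolist())}
    out = torch.zeros(feats.shape[0], weight.shape[2],
                      dtype=feats.dtype, device=feats.device)
    offsets = list(itertools.product((-1, 0, 1), repeat=3))
    for k, (dx, dy, dz) in enumerate(offsets):
        in_idx, out_idx = [], []
        for (x, y, z), o in lookup.items():
            i = lookup.get((x + dx, y + dy, z + dz))
            if i is not None:
                in_idx.append(i)
                out_idx.append(o)
        if not in_idx:
            continue
        in_idx = torch.tensor(in_idx, device=feats.device)
        out_idx = torch.tensor(out_idx, device=feats.device)
        gathered = feats[in_idx]             # gather neighbor features
        contrib = gathered @ weight[k]       # ordinary dense matmul
        out.index_add_(0, out_idx, contrib)  # scatter-add into outputs
    return out
```

Every tensor operation here (`feats[in_idx]`, `@`, `index_add_`) is a standard PyTorch call, which is why the same code path can run on whatever device the tensors live on. The Python loop over active sites is where a tuned kernel would win.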

Attention gets replaced with torch.nn.functional.scaled_dot_product_attention (SDPA), PyTorch’s built-in scaled dot product attention introduced in PyTorch 2.0. On CUDA, PyTorch’s SDPA can dispatch to FlashAttention under the hood. On MPS, it uses Metal’s matrix multiply infrastructure. The interface is identical; only the backend differs. For sparse transformers, this means rewriting the attention calls to use SDPA rather than flash_attn’s API, which has some differences around masking and batching, but the underlying math is the same.
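As a quick check that the swap is numerically benign, the snippet below compares `scaled_dot_product_attention` against a hand-written softmax attention on random tensors. The tensor sizes are arbitrary and nothing here is taken from the TRELLIS code.

```python
import math

import torch
import torch.nn.functional as F


def naive_attention(q, k, v):
    """Reference attention that materializes the full score matrix."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
# Layout expected by SDPA: (batch, heads, tokens, head_dim). Sizes are arbitrary.
q = torch.randn(1, 4, 16, 32)
k = torch.randn(1, 4, 16, 32)
v = torch.randn(1, 4, 16, 32)

fused = F.scaled_dot_product_attention(q, k, v)
reference = naive_attention(q, k, v)
max_diff = (fused - reference).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

One practical difference when porting: `flash_attn_func` takes tensors laid out as (batch, tokens, heads, head_dim), so calls usually need a `transpose(1, 2)` on the way into SDPA and again on the way out.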

Mesh extraction gets a Python-based replacement for the CUDA hash map operations. Hash maps in CUDA are implemented as custom data structures that operate in parallel across thousands of threads. Replacing this with Python means losing the parallelism, but mesh extraction is not the dominant cost in a 3.5-minute pipeline.
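To give a sense of what a hash map does in this kind of pipeline, here is a generic sketch that uses a Python dictionary keyed by integer coordinates to give each shared cube corner a single vertex index. It illustrates the data structure only; the function name and the corner-deduplication task are assumptions, not a description of the port's mesh extraction code.

```python
def build_vertex_table(active_voxels):
    """Assign one index to each distinct cube corner of the active voxels."""
    corner_offsets = [(dx, dy, dz)
                      for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)]
    vertex_index = {}  # (x, y, z) corner -> vertex id
    cells = []         # per voxel: the eight vertex ids of its corners
    for (x, y, z) in active_voxels:
        ids = []
        for dx, dy, dz in corner_offsets:
            key = (x + dx, y + dy, z + dz)
            if key not in vertex_index:
                vertex_index[key] = len(vertex_index)
            ids.append(vertex_index[key])
        cells.append(ids)
    return vertex_index, cells


# Two voxels that share a face share four corners: 8 + 8 - 4 = 12 vertices.
vertex_index, cells = build_vertex_table([(0, 0, 0), (1, 0, 0)])
print(len(vertex_index))
```

A CUDA hash map performs the same insert-or-lookup step from thousands of threads at once; the dictionary version does it one key at a time.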

Notably, nvdiffrast appears to be either not required at inference time or handled through a different path in the port. The core generation pipeline does not use differentiable rasterization; that is more relevant for training.

PyTorch MPS and What It Can Do

PyTorch’s MPS backend uses Apple’s Metal Performance Shaders to execute tensor operations on the GPU. It arrived in PyTorch 1.12, developed in collaboration with Apple, and it has matured considerably since. Standard convolutions, matrix multiplications, elementwise operations, and most of the operations used in transformer inference work on MPS. The gaps are mostly at the edges: custom CUDA extensions obviously do not work, some operations are not implemented and either raise an error or, when PYTORCH_ENABLE_MPS_FALLBACK=1 is set, fall back to the CPU, and memory bandwidth characteristics differ from NVIDIA GPUs in ways that affect which optimization strategies apply.
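In code, targeting MPS mostly comes down to choosing the device. The helper below is a small sketch (the name `pick_device` is made up) that prefers CUDA, then MPS, then CPU, so the same script runs on any of the three.

```python
import torch


def pick_device() -> torch.device:
    """Prefer CUDA, then Apple's MPS backend, then plain CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
x = torch.randn(4, 8, device=device)
w = torch.randn(8, 2, device=device)
y = (x @ w).cpu()  # bring the result back to host memory for inspection
print(device, tuple(y.shape))
```

For operators that MPS does not implement, PyTorch reads the `PYTORCH_ENABLE_MPS_FALLBACK=1` environment variable and runs those operators on the CPU instead of raising an error. That gets a model running, at the cost of extra synchronization.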

Unified memory is the architectural difference that matters most here. On Apple Silicon, the CPU and GPU share the same physical memory pool. A 24GB M4 Pro has 24GB available to both, with no separate VRAM limit and no PCIe transfer overhead. A 4-billion parameter model in float16 takes roughly 8GB just for weights, which fits comfortably. On a system with a discrete GPU, you would need a card with at least 12-16GB of VRAM to avoid offloading, and high-VRAM consumer cards are expensive.
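The 8GB figure is just parameter count times bytes per parameter. The snippet below makes that arithmetic explicit for a few common precisions; it covers weights only and ignores activations and any framework overhead.

```python
def weight_gigabytes(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


params = 4e9  # 4 billion parameters, as stated in the article
fp32 = weight_gigabytes(params, 4)
fp16 = weight_gigabytes(params, 2)
int8 = weight_gigabytes(params, 1)
print(f"float32: {fp32} GB, float16: {fp16} GB, int8: {int8} GB")
```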

The H100 comparison in the original post puts the performance gap in perspective. On an H100, TRELLIS generates a mesh in seconds. On an M4 Pro, it takes 3.5 minutes. That is roughly a 20-40x difference. The gap in peak compute is larger still: an H100 is rated at roughly 1,000 TFLOPS of dense FP16 tensor-core throughput (about 2,000 with structured sparsity), while the M4 Pro GPU is in the 10-15 TFLOPS range. Neither machine runs this workload at its peak rate, so the measured gap is smaller than the spec-sheet ratio. The unified memory architecture helps with large models, but it does not close the raw throughput gap.
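Working backwards from the two figures quoted above (3.5 minutes on the M4 Pro and a 20-40x gap) shows what "seconds" means for the H100. This is arithmetic on the article's own numbers, not a new measurement.

```python
m4_pro_seconds = 3.5 * 60  # 210 seconds per mesh on the M4 Pro

# Implied H100 time at each end of the quoted 20-40x range.
implied_h100_slow = m4_pro_seconds / 20
implied_h100_fast = m4_pro_seconds / 40
print(f"implied H100 time: {implied_h100_fast} to {implied_h100_slow} seconds")
```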

For offline use, the comparison is not M4 Pro versus H100. The comparison is M4 Pro versus paying for cloud API access, waiting for rate limits, and sending your images to someone else’s server. At 3.5 minutes per mesh, local generation is entirely workable for creative and prototyping use.

Why CUDA Lock-In Is Hard to Escape

The broader pattern here is worth naming. Machine learning research has accumulated years of CUDA-specific implementations because CUDA gave researchers the tools to write fast custom kernels, and fast custom kernels enabled better results, and better results drove adoption of those kernels. FlashAttention is a good example: it is not just faster than naive attention, it made training certain model architectures feasible that were previously memory-limited. The result is a stack where the theoretical operations are hardware-agnostic but the implementations that actually run at scale are not.

This creates a persistent gap between “runs on CUDA” and “runs anywhere.” AMD’s ROCm ecosystem attempts to bridge this through HIP, a CUDA-compatible API with source-level porting tools. Apple’s MPS takes a different approach, exposing Metal through PyTorch’s operator dispatch system without trying to be CUDA-compatible. Both paths require porting work when a library has CUDA-specific kernels.

What the trellis-mac port demonstrates is that porting cost can be surprisingly low when the high-level model structure maps cleanly to standard PyTorch operations. The gather-scatter sparse convolution, SDPA attention, and Python mesh extraction are all straightforward substitutions that anyone familiar with PyTorch could implement given the original code. The effort was “a few hundred lines across 9 files,” which is a weekend project, not a team effort.

Projects like TripoSR and InstantMesh occupy a similar space in the image-to-3D landscape, and they have varied levels of portability. TripoSR in particular has seen broader adoption partly because its dependencies are more standard. TRELLIS produces higher-quality results but came with a higher portability cost, which this port now reduces.

What This Changes

Running a 4-billion parameter 3D generation model locally matters for a few reasons beyond the obvious. Privacy is one: your reference images stay on your machine. Iteration speed is another: no upload latency, no rate limits, no per-generation cost. For game developers, VFX artists, or anyone building pipelines around 3D asset generation, local inference changes what is practical to automate.

The hardware requirement is real. 24GB of unified memory is more than a base M4 machine ships with; it is where M4 Pro configurations start, and M4 Max configurations go higher. Generation time at 3.5 minutes is acceptable for deliberate use but not for tight feedback loops. And the port is explicitly not optimized, which means there is probably headroom left through operator fusion, quantization, or better MPS kernel selection.

But the port exists, it runs, and it required a weekend of work rather than a multi-month engineering effort. That says something useful about where the real barrier to Apple Silicon ML support lies: not in fundamental incompatibility, but in the specific CUDA extensions that libraries reach for when they want performance. Replace those extensions with PyTorch primitives and most modern model architectures run fine on MPS. The gap between CUDA and everything else is real, but it is increasingly one of optimization rather than impossibility.
