The Privacy Claim That Makes Distributed Mac Inference Interesting
Source: hackernews
When Darkbloom landed on Hacker News with nearly 500 points, the obvious read was that it was one more entry in the growing list of projects trying to build a distributed inference network out of consumer hardware. That framing undersells the harder question the project raises: what does “private” actually mean when your prompt is being processed by hardware you do not own?
The hardware half of the pitch is straightforward and well-timed. Apple Silicon has made Macs into surprisingly capable inference nodes. A MacBook Pro with an M3 Max and 128 GB of unified memory can run a 4-bit quantized Llama 3 70B model entirely in memory, at roughly 8 to 10 tokens per second with llama.cpp or Apple’s own MLX framework. An M2 Ultra can address up to 192 GB of unified memory, which means models that would otherwise require a multi-GPU server fit on a single desktop. The memory bandwidth numbers matter here: the M3 Max delivers 300 or 400 GB/s depending on the chip variant (the 128 GB configuration is the 400 GB/s part), the M4 Pro reaches 273 GB/s, and single-stream token generation for a dense transformer model is almost entirely memory-bandwidth-bound, because every generated token requires reading the full set of weights. These are not datacenter numbers, but they are not toy numbers either.
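The bandwidth-bound claim can be turned into back-of-the-envelope arithmetic. The sketch below is a rough ceiling rather than a benchmark: it assumes a dense model whose weights are all read once per generated token, ignores the KV cache and compute time, and the function names and the 4.5 bits-per-weight figure are illustrative assumptions.

```python
def quantized_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB for a given quantization width."""
    return params_billion * bits_per_weight / 8


def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed if each generated token
    requires streaming all weights from memory exactly once."""
    return bandwidth_gb_s / weights_gb


# Assumption: a 70B-parameter model at about 4.5 bits per weight
# (4-bit quantization plus per-block scale overhead).
weights_gb = quantized_weights_gb(70, 4.5)

# Assumed illustrative memory bandwidths in GB/s.
for bandwidth in (300, 400):
    ceiling = decode_ceiling_tokens_per_s(bandwidth, weights_gb)
    print(f"{bandwidth} GB/s -> at most {ceiling:.1f} tokens/s")
```

Measured throughput lands below this ceiling, but the ratio explains why memory bandwidth, not raw compute, is the number to compare across chips for this workload.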
The idle part of the pitch also makes sense. Macs spend a substantial portion of each day sitting at low utilization while their owners are in meetings, asleep, or working on documents. The compute is there and going unused. Darkbloom’s basic premise, aggregating that idle capacity into a usable inference service, is structurally similar to what SETI@home and Folding@home did for scientific computing, or what the Petals project demonstrated for running large language models cooperatively across volunteer machines. The exo project from 2024 took this further by building a distributed inference ring specifically for Apple Silicon devices, letting a cluster of Macs split a model across layers and serve inference together over a local or wide-area network.
The model-splitting approach exo uses is worth understanding because it frames the privacy problem clearly. In a layer-partitioned setup, each participating device holds a contiguous slice of the model’s layers. A prompt enters the first node, which processes the embedding layer and the first set of transformer blocks, then passes the resulting hidden states to the next node, which processes the next set of layers, and so on until the final node produces output tokens. The computational graph is distributed, but the data flowing between nodes, the hidden state activations, must be in plaintext on each node for that node to compute on it. Any node can, in principle, observe what passes through it. Because the weights of an open model are public, a sophisticated operator of a later node could, with enough effort, invert or probe those intermediate activations to recover information about the prompt. The first node, which sees the raw prompt tokens, can read the prompt directly.
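A toy pipeline makes the exposure concrete. This is a minimal sketch of layer partitioning in general, not exo's implementation: the `Node` class, the two-dimensional "embedding", and the arithmetic "blocks" are all hypothetical stand-ins for real layers.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One machine holding a contiguous slice of layers (hypothetical)."""
    name: str
    layers: list
    observed: list = field(default_factory=list)

    def forward(self, x):
        # Whatever arrives here is plaintext from this node's point of view.
        self.observed.append(x)
        for layer in self.layers:
            x = layer(x)
        return x


def embed(token_ids):
    # Toy embedding: map each token id to a 2-dimensional vector.
    return [[float(t), float(t) * 0.5] for t in token_ids]


def block(scale):
    # Toy stand-in for a transformer block: an elementwise affine map.
    def apply(hidden):
        return [[scale * v + 1.0 for v in row] for row in hidden]
    return apply


def run_pipeline(nodes, token_ids):
    x = token_ids
    for node in nodes:
        x = node.forward(x)
    return x


nodes = [
    Node("node-0", [embed, block(2.0)]),  # sees the raw prompt tokens
    Node("node-1", [block(0.5)]),         # sees hidden states only
    Node("node-2", [block(3.0)]),         # sees hidden states only
]
prompt = [101, 7, 42]
output = run_pipeline(nodes, prompt)

for node in nodes:
    print(node.name, "observed:", node.observed[0])
```

The first node's record is the prompt itself, while later nodes record activations. Encrypting the links between nodes would not change what each `observed` list contains.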
This is the gap that makes “private inference” a genuinely hard engineering problem rather than a marketing claim. There are four broad approaches, each with real costs.
Homomorphic encryption allows computation over ciphertext, so a compute node could process inference without ever decrypting the prompt or activations. Microsoft SEAL and similar libraries make this theoretically possible. The practical cost is that HE operations run anywhere from 1,000 to 1,000,000 times slower than plaintext equivalents depending on the operation and scheme. Running transformer attention over homomorphically encrypted inputs is not close to feasible at conversational latency today.
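To show what computing over ciphertext looks like, here is a toy example using the Paillier cryptosystem. Note the substitution: Paillier is only additively homomorphic and is not one of the schemes Microsoft SEAL implements. The primes are tiny and hard-coded for readability, so this sketch is not secure, and all function names are hypothetical.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Toy Paillier key generation using the common simplification g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n (Python 3.8+)
    return (n, n + 1), (lam, mu)


def encrypt(pub, m: int) -> int:
    n, g = pub
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pub, priv, c: int) -> int:
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(pub, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    n, _ = pub
    return (c1 * c2) % (n * n)


def scale_encrypted(pub, c: int, k: int) -> int:
    """Raising a ciphertext to k multiplies the plaintext by the public scalar k."""
    n, _ = pub
    return pow(c, k, n * n)


pub, priv = keygen(1009, 1013)  # tiny demo primes, not secure

# The compute node sees only ciphertexts for the inputs 3 and 5, yet can
# evaluate the weighted sum 2*3 + 4*5 with its own plaintext weights.
c_a, c_b = encrypt(pub, 3), encrypt(pub, 5)
c_sum = add_encrypted(pub, scale_encrypted(pub, c_a, 2), scale_encrypted(pub, c_b, 4))
result = decrypt(pub, priv, c_sum)
print(result)
```

Even in this toy, each plaintext multiply-add becomes modular exponentiation on much larger integers, and a scheme like this cannot evaluate the nonlinear parts of a transformer at all. Fully homomorphic schemes can, which is where the large slowdowns come from.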
Secure multi-party computation (SMPC) distributes the computation such that no single party holds enough information to reconstruct the inputs. Projects like CrypTen explore this for machine learning. The communication overhead between parties makes it impractical for the kind of fine-grained per-layer communication that distributed inference requires.
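The core building block can be sketched with additive secret sharing in plain Python. This does not use CrypTen's API; the modulus, the party count, and the function names are assumptions chosen for illustration.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus for the toy arithmetic


def share(value: int, parties: int = 3) -> list:
    """Split value into random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares: list) -> int:
    return sum(shares) % MODULUS


def add_shares(a: list, b: list) -> list:
    """Each party adds its own two shares locally; no communication needed."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]


def scale_shares(a: list, k: int) -> list:
    """Each party multiplies its share by a public constant locally."""
    return [(x * k) % MODULUS for x in a]


x_shares = share(20)
y_shares = share(22)

total = reconstruct(add_shares(x_shares, y_shares))
scaled = reconstruct(scale_shares(x_shares, 3))
print(total, scaled)
```

Addition and multiplication by public constants are free of communication, as shown. Multiplying two secret values is not: it requires an interactive protocol between the parties, and a transformer performs such multiplications in every layer, which is where the overhead accumulates.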
Trusted Execution Environments (TEEs) offer a more pragmatic path. A TEE, such as an Intel SGX enclave or an AMD SEV-SNP confidential virtual machine, creates a hardware-isolated memory region that even the host operating system or hypervisor cannot read. Computation inside the TEE is attested: the requester can verify cryptographically that a specific, unmodified piece of code is running. The problem for Darkbloom is that Apple Silicon does not expose a general-purpose TEE suitable for running large model inference. The Secure Enclave on Apple chips is a separate processor designed for key operations and small cryptographic tasks, not for running transformer blocks on multi-gigabyte weights.
Attestation with transparency, which is what Apple built into Private Cloud Compute for Apple Intelligence, sidesteps full cryptographic privacy and instead provides auditability. Measurements of every production software build are published in an append-only, tamper-evident transparency log. Independent researchers can inspect the published builds and check that they behave as claimed. Users’ devices verify a node’s remote attestation against that log before sending data. A node operator therefore cannot deploy software that logs prompts without that modification being detectable. This is not mathematically private, but it is a serious operational privacy guarantee when the software stack is genuinely open to inspection.
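The client-side check in such a scheme can be sketched as follows. This is not Apple's protocol: the function names are hypothetical, the transparency log is reduced to a set of digests, and an HMAC stands in for the hardware-backed asymmetric signature a real attestation quote would carry.

```python
import hashlib
import hmac
import secrets


def measure(software_image: bytes) -> str:
    """Digest that identifies one exact software build."""
    return hashlib.sha256(software_image).hexdigest()


def sign_quote(attestation_key: bytes, measurement: str, nonce: bytes) -> bytes:
    """Stand-in for a hardware-signed quote (assumption: HMAC, not a real
    asymmetric signature chained to a vendor certificate)."""
    return hmac.new(attestation_key, measurement.encode() + nonce, hashlib.sha256).digest()


def client_should_send(quote: bytes, claimed_measurement: str, nonce: bytes,
                       attestation_key: bytes, transparency_log: set) -> bool:
    """Send the prompt only if the quote verifies for this nonce AND the
    attested build appears in the public log."""
    expected = sign_quote(attestation_key, claimed_measurement, nonce)
    return hmac.compare_digest(quote, expected) and claimed_measurement in transparency_log


audited_build = b"inference-server v1 (no prompt logging)"
modified_build = b"inference-server v1 + prompt logger"
transparency_log = {measure(audited_build)}

key = secrets.token_bytes(32)
nonce = secrets.token_bytes(16)  # fresh per request, prevents replayed quotes

honest_ok = client_should_send(
    sign_quote(key, measure(audited_build), nonce),
    measure(audited_build), nonce, key, transparency_log)
modified_ok = client_should_send(
    sign_quote(key, measure(modified_build), nonce),
    measure(modified_build), nonce, key, transparency_log)
print(honest_ok, modified_ok)
```

The check is only as strong as whatever produces the measurement. If the signing key lives in hardware that measures what actually booted, a modified build cannot present the audited digest. If the node's own software reports the measurement, it can claim anything.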
For a network of consumer Macs not running Apple’s custom secure OS stack, replicating that attestation model is possible in principle but harder in practice. The node operator controls their own machine. They can run whatever software they want. Without a hardware root of trust that enforces which code executes, an attestation scheme depends on the operator not modifying their installation, which is a social guarantee, not a cryptographic one.
The architecturally interesting question for Darkbloom is which of these trade-offs it has chosen, and how explicitly it communicates those trade-offs to users. “Private” in the context of distributed inference can legitimately mean several different things: private from the network provider but not from compute nodes, private from compute nodes through layer partitioning but not from a sophisticated aggregation attack, or private in the operational sense through auditable software with no logging. Each of these is a real form of privacy for some threat model. None of them is absolute.
Regardless of how the privacy question resolves, Darkbloom sits at an interesting convergence of trends. Apple Silicon’s performance per watt is compelling for inference workloads that fit in memory. The MLX framework, which Apple released in late 2023, has matured into a fast and ergonomic path for running quantized models on Apple hardware, and its unified memory model means weights and activations share the same physical memory pool without PCIe transfer overhead. A network of M-series Macs is not trying to compete with a rack of H100s on throughput; it is competing on cost, availability, and the regulatory and trust profile of not sending data to a hyperscaler.
There is a real market for inference that does not touch AWS or Google Cloud, driven by compliance requirements, sensitive domain data, and simple preference. Local inference with tools like Ollama addresses that for users who have powerful enough personal machines. Darkbloom’s bet is that the same privacy-motivated users would be willing to trust a distributed network of Macs if the privacy guarantees are clearly specified and defensible.
That bet might be right. But the project will be judged by how clearly it draws the line between what its architecture actually protects and what it does not. The hardware story is easy to believe. The privacy story requires reading the fine print.