
The Hard Part of Private Inference on Idle Macs

Source: hackernews

The premise behind Darkbloom is compact enough to fit in a tweet: use idle Mac hardware to serve LLM inference requests in a way that doesn’t expose user prompts to a centralized provider. The pitch is clean. The engineering underneath it is not.

At its core, Darkbloom is a distributed inference network where Mac owners contribute their machine’s spare compute cycles, and users submit prompts that get routed to those nodes. The privacy claim is what differentiates it from a simple compute marketplace: the network is designed so that the node running your inference doesn’t learn what you asked, and the company operating the network doesn’t see your prompts either. Whether that claim holds up to scrutiny depends heavily on the cryptographic mechanisms behind it.

Why Apple Silicon in Particular

The choice of Macs as the compute substrate is not arbitrary. Apple Silicon chips, from the M1 onward, have a unified memory architecture that makes them unusually capable at LLM inference relative to their thermal envelope and price.

In a conventional setup, the CPU and GPU have separate memory pools. Running a large language model means either fitting the weight tensors in GPU VRAM or shuttling them between system RAM and VRAM, and VRAM is both expensive and scarce. A high-end NVIDIA RTX 4090 has 24GB of VRAM. At 4-bit quantization that holds a model of roughly 30B parameters with some room for context, but not a 70B model, whose weights alone take about 35GB. A Mac Studio with M2 Ultra can be configured with up to 192GB of unified memory, accessible to both the CPU and GPU at up to 800 GB/s of bandwidth. The entire weight set for a 70B parameter model in 4-bit quantization fits comfortably, with room remaining for context.
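The sizing argument above is simple arithmetic, sketched here in Python. The function names are illustrative, and the estimate covers weights only: it ignores the KV cache, activations, and runtime overhead, so real headroom is smaller. The throughput figure is a rough ceiling that assumes single-stream decoding reads every weight once per generated token.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on single-stream decode speed.

    Assumes each generated token requires reading all weights once,
    so memory bandwidth is the limiting factor.
    """
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    for label, params in [("13B", 13), ("70B", 70)]:
        size = weight_gb(params, bits_per_weight=4)
        fits_24 = size < 24    # RTX 4090 VRAM
        fits_192 = size < 192  # top M2 Ultra unified memory configuration
        ceiling = decode_ceiling_tokens_per_s(800, size)
        print(f"{label}: {size:.1f} GB weights, fits 24GB={fits_24}, "
              f"fits 192GB={fits_192}, ceiling at 800 GB/s={ceiling:.0f} tok/s")
```

Measured throughput will land below the ceiling, but the ratio explains why memory capacity and bandwidth matter more than raw compute for this workload.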

Apple also ships MLX, a machine learning framework built specifically for Apple Silicon that targets the CPU and the GPU through a unified memory model. The Apple Neural Engine (ANE) is reached through Core ML rather than MLX and is not a general-purpose compute unit, but for certain operations common in transformer inference it can offload work at very low power consumption. On a machine sitting idle overnight and plugged into mains power, that matters for operating cost.

llama.cpp added Metal backend support in mid-2023, which means the broad ecosystem of quantized GGUF models covering most open-weight releases can run natively on Mac GPUs without additional framework overhead. Projects like Ollama have built on this foundation to make local inference trivially easy to deploy.

Darkbloom’s underlying bet is that there are enough M1-and-later Macs in the world, sitting idle at desks and plugged into power overnight, that the aggregate capacity is commercially meaningful. Given that Apple has shipped Macs at a rate on the order of 20 million a year since the M1 launched in late 2020, and a meaningful fraction of those are 16GB or higher configurations, the raw capacity argument holds.

The Privacy Problem

Running inference on a third-party machine creates a fundamental problem: the machine executing the computation can read the input. If you submit a prompt to a Darkbloom node, the operator of that node could log it.

Privacy-preserving computation has several theoretical solutions, none of which come cheap.

Fully homomorphic encryption (FHE) allows computation on encrypted data. The node processes ciphertext and returns an encrypted result that only the client can decrypt; the node never sees plaintext. The problem is that FHE carries computational overhead of several orders of magnitude relative to plaintext execution. Companies like Zama are actively pushing FHE toward practical performance, but running transformer inference under FHE at acceptable latency remains beyond current techniques for non-trivial models.

Secure multi-party computation (SMPC) allows a group of nodes to jointly compute a function on private inputs such that no single node learns the full input. This works in theory, but requires substantial inter-node communication on every layer of the network. For a transformer with 80 layers distributed across geographically separated hardware, the round-trip latency on each layer’s communication step accumulates quickly. Throughput suffers severely.
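A minimal sketch of the idea, using additive secret sharing, which is one common SMPC building block and not necessarily what any particular network uses. The modulus, party count, round count, and round-trip time are all illustrative assumptions. Linear steps with public weights can be computed on shares locally; non-linear steps need interaction between parties, which is where the latency in the second function comes from.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative modulus for share arithmetic


def share(value: int, parties: int) -> list[int]:
    """Split value into additive shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares: list[int]) -> int:
    """Recombine all shares; any strict subset looks uniformly random."""
    return sum(shares) % MODULUS


def scale_locally(shares: list[int], public_weight: int) -> list[int]:
    """Multiplying by a public weight is linear, so no communication is needed."""
    return [(s * public_weight) % MODULUS for s in shares]


def communication_latency_s(layers: int, rounds_per_layer: int, rtt_ms: float) -> float:
    """Network wait time alone, if every interactive round costs one round trip."""
    return layers * rounds_per_layer * rtt_ms / 1000


if __name__ == "__main__":
    shares = share(42, parties=3)
    print(reconstruct(scale_locally(shares, public_weight=7)))
    # Assumed figures: 80 layers, 2 interactive rounds each, 50 ms round trip.
    print(communication_latency_s(layers=80, rounds_per_layer=2, rtt_ms=50))
```

Under those assumed figures the network wait alone is measured in seconds per forward pass, before any computation, and autoregressive decoding repeats that pass for every generated token.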

Trusted Execution Environments (TEEs) such as Intel SGX or AMD SEV create hardware-enforced enclaves where code executes in isolation from the host operating system. The host machine cannot inspect enclave memory even with root access, and remote attestation lets the client cryptographically verify that the expected code is running in a genuine enclave before sending any data. This is the approach taken by projects like Edgeless Systems and Gramine for confidential computing workloads.
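The attestation handshake can be sketched as a toy in Python. This is not the SGX or SEV protocol: real TEEs sign quotes with asymmetric keys rooted in a vendor certificate chain, whereas this sketch substitutes a shared HMAC key (`HARDWARE_KEY`, a hypothetical stand-in) so that it runs on the standard library alone. All function names are illustrative. What it shows is the order of operations: the client checks the code measurement and a fresh nonce before releasing any prompt.

```python
import hashlib
import hmac
import secrets

# Hypothetical stand-in for a hardware-rooted signing key.
HARDWARE_KEY = secrets.token_bytes(32)


def measure(code: bytes) -> bytes:
    """Hash of the code loaded into the enclave (the 'measurement')."""
    return hashlib.sha256(code).digest()


def enclave_quote(loaded_code: bytes, nonce: bytes) -> tuple[bytes, bytes]:
    """Enclave side: bind the measurement to the client's nonce and authenticate it."""
    measurement = measure(loaded_code)
    tag = hmac.new(HARDWARE_KEY, measurement + nonce, hashlib.sha256).digest()
    return measurement, tag


def client_accepts(expected_code: bytes, nonce: bytes, quote: tuple[bytes, bytes]) -> bool:
    """Client side: accept only an authentic quote for the expected code and this nonce."""
    measurement, tag = quote
    expected_tag = hmac.new(HARDWARE_KEY, measurement + nonce, hashlib.sha256).digest()
    authentic = hmac.compare_digest(tag, expected_tag)
    right_code = hmac.compare_digest(measurement, measure(expected_code))
    return authentic and right_code


if __name__ == "__main__":
    audited = b"inference-server build A"
    nonce = secrets.token_bytes(16)
    print(client_accepts(audited, nonce, enclave_quote(audited, nonce)))
    print(client_accepts(audited, nonce, enclave_quote(b"build with logging", nonce)))
```

The nonce prevents a node from replaying an old quote, and the measurement check is what ties the guarantee to a specific, auditable build of the inference code.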

Macs do not have SGX or AMD SEV. The Apple Secure Enclave handles key storage and biometric operations but is not a general-purpose confidential compute environment for arbitrary ML workloads. This is a meaningful constraint, because TEEs represent the most practical path to verifiable privacy on commodity hardware.

How Model Sharding Reduces Exposure

One approach that avoids the full cost of FHE or SMPC is to shard the model across multiple nodes such that no single node holds enough context to reconstruct the prompt cleanly. Petals, developed at HSE University and Yandex Research, pioneered this approach for models too large for any single consumer machine: each node runs a subset of transformer layers, and the computation flows through the chain sequentially.

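Sharding reduces the attack surface but does not eliminate it. A node running layers 20 through 40 of an 80-layer model sees intermediate hidden states, not raw input tokens. Recovering the original prompt from mid-network activations is possible in principle if the adversary also holds the preceding layers’ weights, which for open-weight models are public, but it requires deliberate effort rather than passive logging. The node holding the first layers is the exception: it receives the token embeddings, which map directly back to the prompt.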
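A toy pipeline makes the exposure pattern concrete. This is a sketch under stated assumptions, not Petals’ or Darkbloom’s implementation: `toy_layer` is a placeholder transform standing in for a transformer layer, and the even split of layers across nodes is an assumed policy.

```python
def partition_layers(num_layers: int, num_nodes: int) -> list[range]:
    """Assign contiguous layer ranges to nodes as evenly as possible."""
    base, extra = divmod(num_layers, num_nodes)
    ranges, start = [], 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def toy_layer(index: int, hidden: list[float]) -> list[float]:
    """Placeholder for a transformer layer; any deterministic transform will do."""
    return [0.5 * h + index for h in hidden]


def run_pipeline(embedded_prompt: list[float], num_layers: int, num_nodes: int):
    """Run layers in sequence and record what each node receives as input."""
    received = []
    hidden = list(embedded_prompt)
    for node_id, layers in enumerate(partition_layers(num_layers, num_nodes)):
        received.append((node_id, list(hidden)))
        for i in layers:
            hidden = toy_layer(i, hidden)
    return hidden, received


if __name__ == "__main__":
    _, received = run_pipeline([1.0, 2.0], num_layers=80, num_nodes=4)
    for node_id, seen in received:
        print(node_id, seen)
```

Node 0 receives the embedded prompt unchanged, while later nodes receive only transformed hidden states. That asymmetry is why the placement of the first shard matters for any privacy argument built on sharding.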

For smaller models that fit on a single Mac, sharding is no longer needed for capacity, so splitting the model purely for privacy adds coordination overhead and latency that would otherwise be avoided. The tradeoff is between model capacity, inference latency, and the privacy guarantee you can actually deliver.

Verification and the Node Trust Problem

Any distributed network that uses third-party compute faces the same question: how do you verify that a node actually ran the inference honestly rather than returning garbage or a cached response?

zkML is an emerging approach that uses zero-knowledge proofs to prove that a computation was executed with specific inputs and model weights, without revealing the inputs to the verifier. EZKL and related tools can generate proofs for neural network inference. The problem is that proof generation is substantially slower than the inference itself, and proof size scales with model complexity. Proving a forward pass through a 7B parameter model is expensive, and the resulting proof is non-trivial to verify.

Without cryptographic verification, the network falls back to reputation systems and economic penalties. Reputation-based trust works at aggregate scale and over time, but it doesn’t protect any individual inference request from a malicious or compromised node.
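The gap between aggregate and per-request protection can be put in numbers. The sketch below assumes a hypothetical scheme in which the coordinator independently re-checks a random fraction of responses (`audit_rate`) and always catches a bad response when it checks one. Neither the scheme nor the rate comes from Darkbloom; they are placeholders to make the arithmetic concrete.

```python
def undetected_probability(audit_rate: float, bad_responses: int) -> float:
    """Chance that a node returning bad_responses bad results is never audited.

    Assumes each response is audited independently with probability audit_rate
    and that an audited bad response is always detected.
    """
    return (1.0 - audit_rate) ** bad_responses


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, undetected_probability(audit_rate=0.05, bad_responses=n))
```

With an assumed 5% audit rate, a node that misbehaves constantly is caught almost surely within a few hundred requests, but a node that misbehaves once escapes detection 95% of the time. Reputation bounds the long-run behavior of nodes, not the fate of a single prompt.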

This is the gap that Darkbloom’s architecture has to bridge. The specific mechanism they use matters enormously. If nodes are attested through a software-level trust model rather than hardware TEEs, the privacy guarantee is contractual rather than cryptographic.

Practical Threat Model

For a developer choosing between local inference with Ollama, a cloud API like OpenAI or Anthropic, and a distributed network like Darkbloom, the comparison maps roughly onto a spectrum of trust relationships.

Local inference via Ollama is maximally private: your prompt never leaves the machine. Latency is bounded by your hardware. Model size is bounded by your memory. For a 16GB MacBook Pro, you’re running 8B or 13B parameter models at practical quality.

Cloud APIs offer access to frontier models at low marginal cost. Your prompts are transmitted over TLS and processed on the provider’s infrastructure. You trust the provider’s data handling policies and security posture.

Darkbloom sits between those points. The network offers capacity beyond what your own hardware provides, without centralizing all traffic through a single provider. Whether that’s meaningfully better for privacy depends on the architecture: you shift trust from one large company to a distributed set of node operators and the network coordinator.

Many applications handle sensitive but non-regulated data where a user would prefer prompts not be stored by a cloud provider, but doesn’t require cryptographic proof of privacy. Draft contracts, personal financial queries, health questions without clinical identifiers, and business strategy discussions all fit that category. For those use cases, a well-designed distributed network with strong contractual and operational privacy controls might be sufficient.

The Broader Ecosystem

Darkbloom is not the only project in this space. Exo is an open-source distributed inference framework that runs across clusters of consumer devices including Macs, handling model sharding and device discovery without a centralized coordinator. The key difference is that Exo is designed for your own hardware. You trust all the nodes because you own them, so the node-operator privacy problem does not arise.

Providers like Together AI and Fireworks AI offer cheap inference on centralized GPU clusters. They are fast and operationally simple, but the privacy model is that of any cloud provider.

Darkbloom’s commercial opportunity is in the gap between those options, provided the privacy claims are backed by verifiable mechanisms that hold up to independent scrutiny. Apple Silicon makes the compute side of this problem tractable in a way it wasn’t before 2021. The M-series chips turned consumer Macs into capable inference hardware at price points that make voluntary compute contribution economically plausible for node operators.

The privacy side of the problem remains genuinely hard, and how Darkbloom resolves the TEE gap on macOS is the technical question worth watching. If the answer is hardware attestation via a novel mechanism, that’s interesting engineering. If the answer is contractual assurances and monitoring, that’s a business proposition rather than a cryptographic guarantee, and the distinction matters to anyone thinking carefully about their threat model.
