
What It Actually Takes to Run Autonomous Research Across a Peer-to-Peer Network

Source: lobsters

The Two Hard Problems That Collide

Distributed computing and automated research are both well-studied areas, but combining them creates problems that neither field has fully solved on its own. The community.computer project surfaced in a recent Lobsters thread with a deceptively simple framing: a peer-to-peer network where nodes collaboratively perform automated research. The idea sounds natural in 2026, when LLM-based agents can draft literature reviews, generate hypotheses, and synthesize information from multiple sources. But making that work across untrusted, heterogeneous nodes requires solving coordination, verification, and incentive problems that go well beyond what a single-machine research agent deals with.

It helps to separate the two systems and understand what each one demands.

Distributed Compute: What BOINC Got Right and Wrong

The canonical model for volunteer distributed computing is BOINC, the Berkeley Open Infrastructure for Network Computing, which has been running since 2002. It powered SETI@home until that project stopped distributing work in 2020, and it still hosts projects such as Einstein@Home and Rosetta@home (Folding@home, often grouped with them, runs on its own client software). BOINC’s design philosophy is centralized task distribution over a decentralized compute pool. A project server slices work into workunits, sends them to volunteers, and collects results. Redundancy is handled by sending the same workunit to multiple nodes and comparing outputs; if a majority agree, the result is accepted. This works well for embarrassingly parallel scientific computations where the task is deterministic and the output is a number or a small structured result.

BOINC’s weak point for autoresearch is that research tasks are not embarrassingly parallel and not easily verified by comparison. If two nodes independently summarize a paper, their outputs won’t be byte-identical. If one node performs a web search and the other performs the same search ten minutes later, the results may differ. The output of a research task is a natural language artifact, and there’s no simple checksum that tells you whether it’s correct.
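The majority-comparison scheme, and the way it breaks down for free-text outputs, can be sketched in a few lines. This is a minimal illustration and not BOINC's actual validator code; the function name `validate_workunit` and the `min_quorum` parameter are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def validate_workunit(results: Sequence[str], min_quorum: int = 2) -> Optional[str]:
    """Return the agreed result if a strict majority of replicas match.

    `results` holds the outputs returned by different volunteers for the
    same workunit. Returns None when no value reaches both a strict
    majority and the minimum quorum size (an assumed policy).
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    if count >= min_quorum and count * 2 > len(results):
        return value
    return None


# Deterministic numeric task: two of three replicas agree, so it validates.
numeric = validate_workunit(["3.14159", "3.14159", "2.71828"])

# Two honest summaries of the same paper are worded differently, so exact
# comparison finds no agreement even though neither node misbehaved.
summaries = validate_workunit(["The paper shows X.", "This paper demonstrates X."])
```

Exact matching is what makes the check cheap, and it is also why it cannot be reused unchanged for natural language results.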

Petals, which grew out of the BigScience research workshop, took a different approach to distributed AI compute. Instead of distributing independent tasks to independent nodes, Petals pipelines a single large model across multiple nodes: each node holds a contiguous block of transformer layers, and forward passes flow through the pipeline in sequence. This lets you run models like BLOOM-176B across consumer hardware that couldn’t individually hold the full model. The coordination challenge here is latency rather than correctness: a slow node in the pipeline stalls every request that passes through it, and Petals handles this with dynamic routing that bypasses degraded servers.
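A toy model makes the latency argument concrete. This is a simplification and not how Petals itself selects servers: it assumes each pipeline stage has a fixed set of candidate servers with a known latency, and the names `Server`, `pick_route`, and `pipeline_latency` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Server:
    name: str
    latency_ms: float      # measured per-request latency (illustrative numbers)
    healthy: bool = True


def pick_route(stages: List[List[Server]]) -> List[Server]:
    """Pick the fastest healthy server for each pipeline stage."""
    route = []
    for index, candidates in enumerate(stages):
        alive = [s for s in candidates if s.healthy]
        if not alive:
            raise RuntimeError(f"no healthy server for stage {index}")
        route.append(min(alive, key=lambda s: s.latency_ms))
    return route


def pipeline_latency(route: List[Server]) -> float:
    """A sequential forward pass pays every stage's latency in turn."""
    return sum(s.latency_ms for s in route)


stages = [
    [Server("a", 40.0), Server("b", 25.0)],
    [Server("c", 30.0, healthy=False), Server("d", 90.0)],
]
route = pick_route(stages)
total = pipeline_latency(route)
```

Because the stages run in sequence, the one slow server in the second stage dominates the total no matter how fast the first stage is.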

Neither the BOINC model nor the Petals model maps cleanly onto collaborative autoresearch. BOINC assumes deterministic, verifiable tasks. Petals assumes a single coordinated model inference. Autoresearch involves multi-step workflows where intermediate outputs are stochastic, where a node might need to make a decision about what to research next, and where quality is difficult to assess without domain expertise.

The Coordination Problem in a Trustless Setting

A P2P research network without a central server has to solve Byzantine fault tolerance at the workflow level, not just the data level. In a classic BFT system, nodes agree on a value from a set of proposals under the assumption that fewer than one-third of nodes are malicious. In a research network, the failure mode isn’t just a node returning wrong answers; it’s a node returning plausible-sounding wrong answers that are hard to distinguish from correct ones without reading them carefully.

This is why verification is the core architectural problem. A few approaches exist in adjacent systems:

Optimistic execution with challenge periods. This is the model used by Ethereum’s optimistic rollups. A node submits a result and it’s provisionally accepted; a challenge window opens during which any other node can dispute the result and trigger a re-execution. For financial transactions, re-execution is deterministic and disputes are mechanically resolvable. For research outputs, re-execution produces a different output, and evaluating which one is better requires human judgment or a trusted referee.
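The lifecycle of a provisionally accepted result can be sketched as a small state machine. This is an illustration under assumptions, not any rollup's actual protocol: the `Submission` class, its field names, and the use of plain floating-point timestamps are all hypothetical, and dispute resolution is deliberately left out because that is the part with no mechanical answer for research outputs.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    PENDING = "pending"
    DISPUTED = "disputed"
    FINAL = "final"


@dataclass
class Submission:
    result_id: str
    submitted_at: float        # seconds on any shared clock
    challenge_window: float    # seconds the result stays disputable
    status: Status = Status.PENDING
    challengers: List[str] = field(default_factory=list)

    def _window_closed(self, now: float) -> bool:
        return now >= self.submitted_at + self.challenge_window

    def challenge(self, node_id: str, now: float) -> bool:
        """Record a dispute if the window is still open."""
        if self.status is Status.FINAL or self._window_closed(now):
            return False
        self.challengers.append(node_id)
        self.status = Status.DISPUTED
        return True

    def finalize(self, now: float) -> bool:
        """Accept the result once the window closes with no dispute."""
        if self.status is Status.PENDING and self._window_closed(now):
            self.status = Status.FINAL
        return self.status is Status.FINAL
```

A disputed submission never finalizes in this sketch; a real network would need a referee step at that point.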

Proof-of-useful-work. Prime Intellect’s PRIME distributed training framework uses gradient verification: a node claiming to have performed a training step must produce gradients that are consistent with the published model state and the published batch of data. This can be checked mechanically by recomputing the step, within floating-point tolerance, because a training step is reproducible given the data and the initial weights. Research synthesis doesn’t have this property, because there’s no canonical correct answer to verify against.

Reputation systems with stake. A node that consistently produces low-quality outputs loses reputation and eventually gets deprioritized or excluded. This works in markets with well-defined quality signals, like prediction markets or professional services with reviews. For research tasks, quality is fuzzy and may not be apparent until the research is applied downstream.

The honest answer is that no existing system has fully solved trustless quality verification for natural language outputs at scale. The most practical current approach is probably hybrid: a P2P network for compute and coordination, with occasional oracle checks where a trusted node or a high-quality model evaluates a sample of outputs for quality.
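The hybrid approach can be sketched as random spot checks feeding a reputation score. This is one possible shape, not a tested design: the `referee` callable stands in for the trusted node or evaluator model, and the function names, sample rate, pass score, reward, and penalty are all illustrative assumptions.

```python
import random
from typing import Callable, Dict, Optional


def spot_check(
    outputs: Dict[str, str],
    referee: Callable[[str], float],
    sample_rate: float = 0.1,
    pass_score: float = 0.5,
    seed: Optional[int] = None,
) -> Dict[str, bool]:
    """Audit a random sample of outputs with a trusted referee.

    `outputs` maps node id -> submitted text. `referee` returns a quality
    score in [0, 1]. Returns node id -> passed, for audited nodes only.
    """
    if not outputs:
        return {}
    rng = random.Random(seed)
    k = min(len(outputs), max(1, round(len(outputs) * sample_rate)))
    audited = rng.sample(sorted(outputs), k)
    return {node: referee(outputs[node]) >= pass_score for node in audited}


def update_reputation(
    reputation: Dict[str, float],
    verdicts: Dict[str, bool],
    reward: float = 0.05,
    penalty: float = 0.2,
) -> None:
    """Nudge audited nodes up or down, clamped to [0, 1]."""
    for node, passed in verdicts.items():
        current = reputation.get(node, 0.5)   # unknown nodes start neutral
        delta = reward if passed else -penalty
        reputation[node] = min(1.0, max(0.0, current + delta))
```

The penalty is larger than the reward here so that a node cannot cheaply recover from a failed audit; whether that asymmetry is right depends on how noisy the referee is.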

What libp2p Gives You Out of the Box

libp2p is the networking stack that IPFS, Ethereum’s consensus layer, and a growing number of distributed systems use as a foundation. It handles peer discovery via DHT (Kademlia), NAT traversal, multiplexed streams over a single connection, and pluggable transports (TCP, QUIC, WebTransport). For a P2P research network, libp2p provides the plumbing so you don’t have to reinvent hole-punching or peer routing.

A research node joining the network would roughly follow this sequence:

1. Bootstrap by connecting to a set of well-known peers (hardcoded or DNS-based)
2. Announce capability via DHT (e.g., "I can run 8B-parameter models at X tokens/sec")
3. Subscribe to a pubsub topic for task announcements
4. Receive task description, download context from IPFS CID
5. Execute the research subtask locally
6. Publish result back to the task's coordination address
7. Optionally receive a stake-weighted reward
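The messages exchanged in steps 3 through 6 might look something like the following. This is a hypothetical wire format, not one defined by libp2p or by the project: the dataclass names, field names, JSON encoding, and placeholder CID strings are all assumptions, and the actual pubsub transport is left out.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class TaskAnnouncement:
    """Hypothetical payload a node receives on the task topic."""
    task_id: str
    instruction: str
    context_cids: List[str]   # addresses of the input material to download
    reply_to: str             # coordination address for the result


@dataclass
class TaskResult:
    """Hypothetical payload the node publishes back."""
    task_id: str
    worker: str
    output_cid: str
    derived_from: List[str]   # the context CIDs actually used


def encode(message) -> bytes:
    # sort_keys gives a stable byte encoding, so peers could hash or sign it.
    return json.dumps(asdict(message), sort_keys=True).encode("utf-8")


def decode_task(raw: bytes) -> TaskAnnouncement:
    return TaskAnnouncement(**json.loads(raw.decode("utf-8")))


task = TaskAnnouncement(
    task_id="task-001",
    instruction="Summarize the attached paper",
    context_cids=["<cid-of-paper>"],   # placeholder, not a real CID
    reply_to="results/task-001",
)
received = decode_task(encode(task))
result = TaskResult(task.task_id, "peer-A", "<cid-of-summary>", received.context_cids)
```

Carrying `derived_from` in the result is what would let another node later check which inputs a claim was based on.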

The GossipSub protocol, which libp2p uses for pub/sub messaging, is efficient for propagating task announcements and results across a large network without requiring every node to maintain connections to every other node. A node maintains connections to a mesh of around 6-12 peers, and messages propagate with high reliability through gossip.

The content-addressing in IPFS is genuinely useful for research contexts. If a task requires processing a particular paper or dataset, the content can be addressed by its hash. Multiple nodes requesting the same content can fetch it from any node that has it, without trusting a central source. This also makes the research provenance auditable: a result can reference the exact content CIDs it was derived from.
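The core idea fits in a short sketch. Note the simplification: real IPFS CIDs wrap the digest in multihash and multibase encodings, whereas this example uses a plain hex SHA-256 digest as a stand-in, and `ContentStore` is a hypothetical in-memory class.

```python
import hashlib
from typing import Dict


def content_address(data: bytes) -> str:
    """Stand-in for a CID: the hex SHA-256 digest of the content."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """In-memory store keyed by content hash."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        addr = content_address(data)
        self._blocks[addr] = data
        return addr

    def get(self, addr: str) -> bytes:
        data = self._blocks[addr]
        # Re-hash on read so a tampered or corrupted block is rejected.
        # This check is what lets a node fetch from peers it does not trust.
        if content_address(data) != addr:
            raise ValueError("content does not match its address")
        return data


store = ContentStore()
paper = store.put(b"full text of a paper")

# A result records the exact inputs it was derived from.
result = {"summary": "placeholder summary", "derived_from": [paper]}
```

Identical bytes always map to the same address, so two nodes that ingest the same paper independently end up referring to it by the same identifier.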

Autoresearch: What the Workflow Actually Looks Like

The “auto” in autoresearch implies autonomous multi-step execution: a system that can decompose a research question, identify what it needs to know, fetch or generate that information, synthesize it, and iterate. This is the architecture that systems like AutoGPT attempted and that more recent agent frameworks like LangGraph and the Anthropic agent SDK have put on firmer footing.

In a distributed context, the workflow decomposition becomes a coordination problem. If a research question is broken into subtasks, those subtasks have dependencies: synthesizing findings requires that the individual retrieval tasks have completed. This is a DAG execution problem, similar to what systems like Ray or Dask solve for data pipelines. In a P2P setting without a central scheduler, the DAG has to be encoded into the task descriptions themselves, with each task specifying its inputs by CID and its output by a name agreed in advance. (A true CID is a hash of the content, so it cannot be promised before the result exists; the task would instead name a key that resolves to the result’s CID once it is published.) Nodes can pick up tasks whose inputs are already available.
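Scheduler-free DAG execution of this kind reduces to a simple readiness rule. The sketch below is an assumption-laden illustration: `Task`, `ready_tasks`, and `run_all` are hypothetical names, input and output identifiers are treated as opaque strings, and a single loop stands in for many nodes polling independently.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class Task:
    task_id: str
    inputs: FrozenSet[str]   # identifiers that must exist before the task can run
    output: str              # identifier under which the result will be published


def ready_tasks(tasks: Iterable[Task], available: Set[str]) -> List[Task]:
    """Tasks whose inputs are all published and whose output is not yet."""
    return [t for t in tasks if t.inputs <= available and t.output not in available]


def run_all(tasks: List[Task], available: Set[str]) -> List[str]:
    """Simulate nodes draining the DAG; returns task ids in execution order."""
    done = set(available)
    order: List[str] = []
    while True:
        batch = ready_tasks(tasks, done)
        if not batch:
            break
        for task in batch:
            order.append(task.task_id)
            done.add(task.output)
    return order


tasks = [
    Task("synthesize", frozenset({"out:a", "out:b"}), "out:summary"),
    Task("retrieve-a", frozenset({"doc1"}), "out:a"),
    Task("retrieve-b", frozenset({"doc2"}), "out:b"),
]
order = run_all(tasks, {"doc1", "doc2"})
```

The synthesis task is listed first but runs last, because ordering falls out of which inputs exist rather than from any central plan.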

One architectural choice that matters a lot is where the model runs. A heterogeneous network will have nodes with very different capabilities: some with consumer GPUs capable of running Mistral 7B or LLaMA 3 8B, some with larger GPU clusters capable of running 70B models, some with only CPU and RAM. Task routing that matches task complexity to node capability is non-trivial and is essentially a scheduling problem. The Petals approach of model sharding helps here for tasks that require a single large model, but research pipelines often want to use different models for different subtasks: a fast small model for initial retrieval and filtering, a larger model for synthesis and judgment.
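One simple routing policy is best fit: give each subtask to the smallest node that can still run it, so large nodes stay free for work only they can do. This is just one heuristic among many and ignores load, latency, and trust; the `Node`, `Subtask`, and `assign` names and the parameter counts are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    node_id: str
    max_params_b: float   # largest model the node can serve, in billions


@dataclass(frozen=True)
class Subtask:
    name: str
    min_params_b: float   # smallest model judged adequate for the subtask


def assign(subtask: Subtask, nodes: List[Node]) -> Optional[Node]:
    """Best fit: the smallest node that still meets the requirement."""
    capable = [n for n in nodes if n.max_params_b >= subtask.min_params_b]
    return min(capable, key=lambda n: n.max_params_b, default=None)


nodes = [Node("cpu-only", 1.0), Node("consumer-gpu", 8.0), Node("cluster", 70.0)]
filtering = assign(Subtask("filter search results", 7.0), nodes)
synthesis = assign(Subtask("synthesize findings", 70.0), nodes)
```

Here the filtering subtask lands on the consumer GPU and synthesis on the cluster, which mirrors the small-model-then-large-model pipeline described above.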

What’s Actually Interesting About This Direction

The reason a P2P autoresearch network is more compelling now than it would have been five years ago is that capable open-weight models exist. Using GPT-4 itself means paying OpenAI for API access. Running Qwen2.5-72B or Llama 3.3 70B requires hardware, but that hardware can be distributed across volunteer nodes using Petals-style model sharding, and the models are capable enough to carry out multi-step research tasks.

The second reason is that IPFS-style content addressing makes research artifacts first-class objects in the network. Earlier distributed research systems were hampered by the problem of context: a node can’t do meaningful research if it doesn’t have access to the same documents, codebases, and prior outputs as the rest of the network. Content-addressed storage, combined with a well-seeded IPFS cluster, makes this tractable.

The third reason is that the incentive layer is more mature. Filecoin and similar protocols have demonstrated that you can pay nodes for storage and retrieval using token-based incentives, with cryptographic proofs of service. Adapting this to research compute, where the proof of service is harder to construct, is an open problem, but the payment infrastructure exists.

The gap between the idea and a working system is still wide. Quality verification, Byzantine-fault-tolerant workflow coordination, and economic incentive design for non-deterministic tasks are all unsolved at production scale. But the components are all available and mature enough to build on: libp2p for networking, content addressing for provenance, open-weight models for inference, and existing agent frameworks for workflow. A serious engineering effort in this direction is more plausible today than it has ever been.
