From Benchmark Math to Research Problems: What OpenAI's First Proof Submissions Actually Measured

A month ago, OpenAI published proof attempts submitted by their AI model to the First Proof math challenge, a competition testing research-grade mathematical reasoning on expert-level problems. The framing is modest: these are submitted attempts, not claimed solutions. That framing is worth taking seriously, because the gap between submitting a proof attempt and having a verified proof is where most of the interesting questions about AI mathematics currently live.

For context, consider where the benchmarks stood heading into this. OpenAI’s o3 model, announced in December 2024, scored 96.7% on AIME 2024 at high compute, effectively saturating the American Invitational Mathematics Examination. On FrontierMath, a benchmark of research-level problems developed by Epoch AI specifically because standard competition benchmarks had become too easy for frontier models, o3 reached roughly 25%, a substantial jump over its predecessors but still far short of what an expert mathematician would manage on the same problems. Those two numbers describe the problem: competition mathematics is largely solved by the best reasoning models; research mathematics is not.

The distinction between the two is not simply that research problems are harder. It is that they are structurally different. Competition problems are closed: a well-posed statement exists, a solution exists, the solution path is checkable by anyone who finds it. Research problems may be open in more fundamental ways. The statement itself may require judgment to formalize correctly. The solution path may involve constructions or techniques that do not yet exist in any known literature. And the question of whether a submitted proof is actually correct may not be answerable without significant human expert review.

Informal Proofs and What They Establish

DeepMind’s AlphaProof, which together with AlphaGeometry 2 solved four of six IMO 2024 problems at silver-medal level, worked in a paradigm where this ambiguity does not arise. AlphaProof worked from problems formalized in Lean 4 (the IMO problems themselves were translated by hand), ran reinforcement learning over tactic sequences, and accepted only proofs that type-checked against Lean 4’s kernel. The kernel is a comparatively small C++ program implementing a variant of the Calculus of Inductive Constructions. If the kernel accepts the proof, the formalized statement follows from the axioms in use. There is no review step for the argument itself and no ambiguity about whether it has a gap; what remains for humans is confirming that the formal statement matches the intended problem. The proof either compiles or it does not.
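
As a minimal illustration of what "type-checks against the kernel" means, here is a sketch in Lean 4. The theorem names are hypothetical and the statements are deliberately trivial; nothing here is taken from AlphaProof's output. Each declaration pairs a statement (the type) with a proof (the term), and Lean accepts the declaration only if the term really has that type.

```lean
-- Hypothetical names; toy statements chosen only to show the mechanism.

-- The proof term `rfl` is accepted because `n + 0` reduces to `n`
-- by the definition of addition on `Nat`.
theorem add_zero_example (n : Nat) : n + 0 = n := rfl

-- A proof of `q ∧ p` built from the two components of a proof of `p ∧ q`.
theorem and_swap_example (p q : Prop) (h : p ∧ q) : q ∧ p :=
  ⟨h.2, h.1⟩
```

If either proof term were wrong, for instance `⟨h.1, h.2⟩` in the second theorem, the file would fail to compile. That accept-or-reject behavior is what the informal setting lacks.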

OpenAI’s o3 and its successors work differently. They produce natural language proofs: mathematical arguments written in standard mathematical prose, structured the way a mathematician would structure a paper. These are informal proofs. They can be extraordinarily sophisticated, capturing genuine mathematical insight, and they can also contain gaps that are subtle enough to evade superficial review. The model does not have access to a mechanical verifier that accepts or rejects what it generates. Extended chain-of-thought reasoning, where the model spends many more tokens on internal reasoning before producing output, helps substantially: the AIME and Putnam scores reflect this. But the output is still natural language.

For competition mathematics, this distinction is partly academic. A claimed solution to an AIME problem is either right or wrong, and checking it takes a few minutes. For research mathematics, the distinction matters more. An informal proof of a genuinely hard result requires human expert review before it counts as established. That review can take months, and the reviewers may disagree. This is not a new problem; it is how mathematics has always worked. What changes when AI generates the proof is the volume of attempts and the difficulty of assessing their origin.

What Research-Grade Problems Demand

The First Proof challenge is positioned at the research end of this spectrum. Expert-level problems in mathematics often require three things that competition problems do not: the ability to identify useful intermediate lemmas that are not suggested by the problem statement, the ability to connect techniques from different subfields, and the ability to recognize when an approach is definitively not working and abandon it productively. The FrontierMath benchmark, designed with input from professional mathematicians, was built so that its problems could not be looked up, reverse-engineered from format, or solved by scaling memorization. They required genuine construction.

Submitting attempts to such a challenge reveals something that benchmark scores do not. A benchmark measures accuracy over a distribution of problems. A challenge submission is an argument for a specific conclusion, complete enough to be evaluated on its own terms by an expert. The question for the First Proof submissions is not whether the model scores above some threshold; it is whether the arguments it produced are mathematically coherent on the problems they addressed.

OpenAI has been careful, in how they describe these results, not to conflate attempt quality with verified correctness. That caution is appropriate. A natural language proof attempt that is 90% correct is not a proof; it is a proof sketch with a gap. Whether the gaps in any specific submission are fillable, or whether they indicate a fundamental flaw in the argument, requires the kind of expert review that does not scale automatically.

The Verification Question

The tension here is not unique to OpenAI. It runs through the entire question of how AI-generated mathematics gets incorporated into the record. The Lean 4 kernel approach, which Leanstral and AlphaProof both use, resolves it by construction: the verifier is the record, and proofs either pass or do not. But Lean 4 proofs require formalization first, and formalization of a research-level result is a substantial task in itself. The Liquid Tensor Experiment, which formalized a central theorem from Dustin Clausen and Peter Scholze’s condensed mathematics in Lean 3, required roughly a year and a half of coordinated community effort even with the mathematical argument largely understood.

Informal proof generation sidesteps formalization cost but reintroduces the question of who checks the result. For research math, the answer is the same as it has always been: working mathematicians, spending days or weeks evaluating whether an argument holds up. The difference is that the argument was produced by a machine, which raises a different set of questions about how to approach the review process. An expert reviewer who finds a gap in a human mathematician’s submitted proof knows roughly how the mathematician was likely thinking and can often diagnose whether the gap is fixable or fatal. The same reviewer, evaluating an AI-generated argument, is working with a system whose reasoning process is less transparent, even when the chain-of-thought traces are published.

Publishing the proof attempts alongside the reasoning traces, as OpenAI did, is part of an answer to this. It gives reviewers something to evaluate beyond the final argument, including the reasoning paths that were considered and rejected. Whether that is sufficient for research-level work is an empirical question that depends on what the attempts actually look like on a case-by-case basis.

What the Trajectory Looks Like

The relevant comparison for these submissions is not AlphaProof, which operates in a different paradigm. The relevant comparison is to what the same model would have produced on the same problems a year earlier. FrontierMath showed a step-change in research-level reasoning between o1 and o3; the First Proof submissions are evidence about whether that step-change extends to problems that require constructing novel arguments rather than solving well-defined computations.

Two things could follow from a positive evaluation of these submissions by domain experts. First, the AI community would have a better calibration point for what informal proof generation can do at research level. Second, there would be a practical question about formalization: if the arguments are sound, someone could formalize them in Lean 4, producing a machine-verifiable record. The informal proof generates the insight; formalization converts it into something the community can depend on without review. That division of labor is not how mathematics currently works, but it is plausible as a workflow once the informal proofs at this level are reliable enough to be worth the formalization investment.
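
One way such a formalization effort could proceed is incrementally, with unproved steps marked explicitly. The Lean 4 sketch below is a toy example under that assumption; the theorem name and the statement are hypothetical and unrelated to any First Proof problem. The `sorry` placeholder lets the overall structure type-check while Lean flags the declaration as depending on an unproved step.

```lean
-- Hypothetical toy example: the outline is checked, but one step is left open.
theorem sketch_with_gap (a b : Nat) (h : a ≤ b) : a ≤ b + 1 := by
  -- Gap: this intermediate claim is asserted, not proved.
  have step : b ≤ b + 1 := sorry
  -- The rest of the argument is verified relative to `step`.
  exact Nat.le_trans h step
```

Lean compiles this with a warning that the declaration uses `sorry`, so the gap cannot be overlooked. Here it closes easily by replacing `sorry` with `Nat.le_succ b`; in a research-level formalization, whether each such gap is fillable is exactly what the expert review has to settle.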

The First Proof submissions are a data point in an ongoing process of calibrating what AI systems can produce at the research end of mathematics. The benchmark era, where competition problems provided the signal, has largely run its course. What comes next is messier and more dependent on expert judgment, but it is also closer to the question that actually matters: whether AI can contribute to mathematics that humans have not yet figured out on their own.