What CTF Competitions Lose When AI Captures the Flag

A commenter on a recent Lobsters thread described supervising their alma mater’s Capture the Flag competition and watching students use generative AI tools to solve challenges without any real engagement with the underlying problems. The thread collected similar reports from across technical creative domains, but the CTF observation deserves separate treatment because it illustrates a more specific kind of damage.

Using AI to capture a CTF flag undermines competition fairness, but more fundamentally it removes the pedagogical mechanism that justified running the competition in the first place.

What CTFs Actually Are

Capture the Flag competitions emerged from the DEF CON hacking contest in 1996 and grew into the standard training ground for security professionals. The modern Jeopardy format, in which each isolated challenge on a scoreboard returns a unique flag string when solved, became dominant in the early 2000s. The ecosystem formalized over time: picoCTF at Carnegie Mellon for beginners, CTFtime.org as the aggregator for hundreds of competitions worldwide, and DEF CON CTF Finals as the sport’s elite tier.

What made CTFs work as a learning tool was specific. You encounter a challenge in a domain you may know incompletely, whether that is binary exploitation, web security, cryptography, reverse engineering, or forensics. You have no instructions beyond the challenge file and a server endpoint. You struggle, search, form hypotheses, and test them. When you solve it, you have built something transferable, not just retrieved a fact. The writeup culture that grew around CTFs, publishing detailed solutions after the competition closes, created one of the best bodies of applied security knowledge on the internet.

The flag was a proxy measurement for the competence built in the process of getting it.

What the Research Says About AI and CTF Solving

Researchers at the University of Illinois Urbana-Champaign published a series of papers in 2024 that quantified what practitioners were already observing. In their work on LLM agents exploiting one-day vulnerabilities, and a follow-up on teams of LLM agents exploiting zero-day vulnerabilities, they tested GPT-4 with agentic scaffolding, including tool access to a Linux shell, a Python interpreter, and a web browser, against real CTF challenges.

Against easy challenges from CTFtime, GPT-4 solved 87 of 91. Against harder, competition-grade challenges from 2024 CTFs, the success rate dropped to roughly 13 percent. The categories where AI performs best overlap directly with the categories most central to security education: classical cryptography, textbook SQL injection and XSS patterns, and basic binary exploitation with well-known technique signatures. The hard challenges that AI still fails on tend to be novel, requiring reasoning about custom implementations or chains of vulnerabilities that do not pattern-match to training data.

The InterCode-CTF benchmark, published at NeurIPS 2023, showed GPT-4 solving about 26 percent of 100 picoCTF challenges in an interactive loop. picoCTF targets high school students and early undergraduates, the entry point of the learning pipeline. That is the population the Lobsters commenter was watching.

One technical detail matters when evaluating any policy response: raw chat completions do not achieve these numbers. The high solve rates require agentic setups where the model executes code, reads output, iterates, and retries. A student using Cursor or Claude with computer use capability is not just asking a question; they are running an agent loop against the challenge server. The distinction between a chat query and a tool-equipped agent is meaningful, and it complicates any attempt to draw a line between permitted and prohibited assistance.
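To make the distinction concrete, here is a minimal Python sketch of the execute, read, iterate, retry loop described above. Everything in it is an assumption made for illustration: `run_agent_loop`, `scripted_model`, and `fake_sandbox` are hypothetical names, the model is replaced by a hard-coded script, the sandbox is a dictionary lookup rather than a real shell, and the `flag{...}` pattern is only a common convention. A real setup would call an LLM API and run commands in an isolated environment.

```python
import re
from typing import Callable, List, Optional, Tuple

# Assumed flag convention; real competitions choose their own format.
FLAG_PATTERN = re.compile(r"flag\{[^}]+\}")

Transcript = List[Tuple[str, str]]  # (command, output) pairs seen so far


def run_agent_loop(
    propose_command: Callable[[Transcript], str],
    execute: Callable[[str], str],
    max_steps: int = 5,
) -> Optional[str]:
    """Ask for a command, run it, feed the output back, stop when a flag appears."""
    transcript: Transcript = []
    for _ in range(max_steps):
        command = propose_command(transcript)   # the "model" sees all prior output
        output = execute(command)               # tool access: run it and capture output
        transcript.append((command, output))
        match = FLAG_PATTERN.search(output)
        if match:
            return match.group(0)
    return None  # step budget exhausted without finding a flag


def scripted_model(transcript: Transcript) -> str:
    """Stand-in for an LLM: walks a fixed plan, one step per loop iteration."""
    plan = ["ls", "cat notes.txt", "cat .hidden/flag.txt"]
    return plan[min(len(transcript), len(plan) - 1)]


def fake_sandbox(command: str) -> str:
    """Stand-in for a shell: canned outputs for the three commands above."""
    canned = {
        "ls": "notes.txt",
        "cat notes.txt": "check the .hidden directory",
        "cat .hidden/flag.txt": "flag{example_only}",
    }
    return canned.get(command, "command not found")
```

A single chat completion corresponds to one call to `propose_command` with an empty transcript. The loop is what lets each output shape the next command, and that loop is the part a policy would have to describe.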

Why the Learning Event Is What Breaks

CTF challenges work pedagogically because they require connecting multiple pieces of knowledge in non-obvious ways. A web challenge might require understanding how a specific PHP type coercion behavior interacts with a JWT implementation detail. A binary exploitation challenge might require understanding the heap layout implications of a particular glibc version’s allocator. The connection itself, the act of bridging those two domains under time pressure, is what builds judgment.

When a student uses AI to solve that challenge, one of two things happens. Either the AI has seen similar patterns in training and produces a solution the student can copy without understanding, or it fails and the student is no better off. In neither case does the student build the understanding the challenge was designed to produce.

This differs from using a debugger or a decompiler. Those tools make the analysis possible; they do not perform the analysis. The standard analogy is calculators in math class, but it does not hold. A calculator handles arithmetic while the student still determines what calculation to set up and why. An AI agent can identify the vulnerability class, write the exploit, and return the flag, leaving the student with a point on the scoreboard and nothing transferable.

The Hardening Arms Race

Challenge developers are responding with technically interesting approaches. The core observation is that LLMs fail on material outside their training distribution, so organizers are deliberately designing outside it.

One strategy is novelty: building challenges around CVEs disclosed after known model training cutoffs, or constructing entirely custom cryptographic protocols that match no named cipher in the literature. Another is multi-stage chaining, where each flag serves as key material for the next stage, requiring sequential interactive execution that is computationally expensive for agent loops. Some challenge authors have added timing constraints: the server requires a response within 100 milliseconds, making multi-round LLM API calls impractical within the window.
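The chaining and timing ideas can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any competition's actual design: `keystream_xor`, `build_chain`, `unlock_next`, `respond`, and `TimedGate` are hypothetical names, the cipher is a toy built from SHA-256, and the only figure taken from the text is the 100 millisecond window.

```python
import hashlib
import hmac
import os
import time
from typing import Callable, Dict, List


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. Toy cipher, illustration only."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(4, "big")).digest())
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


def build_chain(flags: List[str]) -> List[bytes]:
    """Encrypt each flag after the first under a key derived from its predecessor."""
    blobs = []
    for previous, following in zip(flags, flags[1:]):
        key = hashlib.sha256(previous.encode()).digest()
        blobs.append(keystream_xor(key, following.encode()))
    return blobs


def unlock_next(previous_flag: str, blob: bytes) -> str:
    """Recover the next stage's flag; only works with the correct previous flag."""
    key = hashlib.sha256(previous_flag.encode()).digest()
    return keystream_xor(key, blob).decode()


def respond(flag: str, nonce: bytes) -> bytes:
    """Client side of the timed check: prove possession of a flag for this nonce."""
    return hmac.new(flag.encode(), nonce, hashlib.sha256).digest()


class TimedGate:
    """Server side: accept a correct response only inside a short window."""

    def __init__(self, required_flag: str, window_seconds: float = 0.1,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flag = required_flag
        self._window = window_seconds
        self._clock = clock  # injectable so the window can be tested without sleeping
        self._issued: Dict[bytes, float] = {}

    def issue(self) -> bytes:
        nonce = os.urandom(16)
        self._issued[nonce] = self._clock()
        return nonce

    def verify(self, nonce: bytes, response: bytes) -> bool:
        issued_at = self._issued.pop(nonce, None)  # each nonce is single use
        if issued_at is None:
            return False
        if self._clock() - issued_at > self._window:
            return False  # too slow, for example a multi-round API call
        return hmac.compare_digest(respond(self._flag, nonce), response)
```

In this sketch a solver cannot skip ahead, because stage two's blob is unreadable without stage one's flag, and a client that needs several network round trips to decide on its response will miss the window even when its answer is correct.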

Platforms like CryptoHack have moved harder community challenges to private or invite-only tiers after observing that GPT-4-class models solve virtually all of their standard difficulty cryptography problems with minimal effort. The educational material most accessible to AI was also, by design, the most pedagogically foundational.

The most structurally resilient format is attack-defense, which is how DEF CON CTF Finals already operates. Teams defend live services while attacking identical services run by opposing teams. The real-time adversarial component, the required team coordination, and the need to patch vulnerabilities in code you have just received are all difficult for current AI agents to handle at competition speed. Organizers of the Finals have pointed to this format distinction as a structural safeguard rather than an explicit AI policy, which is a reasonable position: the format was already better than Jeopardy at measuring real security skill.

All of these mitigations push challenge design toward novelty and away from fundamentals. The classic educational challenges (RSA small-exponent attacks, CBC padding oracles, Diffie-Hellman subgroup attacks) can no longer be scored meaningfully at any competition that permits unrestricted AI access. These are exactly the challenges that built foundational intuition for a generation of security practitioners.
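For a sense of how compact these classic challenges are, here is a sketch of the RSA small-exponent attack in Python. It assumes textbook RSA with e = 3, no padding, and a message short enough that its cube is smaller than the modulus, so the ciphertext is simply the integer cube of the message. The names `integer_root`, `textbook_encrypt`, and `small_exponent_attack` are hypothetical, and the demo key reuses two published field primes (from Curve448 and NIST P-384) only to avoid prime generation. No real key would be built this way.

```python
from typing import Optional

# Hypothetical demo key. Both primes are congruent to 2 mod 3, so e = 3 is a
# valid public exponent. Reusing public constants as RSA primes is insecure
# and is done here only to keep the example short.
P = 2**448 - 2**224 - 1                      # Curve448 field prime
Q = 2**384 - 2**128 - 2**96 + 2**32 - 1      # NIST P-384 field prime
N = P * Q
E = 3


def integer_root(n: int, k: int) -> int:
    """Return the largest integer r with r**k <= n, found by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // k + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** k <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def textbook_encrypt(message: bytes) -> int:
    """Unpadded RSA encryption: c = m**e mod n."""
    return pow(int.from_bytes(message, "big"), E, N)


def small_exponent_attack(ciphertext: int, e: int = E) -> Optional[bytes]:
    """Recover the message when m**e < n, so the modular reduction never happened."""
    root = integer_root(ciphertext, e)
    if root ** e != ciphertext:
        return None  # not a perfect e-th power, so the precondition does not hold
    return root.to_bytes((root.bit_length() + 7) // 8, "big")
```

Calling `small_exponent_attack(textbook_encrypt(b"flag{cube_root}"))` returns the original bytes without touching the private key. The reasoning a student has to supply is one observation: a short unpadded message never wraps around the modulus. A challenge that turns on a single well-documented observation like this one is the kind the section describes as within easy reach of current models.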

The Broader Pattern

The Lobsters thread collected reports from across technical creative domains, and they share a common structure. Hobbyist programming projects feel hollow when you reach for AI completion before you have had a chance to struggle with the problem. Stack Overflow’s question volume dropped substantially between 2022 and 2024 as developers redirected first-line questions to AI tools, which means the public record of human problem-solving is accumulating more slowly. Open source maintainers including curl’s Daniel Stenberg have written about the increased triage burden from AI-generated bug reports and patches that look plausible but do not hold up under review.

In each case, the difficulty of the activity was not incidental to its value. Hobbyist programming built intuition through struggle. Stack Overflow’s questions built a searchable knowledge base because humans were doing the searching and articulating their confusion. Open source contribution built developer skill through the friction of working in someone else’s codebase under review. Removing the friction does not preserve the activity; it replaces it with something that produces the artifact without the development.

The specific harm at the intermediate skill level deserves attention. The stretch of roughly 1,000 to 5,000 hours of deliberate practice that CTFs uniquely supported, where a novice becomes someone capable of real security work, is precisely where AI assistance is most effective. Beginner knowledge is accessible through documentation. Expert knowledge requires novel reasoning that current models cannot reliably supply. The middle, where most professional development actually happens, is the most disrupted.

What Survives

The knowledge does not disappear from the world, but it becomes rarer and harder to develop. CTF competitions will likely bifurcate. Controlled, in-person, attack-defense formats at the elite tier will remain meaningful because the environment constrains AI assistance structurally rather than by policy. Beginner-tier Jeopardy competitions will drift toward AI-assisted score farming, and the learning pipeline that fed the professional security field will thin.

The students the Lobsters commenter was watching will collect flags. They are unlikely to do as well in the technical interviews those flags were supposed to signal readiness for, because those interviews will not provide an AI agent endpoint. That feedback loop is longer and slower than a CTF weekend, and it falls outside any single organizer’s scope to fix.

What challenge developers can do, and are doing, is design for the aspects of security work that current AI cannot replace: novel adversarial reasoning, judgment under ambiguity, and lateral thinking that does not pattern-match to training data. Those challenges are harder to write and harder to solve. They are also more representative of the actual job, which is a reasonable place for the difficulty to end up.