Mythos Found Thousands of Zero-Days. Then Cloudflare Discovered It Wasn't Enough. -- SafeStackAI

The Most Powerful Security Model Ever Released

In April 2026, Anthropic released Claude Mythos Preview, a frontier model that crossed a threshold no AI had reached before: autonomous vulnerability discovery at expert level.

The benchmarks tell the story. On CyberGym, the standard for measuring vulnerability reproduction, Mythos scored 83.1% compared to 66.6% for Opus 4.6. When pointed at Firefox 147, it generated 181 working shell exploits where Opus 4.6 produced 2.

In red team testing, Mythos found a 27-year-old TCP vulnerability in OpenBSD and a 16-year-old FFmpeg bug that had survived 5 million automated tests. It autonomously chained multiple vulnerability primitives into working exploits, combining techniques like use-after-free bugs with arbitrary read/write capabilities and control flow hijacking.

"The reasoning it shows along the way looks like the work of a senior researcher rather than the output of an automated scanner." — Cloudflare

Anthropic restricted access through Project Glasswing to a launch cohort of twelve organizations, including AWS, Apple, CrowdStrike, Google, Microsoft, and NVIDIA. The results from those deployments revealed something unexpected.

Then Cloudflare Tried Pointing It at Their Code

Cloudflare tested Mythos Preview across 50+ internal repositories over several weeks.

The testing confirmed a pattern that multi-agent architectures were already built to address: a single agent, no matter how capable, covers only a fraction of a large codebase before its context fills up.

The reasons are structural. Context windows limit how much code a single model can reason about at once. Single-stream thinking creates blind spots across parallel attack vectors. And the model develops bias toward attack classes where it has had prior success, leaving entire categories under-examined.

The model's capability is real, but the architecture around it determines whether that capability translates into actual security coverage.

Architecture Over Raw Capability

Cloudflare's solution was to build a custom multi-agent pipeline that orchestrates Mythos rather than deploying it as a single agent.

The pipeline has eight stages:

Recon — An agent reads the repository top-down, fans out to sub-agents per subsystem, and produces architecture docs with build commands, trust boundaries, and entry points.
Hunt — Roughly 50 agents run concurrently, each hunting a specific attack class in a specific scope. Each has access to compile and run proof-of-concept tools.
Validate — An independent agent re-reads the code and tries to disprove each finding. It uses a different prompt and cannot generate new findings.
Gapfill — Hunters flag under-covered areas for another pass, counteracting model drift toward successful attack classes.
Dedupe — Findings with the same root cause collapse to a single record.
Trace — For findings in shared libraries, a tracer agent fans out across consumer repositories using a cross-repo symbol index to determine if attacker-controlled input can reach the bug.
Feedback — Reachable traces become new hunt tasks in consumer repositories. The pipeline improves as it runs.
Report — An agent writes structured output against a predefined schema, fixing its own validation errors.
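As a rough illustration of the Hunt and Dedupe stages, the sketch below fans out one task per (attack class, scope) pair and collapses findings that share a root cause. The class names, the `root_cause` key format, and the sample data are hypothetical and are not taken from Cloudflare's pipeline.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class HuntTask:
    attack_class: str
    scope: str  # e.g. a subsystem path identified during recon


@dataclass(frozen=True)
class Finding:
    root_cause: str  # hypothetical key: "file:function:bug-class"
    attack_class: str
    scope: str


def plan_hunt(attack_classes, scopes):
    """Create one narrow task per (attack class, scope) pair."""
    return [HuntTask(a, s) for a, s in product(attack_classes, scopes)]


def dedupe(findings):
    """Collapse findings with the same root cause, keeping the first seen."""
    unique = {}
    for finding in findings:
        unique.setdefault(finding.root_cause, finding)
    return list(unique.values())


tasks = plan_hunt(["use-after-free", "integer-overflow"], ["net/", "parser/"])
findings = [
    Finding("parser/lex.c:next_token:oob-read", "integer-overflow", "parser/"),
    Finding("parser/lex.c:next_token:oob-read", "use-after-free", "parser/"),
    Finding("net/conn.c:close_conn:uaf", "use-after-free", "net/"),
]
print(len(tasks), len(dedupe(findings)))  # 4 2
```

Keeping each task narrow is what lets many hunters run concurrently without any one of them exhausting its context on the whole repository.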

The principle underlying the architecture: deliberate disagreement between agents outperforms single-agent self-review. Placing a second agent between the initial finding and the output queue catches noise that a single agent checking its own work would miss.
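One way to picture a validator that can reject findings but never create them is a pure filter over the hunter's output. This is a minimal sketch under that assumption; `stub_validator` stands in for an independent model call with a different prompt, and every name here is illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Finding:
    title: str
    evidence: str


# A validator returns True when it fails to disprove the finding.
Validator = Callable[[Finding], bool]


def validate(findings: Iterable[Finding], validator: Validator) -> list[Finding]:
    """Return only the findings the validator could not disprove.

    The result is filtered from the input, so the validator can drop
    findings but has no way to introduce new ones.
    """
    return [f for f in findings if validator(f)]


def stub_validator(finding: Finding) -> bool:
    # Placeholder for an independent review: a finding survives only if
    # it carries some concrete evidence.
    return bool(finding.evidence.strip())


candidates = [
    Finding("UAF in close_conn", "PoC crashes under a sanitizer build"),
    Finding("Possible overflow in parser", ""),
]
survivors = validate(candidates, stub_validator)
print([f.title for f in survivors])  # ['UAF in close_conn']
```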

SafeStackAI: Built on the Same Principles

SafeStackAI uses a 6-agent pipeline where each agent has a defined role. Findings are promoted only when they survive adversarial review, with an 85% confidence threshold before any finding reaches the final report.

The mapping to Glasswing's architecture is direct:

Pipeline Comparison

Glasswing Stage   | SafeStackAI Agent                           | Purpose
------------------|---------------------------------------------|---------------------------------------------------
Recon             | Threat Modeling + Entrypoint Discovery      | Map architecture, attack surface, and entry points
Hunt (~50 agents) | Security Auditor                            | Vulnerability hunting with scoped focus
Validate          | Exploitability Analyst + Security Reviewer  | Adversarial review, disprove findings
Gapfill           | Autonomous Coordinator                      | Iterative feedback until confidence threshold met
Dedupe            | Deduplication + 85% confidence gate         | Collapse duplicates, promote only validated findings
Report            | Technical Writer                            | Structured, queryable output

The Exploitability Analyst and Security Reviewer serve as adversarial validators. The Security Auditor identifies potential vulnerabilities. Then the Exploitability Analyst determines whether each finding is actually exploitable given the contract's state and access patterns. The Security Reviewer independently attempts to disprove findings using a different analytical frame. Only findings that survive both reviews are promoted.
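The promotion rule described above can be sketched as a single predicate. The 85% threshold comes from the text; how the confidence score is actually computed is not stated, so the `confidence` field and the `Review` structure below are assumptions for illustration.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # from the article; the scoring method is assumed


@dataclass(frozen=True)
class Review:
    exploitable: bool   # Exploitability Analyst verdict
    disproved: bool     # Security Reviewer verdict
    confidence: float   # hypothetical combined score in [0, 1]


def should_promote(review: Review) -> bool:
    """Promote only findings that survive both reviews and clear the gate."""
    return (
        review.exploitable
        and not review.disproved
        and review.confidence >= CONFIDENCE_THRESHOLD
    )


print(should_promote(Review(True, False, 0.91)))  # True
print(should_promote(Review(True, False, 0.70)))  # False
print(should_promote(Review(True, True, 0.99)))   # False
```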

This is the same principle Cloudflare arrived at: an independent agent with a different prompt and no ability to generate new findings re-reads the code and tries to disprove each result.

Multiple Models, Language-Specific Tooling

Cloudflare's pipeline runs on a single model: Mythos. SafeStackAI takes a different approach. Each stage uses the model best suited to the task. Entrypoint discovery and threat modeling run on smaller, faster models that map attack surfaces efficiently. Vulnerability hunting deploys larger reasoning models where analytical depth matters. Review agents use a different provider entirely, ensuring independent judgment not biased by the hunting model's tendencies.
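A per-stage routing table is one simple way to express this. The provider and model names below are placeholders, not SafeStackAI's actual configuration; the check only illustrates the stated constraint that review stages should not share a provider with the hunting stage.

```python
# Hypothetical routing: stage -> (provider, model). All names are placeholders.
ROUTING = {
    "entrypoint_discovery": ("provider-a", "small-fast-model"),
    "threat_modeling": ("provider-a", "small-fast-model"),
    "vulnerability_hunting": ("provider-a", "large-reasoning-model"),
    "exploitability_analysis": ("provider-b", "review-model"),
    "security_review": ("provider-b", "review-model"),
}
REVIEW_STAGES = ("exploitability_analysis", "security_review")


def shared_provider_reviews(routing, hunt_stage="vulnerability_hunting",
                            review_stages=REVIEW_STAGES):
    """Return review stages that use the same provider as the hunter."""
    hunt_provider = routing[hunt_stage][0]
    return [s for s in review_stages if routing[s][0] == hunt_provider]


print(shared_provider_reviews(ROUTING))  # []
```

An empty list means every review stage runs on a different provider from the hunter; a non-empty list names the stages that would break that independence.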

The deeper difference is in tooling. SafeStackAI's agents don't scan code as text. They operate through 26+ specialized tools built for each supported language: tools that query the AST, resolve cross-function data flows, trace state mutations, and map inheritance hierarchies. These tools are what make the analysis faster and deeper than feeding raw source into a context window.

This is also why SafeStackAI doesn't support every language an LLM can read. The models understand dozens of languages. But understanding syntax is not the same as finding exploitable vulnerabilities. Each language needs its own tooling layer and its own vulnerability knowledge database, with known exploitation patterns, mitigation strategies, and language-specific attack vectors. Without that foundation, you get the same shallow results as asking a model to read code and guess.

Structure First, Reasoning Second

Research supports this approach. The VulTriage framework (2026) demonstrated a key insight: "Avoid asking LLM to directly judge raw source code." Feeding models structured representations (control flow graphs, data flow paths, AST context) produces better results than asking a model to read source text and guess.
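To make "structured representation" concrete, the sketch below uses Python's standard `ast` module to turn source text into a per-function summary of parameters and outgoing calls. It is only an illustration of the idea: the sample source and the summary format are invented here, and it does not represent VulTriage's or SafeStackAI's actual tooling.

```python
import ast

SOURCE = '''
def load(path):
    data = open(path).read()
    return parse(data)

def parse(data):
    return eval(data)
'''


def call_name(node: ast.Call) -> str:
    """Best-effort name for a call target."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return "<dynamic>"


def summarize(source: str) -> dict:
    """Map each function to its parameters and the calls it makes."""
    summary = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            calls = {call_name(c) for c in ast.walk(node)
                     if isinstance(c, ast.Call)}
            summary[node.name] = {
                "params": [a.arg for a in node.args.args],
                "calls": sorted(calls),
            }
    return summary


summary = summarize(SOURCE)
print(summary["load"]["calls"])   # ['open', 'parse', 'read']
print(summary["parse"]["calls"])  # ['eval']
```

A summary like this gives a model explicit facts to reason over, such as `parse` passing its argument to `eval`, rather than leaving it to infer them from raw text.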

The false positive data is telling. Developers stop trusting tools when false positive rates exceed roughly 30%. Multi-agent systems with structured code input achieve far lower false positive rates. The QASecClaw study showed multi-agent debate achieving an 88.6% reduction in false positives.

Call Flow Analysis: The Missing Context Layer

Cloudflare's Trace stage determines whether a flaw in a shared library is actually reachable by attacker-controlled input in consumer repositories. This addresses a gap that exists in most AI security systems: a bug that cannot be reached is not a vulnerability.

SafeStackAI's call flow analysis serves the same purpose through a three-phase process:

Call Graph Mapping — Trace every execution path from public and external functions. Map function chains, build recursive execution trees, and query the AST to understand the full scope of reachable code.
Purpose and Attack Surface Analysis — Identify what each function does from a business logic perspective. Map security-critical operations: fund transfers, access control checks, state mutations.
Context-Enriched Vulnerability Detection — Vulnerability agents receive the structural context from the first two phases. They perform targeted scanning with awareness of actual execution paths, cross-contract interactions, and cross-function data flows.

Without structural context, AI agents produce findings without knowing whether the execution path to the vulnerability exists. With it, every finding comes with the evidence of how it can be reached and exploited.
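The reachability question can be sketched as a breadth-first search over a call graph, starting from public entry points. The graph and function names below are made up for illustration; a real call graph would come from language-specific tooling.

```python
from collections import deque

# Hypothetical call graph: caller -> callees.
CALL_GRAPH = {
    "withdraw": ["check_auth", "transfer"],
    "deposit": ["update_balance"],
    "transfer": ["update_balance"],
    "check_auth": [],
    "update_balance": [],
    "legacy_migrate": ["unsafe_copy"],
    "unsafe_copy": [],
}
ENTRY_POINTS = ["withdraw", "deposit"]


def reachable_paths(graph, entry_points):
    """Map each reachable function to a shortest call path from an entry point."""
    paths = {entry: [entry] for entry in entry_points}
    queue = deque(entry_points)
    while queue:
        fn = queue.popleft()
        for callee in graph.get(fn, []):
            if callee not in paths:
                paths[callee] = paths[fn] + [callee]
                queue.append(callee)
    return paths


paths = reachable_paths(CALL_GRAPH, ENTRY_POINTS)
print(paths["transfer"])        # ['withdraw', 'transfer']
print("unsafe_copy" in paths)   # False
```

In this toy graph a flaw in `unsafe_copy` would be reported without a path, because nothing reachable from a public entry point calls it, while a flaw in `transfer` arrives with the call chain that reaches it.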

From Findings to Knowledge

Cloudflare's Feedback stage turns reachable traces into new hunt tasks in consumer repositories. The pipeline gets better as it runs.

SafeStackAI takes a similar approach through its vulnerability database. Each validated vulnerability feeds into a structured knowledge base keyed on threat modeling results. The database contains vulnerability patterns, exploit steps, affected code patterns, and recommended fixes.

This knowledge compounds. Patterns identified in one codebase inform the analysis of the next. The vulnerability database is not a static list of known CVEs. It is a growing body of threat intelligence that makes each subsequent analysis more effective.
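A knowledge base keyed on threat modeling results might look something like the sketch below. The record fields follow the four items listed above; the class names, the threat labels, and the sample entry are assumptions for illustration, not SafeStackAI's schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VulnPattern:
    name: str
    exploit_steps: tuple
    code_pattern: str
    recommended_fix: str


class KnowledgeBase:
    """Stores validated patterns under the threat they were found for."""

    def __init__(self):
        self._by_threat = defaultdict(list)

    def record(self, threat: str, pattern: VulnPattern) -> None:
        self._by_threat[threat].append(pattern)

    def lookup(self, threats) -> list:
        """Return known patterns relevant to a new codebase's threat model."""
        return [p for t in threats for p in self._by_threat.get(t, [])]


kb = KnowledgeBase()
kb.record("unauthorized fund transfer", VulnPattern(
    name="missing access check before transfer",
    exploit_steps=("call transfer directly", "skip the auth wrapper"),
    code_pattern="public function that moves funds without an auth check",
    recommended_fix="enforce the access check inside the transfer function",
))

# A later analysis with an overlapping threat model starts from prior patterns.
hints = kb.lookup(["unauthorized fund transfer", "denial of service"])
print([p.name for p in hints])  # ['missing access check before transfer']
```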

Beyond Single Codebases

Cloudflare's Trace and Feedback stages address a problem most security tools ignore: vulnerabilities that only exist at the boundary between systems. A flaw in a shared library matters only if attacker-controlled input can reach it from a consumer service. Cloudflare solves this by fanning out across consumer repositories with a cross-repo symbol index.
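A cross-repo trace can be pictured as a lookup from a library symbol to its call sites in consumer repositories, filtered to those that receive attacker-controlled input, with each hit becoming a new hunt task. The index contents, the `attacker_controlled` flag, and the task format below are hypothetical; real taint analysis is considerably harder than a boolean field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallSite:
    repo: str
    function: str
    attacker_controlled: bool  # hypothetical flag standing in for taint analysis


# Hypothetical cross-repo symbol index: library symbol -> consumer call sites.
SYMBOL_INDEX = {
    "libimg.decode_header": [
        CallSite("upload-service", "handle_upload", True),
        CallSite("thumbnail-cron", "rebuild_cache", False),
    ],
}


def trace(symbol, index):
    """Return call sites where attacker-controlled input reaches the symbol."""
    return [c for c in index.get(symbol, []) if c.attacker_controlled]


def feedback_tasks(symbol, index):
    """Turn each reachable trace into a new hunt task in the consumer repo."""
    return [
        {"repo": c.repo, "scope": c.function, "reason": f"reaches {symbol}"}
        for c in trace(symbol, index)
    ]


new_tasks = feedback_tasks("libimg.decode_header", SYMBOL_INDEX)
print(new_tasks)
```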

SafeStackAI is building toward the same goal at a broader scope. Most products are not a single codebase. They are multiple services, shared libraries, API contracts, and cloud infrastructure that interact in ways no single-repo analysis can capture. A vulnerability in your authentication service might only be exploitable through a specific API gateway configuration. A safe function in isolation becomes dangerous when called by an external service that passes unsanitized input.

SafeStackAI is developing a cross-service analyzer that performs threat modeling and vulnerability hunting across an entire product, not just isolated codebases. It maps how your applications interact, examines your cloud configuration alongside your application code, and identifies vulnerabilities that emerge from the integration of otherwise secure components. The goal is a security posture assessment of the product as a whole.

What Glasswing Validated

Cloudflare's experience with Mythos and Project Glasswing confirmed three principles that SafeStackAI's architecture was built on:

Multi-agent architecture outperforms single-model deployment. Even the most capable model needs orchestration, scoping, and parallel execution to achieve real coverage.
Adversarial validation catches what self-review misses. An independent agent with a different prompt trying to disprove findings filters noise that a single agent checking its own work would let through.
Structured code analysis is foundational. AI agents reason better when they work from ASTs, call graphs, and property graphs rather than raw source text.

The architecture matters more than the model. Cloudflare proved it with the most powerful security model ever released. SafeStackAI was built on the same conclusion.
