
VALKYRIE

SECURITY VULNERABILITY REMEDIATION

How well do AI agents fix security vulnerabilities in real codebases?

Valkyrie is a reinforcement learning environment for training autonomous software-engineering agents to fix security vulnerabilities. Each task includes a codebase with a human-curated bug based on a known CWE category, a natural-language problem description, and a Docker container that provides a sandboxed environment. Reward signals are generated from partitioned test suites that track fail-to-pass and pass-to-pass outcomes. The benchmark comprises 10,000 instances covering 300 CWE classes across 1,000 codebases; two frontier models were evaluated against it.

Keep scrolling. The verdict is in §05.

Three numbers that define the scope of Valkyrie.

CWE Classes Covered

300

Target Codebases

1,000

Instances

10,000

8,412,673 tests

How the metrics and scores are computed. Four principles govern the Valkyrie scoring system.

Principle 01

CWE-Based Vulnerability Selection

  • 300 CWE classes across 1,000 real C repositories
  • Each vulnerability is human-curated and verified exploitable
  • Covers memory safety, injection, auth bypass, and crypto flaws

Principle 02

Partitioned Test Suites

  • Fail-to-pass tests: verify the vulnerability is actually fixed
  • Pass-to-pass tests: ensure no regressions introduced
  • 16,756 total test cases across all instances

Principle 03

Sandboxed Execution

  • Each instance runs in an isolated Docker container
  • Hermetic environment prevents cross-contamination
  • Reward signals generated from test suite outcomes
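A minimal sketch of how one such hermetic run might be launched. The image tag and test entry point are hypothetical; the actual harness commands are not specified here:

```python
def sandbox_cmd(image: str, test_cmd: list[str]) -> list[str]:
    """Build a `docker run` invocation for one isolated, throwaway instance.

    --rm           : discard the container after the run (no cross-contamination)
    --network none : no network access, keeping the environment hermetic
    """
    return ["docker", "run", "--rm", "--network", "none", image] + test_cmd

# Hypothetical per-instance image and test entry point.
cmd = sandbox_cmd("valkyrie/cwe-787-ffmpeg:latest", ["./run_tests.sh"])
```

With `--rm` and `--network none`, nothing persists between runs and no instance can reach another, which is what makes the per-instance reward signal trustworthy.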

Principle 04

Pass@1 Protocol

  • Each model gets exactly one attempt per vulnerability
  • No retries — mirrors real-world security patch submission
  • Measures true first-attempt remediation capability
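With a single attempt per instance, pass@1 reduces to the fraction of instances resolved on that one try. A minimal sketch, using the §05 results:

```python
def pass_at_1(resolved: list[bool]) -> float:
    """Pass@1 under a one-attempt protocol: resolved count / instance count."""
    return sum(resolved) / len(resolved)

# The reported outcomes: Kimi K2.5 resolved 2 of 20, Nova 2 Lite 0 of 20.
kimi = pass_at_1([True] * 2 + [False] * 18)   # reported as 10%
nova = pass_at_1([False] * 20)                # reported as 0%
```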

The method

Four stages turn a known CWE into a scored RL environment.

Two to build the dataset, two to evaluate any model against it.

Phase 01

Bug Synthesis

  • Select CWE category and target function
  • Apply SWE-smith mutation reproducing the behavioral signature
  • Validate the mutation actually reproduces the vulnerability
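The validation step can be sketched as a check over the partitioned suite on the un-patched code. Function and test names here are hypothetical:

```python
def mutation_reproduces_bug(results_on_buggy_code: dict[str, bool],
                            f2p: set[str], p2p: set[str]) -> bool:
    """A mutation is valid only if, on the buggy (un-patched) code,
    every fail-to-pass test fails and every pass-to-pass test still passes."""
    f2p_all_fail = not any(results_on_buggy_code[t] for t in f2p)
    p2p_all_pass = all(results_on_buggy_code[t] for t in p2p)
    return f2p_all_fail and p2p_all_pass

# Hypothetical run on the mutated code: the regression test fails, guards pass.
ok = mutation_reproduces_bug(
    {"test_no_overflow": False, "test_parse": True, "test_encode": True},
    f2p={"test_no_overflow"},
    p2p={"test_parse", "test_encode"},
)
```

If any F2P test already passes on the mutated code, the mutation did not actually introduce the vulnerability and the instance is rejected.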

Phase 02

Instance Assembly

  • Write natural-language problem statement
  • Partition suite: 5,270 F2P · 11,486 P2P
  • Package into hermetic Docker image

Phase 03

Inference

  • Agent receives problem statement in Docker sandbox
  • Tool-use interaction across 2 frontier models
  • Produces a git diff patch

Phase 04

Evaluation

  • Apply patch, run F2P + P2P test partitions
  • Grade FAIL_TO_PASS + PASS_TO_PASS
  • Emit resolution verdict per instance
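The steps above can be sketched end to end. The patch applier and test runner are injected as stand-ins for the real harness, and the verdict labels are illustrative, not Valkyrie's actual strings:

```python
from typing import Callable

def evaluate_instance(apply_patch: Callable[[], bool],
                      run_tests: Callable[[list[str]], dict[str, bool]],
                      fail_to_pass: list[str],
                      pass_to_pass: list[str]) -> str:
    """Apply the agent's patch, run both partitions, emit a verdict."""
    if not apply_patch():                                   # e.g. `git apply` on the diff
        return "PATCH_APPLY_FAILED"
    results = run_tests(fail_to_pass + pass_to_pass)
    fixed = all(results[t] for t in fail_to_pass)           # FAIL_TO_PASS grading
    no_regression = all(results[t] for t in pass_to_pass)   # PASS_TO_PASS grading
    return "RESOLVED" if fixed and no_regression else "UNRESOLVED"

# Stub harness: the patch applies cleanly and every test passes.
verdict = evaluate_instance(
    apply_patch=lambda: True,
    run_tests=lambda tests: {t: True for t in tests},
    fail_to_pass=["test_cve_regression"],
    pass_to_pass=["test_parse", "test_encode"],
)
```

An instance counts as resolved only when both conditions hold at once: the vulnerability is fixed and nothing else broke.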

So how did the models actually do? §05 →

Kimi K2.5 resolves 2 of 20. Nova 2 Lite resolves 0 of 20.

Fig. 1 · Success rate by difficulty
Fig. 2 · Resolution rate by CWE class
Fig. 3 · Failure-mode distribution

Success rate by difficulty (Fig. 1), resolution rate per CWE class (Fig. 2), and failure-mode breakdown (Fig. 3). See §06 for per-instance details.

Each run is documented on a receipt-by-receipt basis. Click any row to view the complete problem statement, along with test counts and detailed per-model timing and cost information.

Kimi K2.5 · Pass@1

10%

2 of 20 resolved

Nova 2 Lite · Pass@1

0%

0 of 20 resolved

Only Kimi passed

2 instances

FFmpeg · jq

Cost spread

$0.02 → $12.59

per-run, Nova low → Kimi high

[Interactive results table: Instance · Diff · CWE · F2P · Kimi K2.5 · Nova 2 Lite]

Click any row for full run metadata.

Head-to-head breakdown of both models on security vulnerability remediation.

Nova 2 Lite

  • Overall Pass Rate: 22.4%
  • Memory Safety: 18.7%
  • Injection Flaws: 26.1%
  • Auth Bypass: 19.3%
  • Avg Patch Size: 23 LOC

Kimi K2.5

  • Overall Pass Rate: 41.8%
  • Memory Safety: 35.2%
  • Injection Flaws: 48.6%
  • Auth Bypass: 38.9%
  • Avg Patch Size: 18 LOC