Principle 01
CWE-Based Vulnerability Selection
- 300 CWE classes across 100 real C repositories
- Each vulnerability is human-curated and verified exploitable
- Covers memory safety, injection, auth bypass, and crypto flaws
How well do AI agents fix security vulnerabilities in real codebases?
§01 · Overview
Valkyrie is a reinforcement learning environment for training autonomous software engineering agents to fix security vulnerabilities. Each task includes a codebase with a human-curated bug drawn from a known CWE category, a natural-language problem description, and a Docker container that provides a sandboxed environment. Reward signals come from partitioned test suites that track fail-to-pass (F2P) and pass-to-pass (P2P) outcomes. Two frontier models were tested on 10,000 instances covering 300 CWE classes across 1,000 codebases.
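As a rough illustration of how a reward could be derived from the two test partitions, the sketch below scores one patched instance. It assumes a binary reward and a hypothetical `compute_reward` helper; neither the reward shape nor the function name comes from Valkyrie itself.

```python
from typing import Iterable, Mapping


def compute_reward(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    results: Mapping[str, bool],
) -> float:
    """Score one instance after the agent's patch has been applied.

    `results` maps a test name to True if that test passed on the patched
    code. A test missing from `results` is treated as a failure.
    Assumption: the reward is binary (1.0 resolved, 0.0 otherwise).
    """
    f2p = list(fail_to_pass)
    p2p = list(pass_to_pass)

    # The vulnerability counts as fixed only if every previously failing
    # test now passes. An empty F2P set proves nothing, so it scores 0.
    fixed = bool(f2p) and all(results.get(name, False) for name in f2p)

    # Existing behavior must be preserved: every previously passing test
    # must still pass.
    intact = all(results.get(name, False) for name in p2p)

    return 1.0 if fixed and intact else 0.0


if __name__ == "__main__":
    reward = compute_reward(
        fail_to_pass=["test_overflow_rejected"],
        pass_to_pass=["test_parse_valid_input"],
        results={"test_overflow_rejected": True, "test_parse_valid_input": True},
    )
    print(reward)
```

Requiring both partitions to pass means a patch that closes the hole by breaking existing functionality earns nothing.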
§02 · Key metrics
Three numbers that define the scope of Valkyrie.
CWE Classes Covered
300
Target Codebases
1,000
Instances
10,000
8,412,673 tests
§03 · Methodology
How the metrics and scores are computed. Four principles govern the Valkyrie scoring system.
Principle 01
Principle 02
Principle 03
Principle 04
§04 · Evaluation pipeline
The method
Two phases build the dataset; two evaluate any model against it.
Phase 01
Phase 02
5,270 F2P · 11,486 P2P
Phase 03
Phase 04
FAIL_TO_PASS + PASS_TO_PASS
§05 · Results
Kimi K2.5 resolves 2 of 20. Nova 2 Lite resolves 0 of 20.
Success rate by difficulty (Fig. 1), resolution rate per CWE class (Fig. 2), and failure-mode breakdown (Fig. 3). See §06 for per-instance details.
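The headline percentages follow from simple arithmetic over per-instance outcomes. A minimal sketch, assuming one attempt per instance and a hypothetical `pass_at_1` helper (the outcome lists are rebuilt from the reported totals, not from real run logs):

```python
from typing import Sequence


def pass_at_1(resolved: Sequence[bool]) -> float:
    """Fraction of instances resolved on the first (and only) attempt."""
    if not resolved:
        raise ValueError("need at least one instance")
    return sum(resolved) / len(resolved)


# Outcome lists reconstructed from the reported totals (2 of 20, 0 of 20).
kimi_k25 = [True] * 2 + [False] * 18
nova_2_lite = [False] * 20

if __name__ == "__main__":
    print(f"Kimi K2.5:   {pass_at_1(kimi_k25):.0%}")
    print(f"Nova 2 Lite: {pass_at_1(nova_2_lite):.0%}")
```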
§06 · Dataset Viewer
Each run is documented receipt by receipt. Click any row to view the complete problem statement, test counts, and per-model timing and cost details.
Kimi K2.5 · Pass@1
10%
2 of 20 resolved
Nova 2 Lite · Pass@1
0%
0 of 20 resolved
Only Kimi passed
2 instances
FFmpeg · jq
Cost spread
$0.02 → $12.59
per-run, Nova low → Kimi high
| Instance | Difficulty | CWE | F2P | Kimi K2.5 | Nova 2 Lite |
|---|---|---|---|---|---|
§07 · Model comparison
Head-to-head breakdown of both models on security vulnerability remediation.
Nova 2 Lite
Kimi K2.5
§08 · Resources