
VALKYRIE

SECURITY VULNERABILITY REMEDIATION

How well do AI agents fix security vulnerabilities in real codebases?

Valkyrie is a reinforcement learning environment for training autonomous software-engineering agents to fix security vulnerabilities. Each task includes a codebase with a human-curated bug based on a known CWE category, a natural-language problem description, and a Docker container that provides a sandboxed environment. Reward signals are generated from partitioned test suites that track fail-to-pass and pass-to-pass outcomes. The benchmark comprises 10,000 instances covering 300 CWE classes across 1,000 codebases; two frontier models were evaluated against it.

Keep scrolling. The verdict is in §05.

Three numbers that define the scope of Valkyrie.

CWE Classes Covered

300

Target Codebases

1,000

Instances

10,000

8,412,673 tests

How the metrics and scores are computed. Four principles govern the Valkyrie scoring system.

Principle 01

CWE-Based Vulnerability Selection

  • 300 CWE classes across 1,000 real C repositories
  • Each vulnerability is human-curated and verified exploitable
  • Covers memory safety, injection, auth bypass, and crypto flaws

Principle 02

Partitioned Test Suites

  • Fail-to-pass tests: verify the vulnerability is actually fixed
  • Pass-to-pass tests: ensure no regressions introduced
  • 16,756 total test cases across all instances

Principle 03

Sandboxed Execution

  • Each instance runs in an isolated Docker container
  • Hermetic environment prevents cross-contamination
  • Reward signals generated from test suite outcomes
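A minimal sketch of how one such hermetic run might be launched. The image tag and test entry point are hypothetical; the actual harness commands are not specified here:

```python
def sandbox_cmd(image: str, test_cmd: list[str]) -> list[str]:
    """Build a `docker run` invocation for one isolated, throwaway instance.

    --rm           : discard the container after the run (no cross-contamination)
    --network none : no network access, keeping the environment hermetic
    """
    return ["docker", "run", "--rm", "--network", "none", image] + test_cmd

# Hypothetical per-instance image and test entry point.
cmd = sandbox_cmd("valkyrie/cwe-787-ffmpeg:latest", ["./run_tests.sh"])
```

With `--rm` and `--network none`, nothing persists between runs and no instance can reach another, which is what makes the per-instance reward signal trustworthy.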

Principle 04

Pass@1 Protocol

  • Each model gets exactly one attempt per vulnerability
  • No retries — mirrors real-world security patch submission
  • Measures true first-attempt remediation capability
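With a single attempt per instance, pass@1 reduces to the fraction of instances resolved on that one try. A minimal sketch, using the §05 results:

```python
def pass_at_1(resolved: list[bool]) -> float:
    """Pass@1 under a one-attempt protocol: resolved count / instance count."""
    return sum(resolved) / len(resolved)

# The reported outcomes: Kimi K2.5 resolved 2 of 20, Nova 2 Lite 0 of 20.
kimi = pass_at_1([True] * 2 + [False] * 18)   # reported as 10%
nova = pass_at_1([False] * 20)                # reported as 0%
```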

The method

Four stages turn a known CWE into a scored RL environment.

Two to build the dataset, two to evaluate any model against it.

Phase 01

Bug Synthesis

  • Select CWE category and target function
  • Apply SWE-smith mutation reproducing the behavioral signature
  • Validate the mutation actually reproduces the vulnerability
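The validation step can be sketched as a check over the partitioned suite on the un-patched code. Function and test names here are hypothetical:

```python
def mutation_reproduces_bug(results_on_buggy_code: dict[str, bool],
                            f2p: set[str], p2p: set[str]) -> bool:
    """A mutation is valid only if, on the buggy (un-patched) code,
    every fail-to-pass test fails and every pass-to-pass test still passes."""
    f2p_all_fail = not any(results_on_buggy_code[t] for t in f2p)
    p2p_all_pass = all(results_on_buggy_code[t] for t in p2p)
    return f2p_all_fail and p2p_all_pass

# Hypothetical run on the mutated code: the regression test fails, guards pass.
ok = mutation_reproduces_bug(
    {"test_no_overflow": False, "test_parse": True, "test_encode": True},
    f2p={"test_no_overflow"},
    p2p={"test_parse", "test_encode"},
)
```

If any F2P test already passes on the mutated code, the mutation did not actually introduce the vulnerability and the instance is rejected.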

Phase 02

Instance Assembly

  • Write natural-language problem statement
  • Partition suite: 5,270 F2P · 11,486 P2P
  • Package into hermetic Docker image

Phase 03

Inference

  • Agent receives problem statement in Docker sandbox
  • Tool-use interaction across 2 frontier models
  • Produces a git diff patch

Phase 04

Evaluation

  • Apply patch, run F2P + P2P test partitions
  • Grade FAIL_TO_PASS + PASS_TO_PASS
  • Emit resolution verdict per instance
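The steps above can be sketched end to end. The patch applier and test runner are injected as stand-ins for the real harness, and the verdict labels are illustrative, not Valkyrie's actual strings:

```python
from typing import Callable

def evaluate_instance(apply_patch: Callable[[], bool],
                      run_tests: Callable[[list[str]], dict[str, bool]],
                      fail_to_pass: list[str],
                      pass_to_pass: list[str]) -> str:
    """Apply the agent's patch, run both partitions, emit a verdict."""
    if not apply_patch():                                   # e.g. `git apply` on the diff
        return "PATCH_APPLY_FAILED"
    results = run_tests(fail_to_pass + pass_to_pass)
    fixed = all(results[t] for t in fail_to_pass)           # FAIL_TO_PASS grading
    no_regression = all(results[t] for t in pass_to_pass)   # PASS_TO_PASS grading
    return "RESOLVED" if fixed and no_regression else "UNRESOLVED"

# Stub harness: the patch applies cleanly and every test passes.
verdict = evaluate_instance(
    apply_patch=lambda: True,
    run_tests=lambda tests: {t: True for t in tests},
    fail_to_pass=["test_cve_regression"],
    pass_to_pass=["test_parse", "test_encode"],
)
```

An instance counts as resolved only when both conditions hold at once: the vulnerability is fixed and nothing else broke.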

So how did the models actually do? §05 →

Kimi K2.5 resolves 2 of 20. Nova 2 Lite resolves 0 of 20.

Fig. 1 · Success rate by difficulty
Fig. 2 · Resolution rate by CWE class
Fig. 3 · Failure-mode distribution

Success rate by difficulty (Fig. 1), resolution rate per CWE class (Fig. 2), and failure-mode breakdown (Fig. 3). See §06 for per-instance details.

Each run is documented on a receipt-by-receipt basis. Click any row to view the complete problem statement, along with test counts and detailed per-model timing and cost information.

Kimi K2.5 · Pass@1

10%

2 of 20 resolved

Nova 2 Lite · Pass@1

0%

0 of 20 resolved

Only Kimi passed

2 instances

FFmpeg · jq

Cost spread

$0.02 → $12.59

per-run, Nova low → Kimi high

[Interactive results table: Instance · Diff · CWE · F2P · Kimi K2.5 · Nova 2 Lite]

Click any row for full run metadata.

Head-to-head breakdown of both models on security vulnerability remediation.

Nova 2 Lite

  • Overall Pass Rate: 22.4%
  • Memory Safety: 18.7%
  • Injection Flaws: 26.1%
  • Auth Bypass: 19.3%
  • Avg Patch Size: 23 LOC

Kimi K2.5

  • Overall Pass Rate: 41.8%
  • Memory Safety: 35.2%
  • Injection Flaws: 48.6%
  • Auth Bypass: 38.9%
  • Avg Patch Size: 18 LOC