MILO-BENCH

RL ENVIRONMENT

Can frontier models perform long-horizon software evolution with production-grade reliability?

MILO-Bench trains AI agents to perform long-horizon software evolution, extending SWE-EVO. Given real-world codebases and their development history, agents work through sequences of 2 to 100+ consecutive production pull requests across Python, Rust, Go, TypeScript, JavaScript, Java, C, and C++. Unlike single-issue setups, this requires maintaining coherent, consistent development over extended timelines. Each instance reconstructs a milestone from a post-training-cutoff repository and runs in a hermetic Docker environment.

  • Long-horizon evaluation: 2–100+ consecutive PR sequences per instance
  • Multi-language coverage across 8 programming languages
  • Contamination-audited instances from post-training-cutoff repositories
  • Hermetic Docker environments with private holistic test oracles
  • Three-model reference evaluation framework for reproducible scoring

Fifty curated instances. Three frontier models. Four hundred fifty trajectories.

50 Instances curated across 8 languages
3 Models evaluated Claude Opus 4.6 · GLM 5 · Kimi K2.5
450 Trajectories generated 3 models × 3 seeds × 50 instances

The method

Three phases turn raw repositories into a scored RL environment.

Phase 01

Environment Setup

  • Crawl 500+ repos, identify post-cutoff PR sequences
  • Build a JSONL dataset with per-instance metadata (see the record sketch after this list)
  • Construct hermetic Docker images per instance (base → PR image)
  • Generate run results and test baselines
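
A minimal sketch of what one instance record in that JSONL dataset might look like. Every field name here (instance_id, pr_sequence, docker_image, and so on) is an illustrative assumption, not the benchmark's published schema.

```python
import json

# Hypothetical MILO-Bench instance record; all field names below are
# assumptions for illustration, not the benchmark's actual schema.
instance = {
    "instance_id": "example-repo__milestone-0001",      # hypothetical ID scheme
    "repo": "https://github.com/example/example-repo",  # placeholder repository
    "language": "Python",                 # one of the 8 covered languages
    "base_commit": "abc1234",             # commit the base Docker image builds from
    "pr_sequence": [101, 102, 103],       # consecutive production PR numbers
    "horizon": 3,                         # length of the PR sequence (2 to 100+)
    "docker_image": "milo-bench/example-repo:milestone-0001",
}

# JSONL: one JSON object per line.
with open("instances.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(instance) + "\n")
```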

Phase 02

Agent Interaction

  • Run AI agents in sandboxed Docker environments
  • 3 models × 3 seeds × 50 instances
  • Agents receive issue context and produce git diff patches (patch collection sketched after this list)
  • Tool-use interaction across long-horizon sequences
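
One way the harness might collect that patch, assuming a running sandbox container (here named agent-sandbox) with the repo checked out at /workspace; both names are illustrative, not confirmed by the source.

```python
import subprocess

def collect_patch(container: str = "agent-sandbox",
                  workdir: str = "/workspace") -> str:
    """Capture everything the agent changed in its sandbox as a unified
    git diff, the patch format the evaluator consumes. Container and
    path names are assumptions for illustration."""
    # Register untracked files (intent-to-add) so files the agent
    # created from scratch also appear in the diff.
    subprocess.run(
        ["docker", "exec", container, "git", "-C", workdir, "add", "-N", "."],
        check=True,
    )
    result = subprocess.run(
        ["docker", "exec", container, "git", "-C", workdir, "diff"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```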

Phase 03

Evaluation & Scoring

  • Apply model patches to clean repo checkouts (see the scoring sketch after this list)
  • Evaluate against the full test suite in hermetic containers
  • Produce pass@k summaries per instance
  • Grade resolution verdicts across all trajectories
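
A hedged sketch of that scoring step: apply the patch to a clean checkout and run the suite. The pytest command stands in for whichever language-specific suite an instance actually uses inside its hermetic container.

```python
import subprocess

def resolve_verdict(repo_dir: str, patch_path: str) -> bool:
    """Apply a model patch to a clean checkout and return a pass/fail
    verdict. Directory layout and test command are illustrative."""
    # Reset the checkout so every trajectory is scored from the same state.
    subprocess.run(["git", "-C", repo_dir, "checkout", "--", "."], check=True)
    # Apply the agent's unified diff; a patch that fails to apply fails the run.
    if subprocess.run(["git", "-C", repo_dir, "apply", patch_path]).returncode != 0:
        return False
    # Pass only if the full test suite is green.
    return subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_dir).returncode == 0
```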

Fig. 1: Combined Pass@3 by PR horizon range for three frontier models (Claude Opus 4.6, Kimi K2.5, GLM 5). All models degrade sharply as the horizon grows, from ~65% at 2–5 PRs to near 0% beyond 40 PRs.
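
For reference, Pass@3 can be computed with the standard unbiased pass@k estimator (Chen et al., 2021). Whether MILO-Bench uses this exact formula is an assumption; with 3 seeds and k = 3 it reduces to "did any seed resolve the instance?"

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n trajectories sampled, c of them resolved the
    instance, k is the evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)

assert pass_at_k(n=3, c=0, k=3) == 0.0  # no seed passed
assert pass_at_k(n=3, c=1, k=3) == 1.0  # any passing seed gives pass@3 = 1
```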

Distribution of best Pass@3 rate per instance (highest of Claude Opus 4.6, Kimi K2.5, and GLM 5):

  • ≥ 50% pass: 18 instances (36.0%)
  • 10–49% pass: 20 instances (40.0%)
  • < 10% pass: 8 instances (16.0%)
  • 0% pass: 4 instances (8.0%)

Mean pass rate: 34.2%

Total Instances: 50
Models: 3
Total Runs: 450
Mean Pass Rate: 34.2%

Per-instance results table (columns: Instance, PR Range, Language, Claude Opus 4.6, GLM 5, Kimi K2.5, Repo).