Can frontier models perform long-horizon software evolution with production-grade reliability?
§01 · The environment
MILO-Bench evaluates AI agents on long-horizon software evolution, extending SWE-EVO. Given real-world codebases and their development history, agents work through sequences of 2 to 100+ consecutive production pull requests across Python, Rust, Go, TypeScript, JavaScript, Java, C, and C++. Unlike single-issue setups, this demands coherent, consistent development sustained over extended timelines. Each instance reconstructs a milestone from a post-training-cutoff repository and ships with a hermetic Docker environment.
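Concretely, grading one trajectory reduces to running the milestone's test baseline inside the instance's hermetic image. A minimal sketch of that check, assuming a `docker` CLI on the host and illustrative field names (`docker_image`, `test_baseline`) that are not MILO-Bench's documented schema:

```python
import subprocess

def milestone_passes(instance: dict, workdir: str = "/repo") -> bool:
    """Run the instance's test baseline inside its hermetic Docker image.

    Field names are illustrative assumptions, not the benchmark's documented
    schema; this also assumes the agent's final workspace state has been
    committed to `docker_image`. A trajectory counts as a pass only if the
    full baseline exits cleanly.
    """
    result = subprocess.run(
        ["docker", "run", "--rm", "-w", workdir,
         instance["docker_image"], *instance["test_baseline"]],
        capture_output=True,
        text=True,
        timeout=3600,  # hermetic runs should be time-bounded
    )
    return result.returncode == 0
```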
§02 · The funnel
Fifty curated instances. Three frontier models. Four hundred fifty trajectories.
§03 · The pipeline
The method
Phase 01 · Environment Setup
- Crawl 500+ repos and identify post-cutoff PR sequences
- Build a JSONL dataset with per-instance metadata (one record is sketched after the phase list)
- Construct hermetic Docker images per instance (base → PR image)
- Generate run results and test baselines
Phase 02
Phase 03
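One line of the Phase 01 JSONL might look like the record below; every field name and value is an illustrative assumption (the actual schema isn't shown on this page), reusing `docker_image` and `test_baseline` from the harness sketch in §01:

```python
import json

# Hypothetical instance record; field names and values are illustrative,
# not MILO-Bench's documented schema.
instance = {
    "instance_id": "example__repo-milestone-0042",
    "repo": "https://github.com/example/repo",
    "language": "Python",
    "base_commit": "<sha of the state before the PR sequence>",
    "pr_sequence": [101, 102, 103],     # consecutive production PRs
    "docker_image": "milo-bench/example__repo:milestone-0042",
    "test_baseline": ["pytest", "-q"],  # command that must exit cleanly
}

with open("instances.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(instance) + "\n")
```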
§04 · The results
Pass@3 by PR horizon range for three frontier models (Claude Opus 4.6, Kimi K2.5, GLM 5).
All models degrade sharply as the horizon grows: Pass@3 falls from ~65% at 2–5 PRs to near 0% beyond 40 PRs.
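The page doesn't define Pass@3 explicitly; with three runs per model per instance (per §02's 450 trajectories), the standard unbiased pass@k estimator reduces to "at least one of the three runs passed". A sketch under that assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k: chance that at least one of k samples,
    drawn without replacement from n runs of which c passed, succeeds."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing run
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n = k = 3 (three trajectories per model per instance), pass@3 is
# simply "did any of the three runs pass".
assert pass_at_k(3, 0, 3) == 0.0
assert pass_at_k(3, 2, 3) == 1.0
```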
§05 · Pass rate distribution
Distribution of the best Pass@3 per instance (the highest among Claude Opus 4.6, Kimi K2.5, and GLM 5).
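The per-instance "best" is then just a max over the three models' Pass@3 values; a sketch with illustrative record shapes:

```python
# Illustrative shape: per-instance Pass@3 for each model (names from §04).
pass_at_3 = {
    "inst-001": {"Claude Opus 4.6": 1.0, "GLM 5": 0.0, "Kimi K2.5": 1.0},
    "inst-002": {"Claude Opus 4.6": 0.0, "GLM 5": 0.0, "Kimi K2.5": 0.0},
}

# Best-of-models Pass@3 per instance — the quantity whose distribution §05 plots.
best = {inst: max(by_model.values()) for inst, by_model in pass_at_3.items()}
```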
§06 · Dataset viewer
Total Instances: 50
Models: 3
Total Runs: 450
Mean Pass Rate: 34.2%
| Instance | PR Range | Language | Claude Opus 4.6 | GLM 5 | Kimi K2.5 | Repo |
|---|---|---|---|---|---|---|
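For browsing the dataset outside the viewer, each JSONL line parses to one instance record; the file name and field names below are the same assumptions as the Phase 01 sketch:

```python
import json

# "instances.jsonl" and the field names are assumptions carried over from
# the Phase 01 sketch; adjust to the released dataset's actual schema.
with open("instances.jsonl", encoding="utf-8") as f:
    instances = [json.loads(line) for line in f if line.strip()]

# Example filter mirroring the viewer's columns: long-horizon Rust instances.
long_rust = [
    inst for inst in instances
    if inst["language"] == "Rust" and len(inst["pr_sequence"]) >= 40
]
```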
§07 · Resources