MILO-BENCH

RL ENVIRONMENT

Can frontier models perform long-horizon software evolution with production-grade reliability?

MILO-Bench trains AI agents to perform long-horizon software evolution, extending SWE-EVO. Given real-world codebases and their development history, agents work through sequences of 2 to 100+ consecutive production pull requests across Python, Rust, Go, TypeScript, JavaScript, Java, C, and C++. Unlike single-issue setups, this requires maintaining coherent, consistent development over extended timelines. Each instance reconstructs a milestone from a post-training-cutoff repository and runs in a hermetic Docker environment.

  • Long-horizon evaluation: 2–100+ consecutive PR sequences per instance
  • Multi-language coverage across 8 programming languages
  • Contamination-audited instances from post-training-cutoff repositories
  • Hermetic Docker environments with private holistic test oracles
  • Three-model reference evaluation framework for reproducible scoring

Fifty curated instances. Three frontier models. Four hundred fifty trajectories.

50 Instances curated across 8 languages
3 Models evaluated Claude Opus 4.6 · GLM 5 · Kimi K2.5
450 Trajectories generated 3 models × 3 seeds × 50 instances

The method

Three phases turn raw repositories into a scored RL environment.

Phase 01

Environment Setup

  • Crawl 500+ repos, identify post-cutoff PR sequences
  • Build a JSONL dataset with per-instance metadata (see the record sketch after this list)
  • Construct hermetic Docker images per instance (base → PR image)
  • Generate run results and test baselines
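
A minimal sketch of what one instance record in that JSONL dataset might look like. Every field name here (instance_id, pr_sequence, docker_image, and so on) is an illustrative assumption, not the benchmark's published schema.

```python
import json

# Hypothetical MILO-Bench instance record; all field names below are
# assumptions for illustration, not the benchmark's actual schema.
instance = {
    "instance_id": "example-repo__milestone-0001",      # hypothetical ID scheme
    "repo": "https://github.com/example/example-repo",  # placeholder repository
    "language": "Python",                 # one of the 8 covered languages
    "base_commit": "abc1234",             # commit the base Docker image builds from
    "pr_sequence": [101, 102, 103],       # consecutive production PR numbers
    "horizon": 3,                         # length of the PR sequence (2 to 100+)
    "docker_image": "milo-bench/example-repo:milestone-0001",
}

# JSONL: one JSON object per line.
with open("instances.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(instance) + "\n")
```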

Phase 02

Agent Interaction

  • Run AI agents in sandboxed Docker environments
  • 3 models × 3 seeds × 50 instances
  • Agents receive issue context and produce git diff patches (patch collection sketched after this list)
  • Tool-use interaction across long-horizon sequences
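
One way the harness might collect that patch, assuming a running sandbox container (here named agent-sandbox) with the repo checked out at /workspace; both names are illustrative, not confirmed by the source.

```python
import subprocess

def collect_patch(container: str = "agent-sandbox",
                  workdir: str = "/workspace") -> str:
    """Capture everything the agent changed in its sandbox as a unified
    git diff, the patch format the evaluator consumes. Container and
    path names are assumptions for illustration."""
    # Register untracked files (intent-to-add) so files the agent
    # created from scratch also appear in the diff.
    subprocess.run(
        ["docker", "exec", container, "git", "-C", workdir, "add", "-N", "."],
        check=True,
    )
    result = subprocess.run(
        ["docker", "exec", container, "git", "-C", workdir, "diff"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```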

Phase 03

Evaluation & Scoring

  • Apply model patches to clean repo checkouts (see the scoring sketch after this list)
  • Evaluate against the full test suite in hermetic containers
  • Produce pass@k summaries per instance
  • Grade resolution verdicts across all trajectories
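
A hedged sketch of that scoring step: apply the patch to a clean checkout and run the suite. The pytest command stands in for whichever language-specific suite an instance actually uses inside its hermetic container.

```python
import subprocess

def resolve_verdict(repo_dir: str, patch_path: str) -> bool:
    """Apply a model patch to a clean checkout and return a pass/fail
    verdict. Directory layout and test command are illustrative."""
    # Reset the checkout so every trajectory is scored from the same state.
    subprocess.run(["git", "-C", repo_dir, "checkout", "--", "."], check=True)
    # Apply the agent's unified diff; a patch that fails to apply fails the run.
    if subprocess.run(["git", "-C", repo_dir, "apply", patch_path]).returncode != 0:
        return False
    # Pass only if the full test suite is green.
    return subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_dir).returncode == 0
```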

Fig. 1: Combined Pass@3 by PR horizon range for three frontier models (Claude Opus 4.6, Kimi K2.5, GLM 5). All models degrade sharply as the horizon grows, from ~65% at 2–5 PRs to near 0% beyond 40 PRs.
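
For reference, Pass@3 can be computed with the standard unbiased pass@k estimator (Chen et al., 2021). Whether MILO-Bench uses this exact formula is an assumption; with 3 seeds and k = 3 it reduces to "did any seed resolve the instance?"

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n trajectories sampled, c of them resolved the
    instance, k is the evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)

assert pass_at_k(n=3, c=0, k=3) == 0.0  # no seed passed
assert pass_at_k(n=3, c=1, k=3) == 1.0  # any passing seed gives pass@3 = 1
```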

Distribution of best Pass@3 rate per instance (highest of Claude Opus 4.6, Kimi K2.5, and GLM 5):

  • ≥ 50% pass: 18 instances (36.0%)
  • 10–49% pass: 20 instances (40.0%)
  • < 10% pass: 8 instances (16.0%)
  • 0% pass: 4 instances (8.0%)

Mean pass rate: 34.2%

Total Instances: 50
Models: 3
Total Runs: 450
Mean Pass Rate: 34.2%

Per-instance results table (columns: Instance, PR Range, Language, Claude Opus 4.6, GLM 5, Kimi K2.5, Repo).