Can AI agents optimize code faster than experts?
§01 · The question
Kraken evaluates AI coding agents on their ability to optimize real-world Python code for runtime performance. Using the SWE-fficiency methodology, agents must investigate repository-level codebases, localize performance bottlenecks, and produce patches that match or exceed expert-level speedups, all while maintaining correctness against the project's test suite. Each instance reconstructs a production pull request drawn from 3000 open-source Python repositories, with automated timing harnesses and gold-standard speedup baselines for reproducible evaluation.
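The timing harness itself is not shown in this section. As a minimal sketch of the idea, assuming each instance exposes a zero-argument workload callable (a hypothetical interface for illustration, not Kraken's actual harness), speedup can be measured as the ratio of baseline runtime to patched runtime:

```python
import timeit

def best_runtime(workload, repeats=5, number=10):
    """Best-of-N per-call runtime for a zero-argument workload callable.

    Taking the minimum across repeats reduces noise from other processes.
    """
    return min(timeit.repeat(workload, repeat=repeats, number=number)) / number

def measured_speedup(baseline_workload, patched_workload):
    """Speedup of the patched code relative to the unpatched baseline."""
    return best_runtime(baseline_workload) / best_runtime(patched_workload)
```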
§02 · Key metrics
- Instances: 20
- Models Evaluated: 2
- Max Gold Speedup: 26.9×
- Best HSR (GLM-5): 0.313
- Repos Covered
- Difficulty Levels
§03 · Evaluation pipeline
Three steps from repository to scored result.
Step 01 · Investigate & Localize
- Agent receives a repository with a known performance bottleneck
- Must investigate the codebase and identify slow code paths (see the profiling sketch after this list)
- Localizes the optimization target
Step 02 · Optimize & Patch
- Agent edits the code and produces a patch for the bottleneck
- Goal: match or exceed the expert-level (gold) speedup
Step 03 · Verify & Score
- Patch must keep the project's test suite passing
- Automated timing harness measures the achieved speedup against the gold-standard baseline
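The benchmark does not prescribe how agents investigate; one common localization approach (purely illustrative, with a hypothetical `workload` callable) is to profile a representative run and inspect the hottest call sites:

```python
import cProfile
import pstats

def profile_workload(workload, top_n=15):
    """Run a zero-argument workload under cProfile and print the hottest call sites."""
    profiler = cProfile.Profile()
    profiler.enable()
    workload()
    profiler.disable()
    # Sorting by cumulative time surfaces the slow code paths worth localizing.
    pstats.Stats(profiler).sort_stats(pstats.SortKey.CUMULATIVE).print_stats(top_n)
```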
§04 · Results
GLM-5 achieves HSR 0.313. Nova-2-Lite achieves HSR 0.268.
GLM-5 passes 7 of 20 instances outright. Nova-2-Lite passes 2 of 20 but produces correct (slow) patches on 10 more.
GLM-5 costs ~$2.14/instance avg. Nova-2-Lite costs ~$0.09/instance — 24× cheaper.
Results are reported as the HSR harmonic mean (Fig. 1), outcome breakdown (Fig. 2), difficulty analysis (Fig. 3), cost-efficiency (Fig. 4), and per-instance detail (Fig. 5). See §05 for per-instance receipts.
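HSR is not defined in this section; assuming it is a per-instance speedup ratio relative to the gold patch, aggregated by harmonic mean as the Fig. 1 label suggests, the aggregation is simply (the values below are made up for illustration):

```python
from statistics import harmonic_mean

# Hypothetical per-instance HSR values for one model (not actual Kraken data).
per_instance_hsr = [0.42, 0.18, 0.55, 0.27]

# The headline number corresponding to "HSR harmonic mean" in Fig. 1.
print(f"harmonic-mean HSR: {harmonic_mean(per_instance_hsr):.3f}")
```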
§05 · Dataset Viewer
| Instance | Difficulty | Gold Speedup | GLM-5 HSR | Nova HSR | GLM-5 Outcome | Nova Outcome |
|---|---|---|---|---|---|---|
§06 · Model comparison
Head-to-head breakdown of both evaluated models on the Kraken dataset.
| Metric | GLM-5 | Nova-2-Lite |
|---|---|---|
| HSR (harmonic mean) | 0.313 | 0.268 |
| Instances passed outright | 7 / 20 | 2 / 20 |
| Avg. cost per instance | ~$2.14 | ~$0.09 |
§07 · Methodology
How the metric scores are computed.
Scoring framework
Principle 01
Principle 02
Principle 03
Principle 04
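The four principles are not expanded in this section. Purely as a sketch of how the outcomes reported in §04 could relate to the scores, assuming correctness gates the speedup measurement and HSR compares the agent's speedup to the gold speedup (both assumptions, not Kraken's documented formula):

```python
def score_instance(tests_pass: bool, agent_speedup: float, gold_speedup: float) -> dict:
    """Hypothetical per-instance scoring: correctness gates the speedup score.

    "pass" here means the patch is correct and matches or exceeds the gold
    speedup; "correct-but-slow" means correct but below it. The thresholds
    and failure handling are assumptions, not Kraken's actual spec.
    """
    if not tests_pass:
        return {"outcome": "fail", "hsr": 0.0}
    hsr = agent_speedup / gold_speedup
    outcome = "pass" if hsr >= 1.0 else "correct-but-slow"
    return {"outcome": outcome, "hsr": hsr}
```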
§08 · Resources