KRAKEN

PERFORMANCE METRICS

Can AI agents optimize code faster than experts?

Kraken evaluates AI coding agents on their ability to optimize real-world Python code for runtime performance. Using the SWE-fficiency methodology, agents must investigate repository-level codebases, localize performance bottlenecks, and produce patches that match or exceed expert-level speedup, all while maintaining correctness against the project's test suite. Each instance reconstructs a production pull request drawn from 3000 open-source Python repositories, with automated timing harnesses and gold-standard speedup baselines for reproducible evaluation.
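The timing harness itself is not shown here; as a minimal sketch, a speedup measurement might look like the following. The function name `measure_speedup` and the min-of-repeats policy are assumptions of this sketch, not the benchmark's actual harness:

```python
import timeit

def measure_speedup(run_before, run_after, repeats=5):
    """Hypothetical harness: speedup = best wall time before the patch
    divided by best wall time after. Min-of-repeats damps scheduler noise."""
    t_before = min(timeit.repeat(run_before, repeat=repeats, number=1))
    t_after = min(timeit.repeat(run_after, repeat=repeats, number=1))
    return t_before / t_after

# Toy workload: linear list scans vs. constant-time set lookups
items = list(range(2_000))
item_set = set(items)
speedup = measure_speedup(
    lambda: [i in items for i in range(2_000)],     # O(n) scan per lookup
    lambda: [i in item_set for i in range(2_000)],  # O(1) hash lookup
)
```

Taking the minimum over repeats, rather than the mean, is a common choice for microbenchmarks because noise only ever makes a run slower.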

Keep scrolling. The answer is in §04.

Instances

20

Models Evaluated

2

Max Gold Speedup

26.9×

Best HSR (GLM-5)

0.313

Repos Covered

Difficulty Levels

Three steps from repository to scored result.

Step 01

Investigate & Localize

  • Agent receives a repository with a known performance bottleneck
  • Must investigate the codebase, identify slow code paths
  • Localize the optimization target

Step 02

Optimize & Patch

  • Produce a code patch that improves runtime performance
  • Measured against expert gold-standard speedup
  • Scored via Speedup Ratio (SR) metric

Step 03

Verify Correctness

  • Patched code must pass all covering correctness tests
  • Incorrect patches penalized: SR = 1/Gold_Speedup
  • Correctness is never sacrificed for speed
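Taken together, the three steps reduce each instance to one of three outcomes. A minimal sketch of that bucketing (the SR ≥ 1.0 cutoff for an outright pass is an assumption of this sketch, not a published threshold):

```python
def classify_outcome(passed_tests: bool, sr: float) -> str:
    """Bucket one instance: 'fail' if the patch breaks the tests,
    'pass' if correct and at least expert-level (assumed SR >= 1.0),
    'slow' if correct but short of the expert speedup."""
    if not passed_tests:
        return "fail"
    return "pass" if sr >= 1.0 else "slow"
```

Usage: `classify_outcome(True, 0.4)` yields `"slow"`, the correct-but-slow bucket that separates the two models most sharply in the results below.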

GLM-5 achieves HSR 0.313. Nova-2-Lite achieves HSR 0.268.

GLM-5 passes 7 of 20 instances outright. Nova-2-Lite passes 2 of 20 but produces correct (slow) patches on 10 more.

GLM-5 costs ~$2.14 per instance on average. Nova-2-Lite costs ~$0.09 per instance, roughly 24× cheaper.

Fig. 1 — HSR Harmonic Mean
Fig. 2 — Outcome Distribution
Fig. 3 — HSR by Difficulty
Fig. 4 — Cost vs. HSR
Fig. 5 — Per-Instance HSR

HSR harmonic mean (Fig. 1), outcome breakdown (Fig. 2), difficulty analysis (Fig. 3), cost-efficiency (Fig. 4), and per-instance detail (Fig. 5). See §05 for per-instance receipts.

Dataset viewer for 20 instances evaluated against two models.
Instance | Difficulty | Gold Speedup | GLM-5 HSR | Nova HSR | GLM-5 Outcome | Nova Outcome

Head-to-head breakdown of both evaluated models on the Kraken dataset.

GLM-5

HSR Harmonic Mean
0.313
Correctness Rate
70%
Pass@1
7 / 20
Outcome Split
6 fail · 7 slow · 7 pass
Avg Cost
$2.41

Nova-2-Lite

HSR Harmonic Mean
0.268
Correctness Rate
60%
Pass@1
2 / 20
Outcome Split
8 fail · 10 slow · 2 pass
Avg Cost
$0.27

How the metric scores are computed.

Scoring framework

Four principles govern the Kraken scoring system.

Principle 01

Speedup Ratio (SR)

  • SR = Speedup_LM / Speedup_Gold
  • A score of 1.0 means the agent matched the expert
  • Values above 1.0 indicate the agent exceeded expert performance
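The ratio above maps directly to code; a minimal sketch (`speedup_ratio` is a hypothetical helper name):

```python
def speedup_ratio(speedup_lm: float, speedup_gold: float) -> float:
    """SR = Speedup_LM / Speedup_Gold.
    1.0 means parity with the expert patch; >1.0 beats it."""
    return speedup_lm / speedup_gold

# An agent reaching half of a 26.9x gold speedup scores SR = 0.5
sr = speedup_ratio(13.45, 26.9)
```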

Principle 02

Harmonic Mean

  • Individual SR values aggregated via harmonic mean
  • Penalizes inconsistency across instances
  • Prevents a single outlier from inflating the score
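For intuition, compare the harmonic and arithmetic means over a run with one weak instance (illustrative SR values only, not benchmark data):

```python
from statistics import harmonic_mean

srs = [1.0, 1.0, 0.1]        # two expert-level patches, one weak one
hm = harmonic_mean(srs)      # 3 / (1/1 + 1/1 + 1/0.1) = 0.25
am = sum(srs) / len(srs)     # 0.70
# The harmonic mean drags the aggregate toward the weakest instance,
# so a single strong outlier cannot mask inconsistent performance.
```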

Principle 03

Correctness Gating

  • Patches must pass Covering Test Suite (CTS)
  • Failed patches penalized: SR = 1/Gold_Speedup
  • Correctness is never sacrificed for speed
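The gating rule follows directly from the stated penalty formula; a minimal sketch (`gated_sr` is a hypothetical helper name):

```python
def gated_sr(passed_cts: bool, speedup_lm: float, speedup_gold: float) -> float:
    """Correctness gating: a patch that fails the Covering Test Suite (CTS)
    scores SR = 1 / Gold_Speedup, as if it delivered no speedup at all."""
    if not passed_cts:
        return 1.0 / speedup_gold
    return speedup_lm / speedup_gold

gated_sr(True, 26.9, 26.9)   # 1.0: matched the expert
gated_sr(False, 26.9, 26.9)  # ~0.037: fast but wrong earns only the floor score
```

Because the penalty scales with the gold speedup, breaking the tests on a hard instance (large gold speedup) is punished more severely than on an easy one.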

Principle 04

Pass@1 Protocol

  • Each model gets exactly one attempt per instance
  • Mirrors real-world single-submission workflow
  • No retry, no cherry-picking best runs
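Under this protocol the Pass@1 figure is simply the single-attempt pass fraction. Using GLM-5's reported outcome split (6 fail · 7 slow · 7 pass) as the worked example:

```python
def pass_at_1(outcomes):
    """Fraction of instances whose single attempt passed outright."""
    return sum(o == "pass" for o in outcomes) / len(outcomes)

glm5 = ["fail"] * 6 + ["slow"] * 7 + ["pass"] * 7
rate = pass_at_1(glm5)  # 7 / 20 = 0.35
```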