KRAKEN

PERFORMANCE METRICS

Can AI agents optimize code faster than experts?

Kraken evaluates AI coding agents on their ability to optimize real-world Python code for runtime performance. Using the SWE-fficiency methodology, agents must investigate repository-level codebases, localize performance bottlenecks, and produce patches that match or exceed expert-level speedup, all while maintaining correctness against the project's test suite. Each instance reconstructs a production pull request drawn from 3000 open-source Python repositories, with automated timing harnesses and gold-standard speedup baselines for reproducible evaluation.
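The timing harness itself is not shown here; as a minimal sketch, a speedup measurement might look like the following. The function name `measure_speedup` and the min-of-repeats policy are assumptions of this sketch, not the benchmark's actual harness:

```python
import timeit

def measure_speedup(run_before, run_after, repeats=5):
    """Hypothetical harness: speedup = best wall time before the patch
    divided by best wall time after. Min-of-repeats damps scheduler noise."""
    t_before = min(timeit.repeat(run_before, repeat=repeats, number=1))
    t_after = min(timeit.repeat(run_after, repeat=repeats, number=1))
    return t_before / t_after

# Toy workload: linear list scans vs. constant-time set lookups
items = list(range(2_000))
item_set = set(items)
speedup = measure_speedup(
    lambda: [i in items for i in range(2_000)],     # O(n) scan per lookup
    lambda: [i in item_set for i in range(2_000)],  # O(1) hash lookup
)
```

Taking the minimum over repeats, rather than the mean, is a common choice for microbenchmarks because noise only ever makes a run slower.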

Keep scrolling. The answer is in §04.

Instances

20

Models Evaluated

2

Max Gold Speedup

26.9×

Best HSR (GLM-5)

0.313

Repos Covered

Difficulty Levels

Three steps from repository to scored result.

Step 01

Investigate & Localize

  • Agent receives a repository with a known performance bottleneck
  • Must investigate the codebase, identify slow code paths
  • Localize the optimization target

Step 02

Optimize & Patch

  • Produce a code patch that improves runtime performance
  • Measured against expert gold-standard speedup
  • Scored via Speedup Ratio (SR) metric

Step 03

Verify Correctness

  • Patched code must pass all covering correctness tests
  • Incorrect patches penalized: SR = 1/Gold_Speedup
  • Correctness is never sacrificed for speed
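Taken together, the three steps reduce each instance to one of three outcomes. A minimal sketch of that bucketing (the SR ≥ 1.0 cutoff for an outright pass is an assumption of this sketch, not a published threshold):

```python
def classify_outcome(passed_tests: bool, sr: float) -> str:
    """Bucket one instance: 'fail' if the patch breaks the tests,
    'pass' if correct and at least expert-level (assumed SR >= 1.0),
    'slow' if correct but short of the expert speedup."""
    if not passed_tests:
        return "fail"
    return "pass" if sr >= 1.0 else "slow"
```

Usage: `classify_outcome(True, 0.4)` yields `"slow"`, the correct-but-slow bucket that separates the two models most sharply in the results below.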

GLM-5 achieves HSR 0.313. Nova-2-Lite achieves HSR 0.268.

GLM-5 passes 7 of 20 instances outright. Nova-2-Lite passes 2 of 20 but produces correct (slow) patches on 10 more.

GLM-5 costs ~$2.14 per instance on average. Nova-2-Lite costs ~$0.09 per instance, roughly 24× cheaper.

Fig. 1 — HSR Harmonic Mean
Fig. 2 — Outcome Distribution
Fig. 3 — HSR by Difficulty
Fig. 4 — Cost vs. HSR
Fig. 5 — Per-Instance HSR

HSR harmonic mean (Fig. 1), outcome breakdown (Fig. 2), difficulty analysis (Fig. 3), cost-efficiency (Fig. 4), and per-instance detail (Fig. 5). See §05 for per-instance receipts.

Dataset viewer for 20 instances evaluated against two models.
Instance | Difficulty | Gold Speedup | GLM-5 HSR | Nova HSR | GLM-5 Outcome | Nova Outcome

Head-to-head breakdown of both evaluated models on the Kraken dataset.

GLM-5

HSR Harmonic Mean
0.313
Correctness Rate
70%
Pass@1
7 / 20
Outcome Split
6 fail · 7 slow · 7 pass
Avg Cost
$2.41

Nova-2-Lite

HSR Harmonic Mean
0.268
Correctness Rate
60%
Pass@1
2 / 20
Outcome Split
8 fail · 10 slow · 2 pass
Avg Cost
$0.27

How the metric scores are computed.

Scoring framework

Four principles govern the Kraken scoring system.

Principle 01

Speedup Ratio (SR)

  • SR = Speedup_LM / Speedup_Gold
  • A score of 1.0 means the agent matched the expert
  • Values above 1.0 indicate the agent exceeded expert performance
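The ratio above maps directly to code; a minimal sketch (`speedup_ratio` is a hypothetical helper name):

```python
def speedup_ratio(speedup_lm: float, speedup_gold: float) -> float:
    """SR = Speedup_LM / Speedup_Gold.
    1.0 means parity with the expert patch; >1.0 beats it."""
    return speedup_lm / speedup_gold

# An agent reaching half of a 26.9x gold speedup scores SR = 0.5
sr = speedup_ratio(13.45, 26.9)
```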

Principle 02

Harmonic Mean

  • Individual SR values aggregated via harmonic mean
  • Penalizes inconsistency across instances
  • Prevents a single outlier from inflating the score
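For intuition, compare the harmonic and arithmetic means over a run with one weak instance (illustrative SR values only, not benchmark data):

```python
from statistics import harmonic_mean

srs = [1.0, 1.0, 0.1]        # two expert-level patches, one weak one
hm = harmonic_mean(srs)      # 3 / (1/1 + 1/1 + 1/0.1) = 0.25
am = sum(srs) / len(srs)     # 0.70
# The harmonic mean drags the aggregate toward the weakest instance,
# so a single strong outlier cannot mask inconsistent performance.
```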

Principle 03

Correctness Gating

  • Patches must pass Covering Test Suite (CTS)
  • Failed patches penalized: SR = 1/Gold_Speedup
  • Correctness is never sacrificed for speed
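The gating rule follows directly from the stated penalty formula; a minimal sketch (`gated_sr` is a hypothetical helper name):

```python
def gated_sr(passed_cts: bool, speedup_lm: float, speedup_gold: float) -> float:
    """Correctness gating: a patch that fails the Covering Test Suite (CTS)
    scores SR = 1 / Gold_Speedup, as if it delivered no speedup at all."""
    if not passed_cts:
        return 1.0 / speedup_gold
    return speedup_lm / speedup_gold

gated_sr(True, 26.9, 26.9)   # 1.0: matched the expert
gated_sr(False, 26.9, 26.9)  # ~0.037: fast but wrong earns only the floor score
```

Because the penalty scales with the gold speedup, breaking the tests on a hard instance (large gold speedup) is punished more severely than on an easy one.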

Principle 04

Pass@1 Protocol

  • Each model gets exactly one attempt per instance
  • Mirrors real-world single-submission workflow
  • No retry, no cherry-picking best runs
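Under this protocol the Pass@1 figure is simply the single-attempt pass fraction. Using GLM-5's reported outcome split (6 fail · 7 slow · 7 pass) as the worked example:

```python
def pass_at_1(outcomes):
    """Fraction of instances whose single attempt passed outright."""
    return sum(o == "pass" for o in outcomes) / len(outcomes)

glm5 = ["fail"] * 6 + ["slow"] * 7 + ["pass"] * 7
rate = pass_at_1(glm5)  # 7 / 20 = 0.35
```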