KAIJU

GENERATION

Library Generation from Scratch

Kaiju evaluates frontier AI coding models on whole-library generation from specification. Each instance pairs a specification PDF with hermetic build infrastructure (Dockerfile and setup scripts), requiring models to produce complete, functional implementations that pass comprehensive test suites. Two models (GLM-5 and Nova-2-Lite) are evaluated through a three-stage pipeline: initial generation, lint refinement, and test refinement. Module-level pass rates are reported per stage and difficulty tier.
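As a rough illustration of how module-level pass rates might be rolled up per stage and difficulty tier, here is a minimal sketch. The record layout, the model name, the tier names, and the numbers are all hypothetical placeholders, not Kaiju's actual schema or results.

```python
from collections import defaultdict
from statistics import mean

def mean_pass_rates(records):
    """Mean module-level pass rate grouped by (model, stage, tier).

    Each record is a hypothetical tuple:
    (model, stage, difficulty_tier, tests_passed, tests_total).
    """
    groups = defaultdict(list)
    for model, stage, tier, passed, total in records:
        if total == 0:
            continue  # assumption: modules without tests are skipped
        groups[(model, stage, tier)].append(passed / total)
    return {key: mean(rates) for key, rates in groups.items()}

# Toy data for illustration only, not benchmark results.
records = [
    ("model-a", 1, "easy", 3, 10),
    ("model-a", 3, "easy", 8, 10),
    ("model-a", 3, "easy", 6, 10),
    ("model-a", 3, "hard", 1, 4),
]
rates = mean_pass_rates(records)
```

Averaging per-module rates (rather than pooling all tests) gives every module equal weight; whether Kaiju weights modules this way is not stated in the text.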

Four numbers that define the scope of Kaiju.

Library Instances

100% structural completeness

Frontier Models

2

GLM-5 & Nova-2-Lite

Languages

8

Python, JS, TS, Go, Rust, C, C++, Java

Refinement Stages

3

Generate → Lint → Test

The method

Three phases turn a specification into a working library.

From spec PDF through code generation to refinement, each phase builds on the last.

Phase 01

Instance Preparation

  • Identify target open-source libraries
  • Create specification PDFs
  • Build Dockerfiles & setup scripts for hermetic environments

Phase 02

Code Generation

  • Run AI models in sandboxes
  • Generate complete library implementations from specifications
  • Produce per-module outputs

Phase 03

Evaluation & Refinement

  • Execute test suites, score pass rates
  • Run lint and test refinement stages
  • Produce confidence scores
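The generate, lint, and test refinement flow described above could be sketched roughly as follows. Every callable here (`generate`, `lint`, `run_tests`, `refine`) is a hypothetical stand-in for a model call or sandboxed tool run, and scoring the test suite after each stage is an assumption about how per-stage pass rates are obtained.

```python
from typing import Callable, Dict, List, Tuple

Code = Dict[str, str]          # hypothetical: file path -> source text
TestReport = Dict[str, bool]   # hypothetical: test name -> passed?

def pass_rate(report: TestReport) -> float:
    """Fraction of tests that passed (0.0 for an empty report)."""
    return sum(report.values()) / len(report) if report else 0.0

def run_three_stages(
    spec: str,
    generate: Callable[[str], Code],
    lint: Callable[[Code], List[str]],
    run_tests: Callable[[Code], TestReport],
    refine: Callable[[str, Code, List[str]], Code],
) -> List[Tuple[int, float]]:
    """Return (stage, pass_rate) after each of the three stages."""
    results = []
    # Stage 1: initial generation from the specification.
    code = generate(spec)
    results.append((1, pass_rate(run_tests(code))))
    # Stage 2: refine using lint findings as feedback.
    code = refine(spec, code, lint(code))
    results.append((2, pass_rate(run_tests(code))))
    # Stage 3: refine using the names of failing tests as feedback.
    failures = [name for name, ok in run_tests(code).items() if not ok]
    code = refine(spec, code, failures)
    results.append((3, pass_rate(run_tests(code))))
    return results

# Stub callables so the sketch runs without a model or a sandbox.
def fake_generate(spec: str) -> Code:
    return {"lib.py": "draft-1"}

def fake_lint(code: Code) -> List[str]:
    return ["unused import in lib.py"]

def fake_refine(spec: str, code: Code, feedback: List[str]) -> Code:
    version = int(code["lib.py"].split("-")[1]) + 1
    return {"lib.py": f"draft-{version}"}

def fake_run_tests(code: Code) -> TestReport:
    version = int(code["lib.py"].split("-")[1])
    return {f"test_{i}": i < version for i in range(4)}

results = run_three_stages(
    "spec text", fake_generate, fake_lint, fake_run_tests, fake_refine
)
print(results)
```

Passing the stages in as callables keeps the control flow separate from whatever model or container tooling actually performs each step.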

Stage 3 pass rate by number of files affected for two frontier models.

Stage 3 pass rates for both models fall as the number of files affected grows: from ~57% at 1–5 files to ~16% beyond 100 files.
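A breakdown like this one can be computed by binning instances on file count and averaging the stage 3 pass rate per bin. The sketch below assumes hypothetical bin edges: only the "1–5" and "beyond 100" bins are named in the text, and the intermediate edges are invented for illustration, as is the toy data.

```python
from bisect import bisect_left
from collections import defaultdict

# Inclusive upper edges. Only the first and last bins come from the text;
# the middle edges (20, 100) are assumptions.
EDGES = [5, 20, 100]
LABELS = ["1-5", "6-20", "21-100", ">100"]

def file_bin(n_files: int) -> str:
    """Map a file count to its bin label."""
    if n_files < 1:
        raise ValueError("n_files must be at least 1")
    return LABELS[bisect_left(EDGES, n_files)]

def pass_rate_by_bin(instances):
    """Mean pass rate per bin from (n_files, pass_rate) pairs."""
    sums = defaultdict(lambda: [0.0, 0])  # label -> [rate total, count]
    for n_files, rate in instances:
        acc = sums[file_bin(n_files)]
        acc[0] += rate
        acc[1] += 1
    return {
        label: sums[label][0] / sums[label][1]
        for label in LABELS
        if label in sums
    }

# Toy data for illustration only, not benchmark results.
by_bin = pass_rate_by_bin([(3, 0.6), (5, 0.5), (150, 0.2)])
print(by_bin)
```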

Fig. 1 — Stage 3 Pass Rate by Files Affected
Fig. 2 — Mean Pass Rate by Model and Stage
Fig. 3 — Stage Performance by Difficulty

Sample instances from the Kaiju dataset. Table columns: Instance, Difficulty, Tests, Python, GLM-5 (s3), Nova-2-Lite (s3).

Distribution of best stage-3 pass rate (highest of GLM-5 and Nova-2-Lite) across all evaluated library instances.

≥ 50% Pass

50.0% of instances

10–49% Pass

50.0% of instances

< 10% Pass

0

0% of instances

0% Pass

0

0% of instances
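The bucketing above can be reproduced with a few lines: take the higher of the two models' stage 3 pass rates for each instance, then assign it to a bucket. In this sketch the input mapping and instance names are hypothetical, the data is a toy example, and treating "< 10% Pass" as excluding exactly 0% is an assumption based on the separate "0% Pass" bucket.

```python
from collections import Counter

ORDER = [">= 50% Pass", "10-49% Pass", "< 10% Pass", "0% Pass"]

def bucket(best_rate: float) -> str:
    """Assign a best stage-3 pass rate (0.0 to 1.0) to a bucket."""
    if best_rate == 0:
        return ORDER[3]
    if best_rate < 0.10:
        return ORDER[2]  # assumption: strictly between 0% and 10%
    if best_rate < 0.50:
        return ORDER[1]
    return ORDER[0]

def distribution(scores):
    """scores maps instance name -> (rate for model A, rate for model B).

    Returns bucket label -> (count, percent of instances).
    """
    counts = Counter(bucket(max(a, b)) for a, b in scores.values())
    total = len(scores)
    return {
        label: (counts[label], 100 * counts[label] / total if total else 0.0)
        for label in ORDER
    }

# Toy data for illustration only, not benchmark results.
scores = {
    "inst-1": (0.62, 0.40),
    "inst-2": (0.12, 0.30),
    "inst-3": (0.05, 0.00),
    "inst-4": (0.00, 0.00),
}
dist = distribution(scores)
for label, (count, percent) in dist.items():
    print(f"{label}: {count} ({percent:.1f}% of instances)")
```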

Min 5.2%
Max 96.6%