Phase 01
Instance Preparation
- Identify target open-source libraries
- Create specification PDFs
- Build Dockerfiles & setup scripts for hermetic environments
Library Generation from Scratch
§01 · Overview
Kaiju evaluates frontier AI coding models on whole-library generation from a specification. Each instance pairs a specification PDF with hermetic build infrastructure (a Dockerfile and setup scripts), and the model must produce a complete, functional implementation that passes a comprehensive test suite. Two models, GLM-5 and Nova-2-Lite, are evaluated through a three-stage pipeline: initial generation, lint refinement, and test refinement. Module-level pass rates are reported per stage and per difficulty tier.
§02 · Key Metrics
Four numbers that define the scope of Kaiju.
Library Instances
100% structural completeness
Frontier Models
2
GLM-5 & Nova-2-Lite
Languages
8
Python, JS, TS, Go, Rust, C, C++, Java
Refinement Stages
3
Generate → Lint → Test
§03 · Pipeline
The method
From spec PDF through code generation to refinement, each stage builds on the last.
Phase 01
Phase 02
Phase 03
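The three-stage flow (generate, then refine against lint output, then refine against test output) can be sketched as below. This is a minimal illustration, not Kaiju's actual harness: `run_pipeline`, `StageResult`, and the three callables are hypothetical names, and the stub functions stand in for a real model call, linter, and test runner.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical callable types; Kaiju's real interfaces are not given in the text.
Generate = Callable[[str, Optional[str], Optional[str]], str]  # (spec, prior_code, feedback) -> code
Check = Callable[[str], str]                                   # code -> textual report


@dataclass
class StageResult:
    stage: str     # "generate", "lint", or "test"
    code: str      # library source after this stage
    feedback: str  # report that drove this stage's revision ("" for stage 1)


def run_pipeline(spec: str, generate: Generate, lint: Check, run_tests: Check) -> List[StageResult]:
    """Run generate -> lint refinement -> test refinement, keeping each stage's output."""
    results: List[StageResult] = []

    # Stage 1: initial generation from the specification alone.
    code = generate(spec, None, None)
    results.append(StageResult("generate", code, ""))

    # Stage 2: revise the stage-1 code using the linter's report.
    lint_report = lint(code)
    code = generate(spec, code, lint_report)
    results.append(StageResult("lint", code, lint_report))

    # Stage 3: revise the stage-2 code using the test runner's report.
    test_report = run_tests(code)
    code = generate(spec, code, test_report)
    results.append(StageResult("test", code, test_report))
    return results


# Toy stand-ins (assumptions) so the sketch runs without a model or a container.
def fake_generate(spec: str, prior_code: Optional[str], feedback: Optional[str]) -> str:
    if prior_code is None:
        return f"# implements: {spec}"
    return prior_code + f"\n# revised after: {feedback}"


def fake_lint(code: str) -> str:
    return "1 lint warning"


def fake_tests(code: str) -> str:
    return "2 tests failing"


results = run_pipeline("toy spec", fake_generate, fake_lint, fake_tests)
```

Keeping one `StageResult` per stage mirrors the way pass rates are reported per stage: each intermediate version of the library can be scored separately.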
§04 · Results
Stage 3 pass rate by number of files affected for two frontier models.
Both models degrade as library size increases: stage-3 pass rate falls from ~57% at 1–5 files to ~16% beyond 100 files.
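A breakdown like this one amounts to grouping instances by file count and averaging the stage-3 pass rate within each group. The sketch below assumes that reading. Only the "1–5" and "beyond 100" buckets appear in the text; the intermediate bucket edges, the function names, and the sample values are illustrative assumptions, not Kaiju data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (low, high, label); high=None means open-ended.
# Only "1-5" and ">100" come from the text; the middle buckets are assumed.
BUCKETS = [(1, 5, "1-5"), (6, 20, "6-20"), (21, 100, "21-100"), (101, None, ">100")]


def bucket_for(n_files: int) -> str:
    """Return the label of the file-count bucket containing n_files."""
    for low, high, label in BUCKETS:
        if n_files >= low and (high is None or n_files <= high):
            return label
    raise ValueError(f"file count must be >= 1, got {n_files}")


def mean_pass_rate_by_bucket(instances: Iterable[Tuple[int, float]]) -> Dict[str, float]:
    """Average pass rate (0.0-1.0) per bucket for (n_files, pass_rate) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for n_files, rate in instances:
        label = bucket_for(n_files)
        totals[label] += rate
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in counts}


# Made-up sample values, for illustration only.
sample = [(3, 0.5), (4, 0.75), (150, 0.25)]
by_bucket = mean_pass_rate_by_bucket(sample)
```

Buckets with no instances are omitted from the result rather than reported as zero, so an empty bucket cannot be mistaken for a 0% pass rate.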
§05 · Dataset Viewer
Sample instances in the Kaiju dataset.
| Instance | Difficulty | Tests | Python | GLM-5 (s3) | Nova-2-Lite (s3) |
|---|---|---|---|---|---|
§06 · Pass Rate Distribution
Distribution of best stage-3 pass rate (highest of GLM-5 and Nova-2-Lite) across all evaluated library instances.
≥ 50% Pass: 50.0% of instances
10–49% Pass: 50.0% of instances
< 10% Pass: 0% of instances
0% Pass: 0% of instances
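The distribution above can be reproduced by taking, for each instance, the higher of the two models' stage-3 pass rates and assigning it to a band. The following is a minimal sketch under two assumptions: the "< 10%" band excludes instances at exactly 0% (which get their own band), and rates are stored as fractions in [0, 1]. The function names and sample values are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

LABELS = ["≥ 50%", "10–49%", "< 10%", "0%"]


def band(rate: float) -> str:
    """Map a best-of-two pass rate (0.0-1.0) to its band label."""
    if rate >= 0.50:
        return "≥ 50%"
    if rate >= 0.10:
        return "10–49%"
    if rate > 0.0:
        return "< 10%"  # assumed to exclude exact zeros
    return "0%"


def distribution(pairs: List[Tuple[float, float]]) -> Dict[str, Tuple[int, float]]:
    """Return {band: (count, percent of instances)} for (glm5, nova) stage-3 rates."""
    counts = Counter(band(max(glm5, nova)) for glm5, nova in pairs)
    total = len(pairs)
    return {
        label: (counts[label], 100.0 * counts[label] / total if total else 0.0)
        for label in LABELS
    }


# Made-up (GLM-5, Nova-2-Lite) rates, for illustration only.
sample_pairs = [(0.6, 0.2), (0.3, 0.1), (0.0, 0.05), (0.0, 0.0)]
dist = distribution(sample_pairs)
```

Because the best of the two models is used, an instance lands in the 0% band only when both models fail every test on it.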
§07 · Resources