Kaiju — Library Generation from Scratch

§01 · Overview

Kaiju evaluates frontier AI coding models on whole-library generation from specification. Each instance pairs a specification PDF with hermetic build infrastructure (Dockerfile and setup scripts), requiring models to produce complete, functional implementations that pass comprehensive test suites. Two models (GLM-5 and Nova-2-Lite) are evaluated through a three-stage pipeline: initial generation, lint refinement, and test refinement. Module-level pass rates are reported per stage and difficulty tier.

§02 · Key Metrics

Four numbers that define the scope of Kaiju.

Library Instances

100% structural completeness

Frontier Models

2

GLM-5 & Nova-2-Lite

Languages

8

Python, JS, TS, Go, Rust, C, C++, Java

Refinement Stages

3

Generate → Lint → Test

§03 · Pipeline

The method

Three phases turn a specification into a working library.

From spec PDF through code generation to evaluation and refinement, each phase builds on the last.

Phase 01

Instance Preparation

Identify target open-source libraries
Create specification PDFs
Build Dockerfiles & setup scripts for hermetic environments

Phase 02

Code Generation

Run AI models in sandboxes
Generate complete library implementations from specifications
Produce per-module outputs

Phase 03

Evaluation & Refinement

Execute test suites, score pass rates
Run lint and test refinement stages
Produce confidence scores
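
The scoring step above can be sketched as follows. This is a minimal illustration, not Kaiju's actual implementation: the `ModuleResult` record, its field names, and the tier labels are hypothetical, and the pass rate is assumed to be the fraction of tests passed per module, averaged within each stage and difficulty tier.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleResult:
    """Test outcome for one module at one stage (hypothetical schema)."""
    stage: int        # 1 = generation, 2 = lint refinement, 3 = test refinement
    difficulty: str   # illustrative tier label, e.g. "easy" or "hard"
    passed: int       # number of passing tests
    total: int        # number of tests executed


def pass_rate(result: ModuleResult) -> float:
    """Fraction of tests passed; a module with no tests scores 0.0."""
    return result.passed / result.total if result.total else 0.0


def mean_pass_rate_by_stage_and_tier(results):
    """Average module-level pass rates within each (stage, difficulty) group."""
    groups = defaultdict(list)
    for r in results:
        groups[(r.stage, r.difficulty)].append(pass_rate(r))
    return {key: sum(rates) / len(rates) for key, rates in groups.items()}


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    demo = [
        ModuleResult(stage=1, difficulty="easy", passed=3, total=10),
        ModuleResult(stage=3, difficulty="easy", passed=8, total=10),
        ModuleResult(stage=3, difficulty="easy", passed=6, total=10),
        ModuleResult(stage=3, difficulty="hard", passed=1, total=4),
    ]
    for key, rate in sorted(mean_pass_rate_by_stage_and_tier(demo).items()):
        print(key, f"{rate:.1%}")
```

Grouping by `(stage, difficulty)` keeps the per-stage and per-tier views in one structure, so either can be read off by filtering the keys.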

§04 · Results

Stage 3 pass rate by number of files affected for two frontier models.

For both models, the stage 3 pass rate falls as the number of files affected grows: from ~57% at 1–5 files to ~16% beyond 100 files.
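
One way such a breakdown could be computed is to bin instances by files affected and average the stage 3 pass rate within each bin. In the sketch below, only the "1-5" and ">100" bins come from the text; the intermediate bin edges, the function names, and the input format are assumptions made for illustration.

```python
import bisect

# Inclusive upper bounds of each bin. Only "1-5" and ">100" appear in the
# text; the intermediate edges are assumptions for illustration.
BIN_UPPER_BOUNDS = [5, 20, 100]
BIN_LABELS = ["1-5", "6-20", "21-100", ">100"]


def files_bin(files_affected: int) -> str:
    """Map a file count to its bin label."""
    if files_affected < 1:
        raise ValueError("files_affected must be at least 1")
    # bisect_left returns the index of the first upper bound >= files_affected,
    # or len(BIN_UPPER_BOUNDS) when the count exceeds every bound.
    return BIN_LABELS[bisect.bisect_left(BIN_UPPER_BOUNDS, files_affected)]


def mean_pass_rate_by_bin(instances):
    """instances: iterable of (files_affected, stage3_pass_rate) pairs."""
    sums = {}
    for files_affected, rate in instances:
        label = files_bin(files_affected)
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + rate, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}


if __name__ == "__main__":
    # Made-up pass rates, not Kaiju results.
    demo = [(3, 0.60), (5, 0.50), (40, 0.30), (250, 0.10)]
    print(mean_pass_rate_by_bin(demo))
```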

Fig. 1 · Stage 3 Pass Rate by Files Affected

Fig. 2 · Mean Pass Rate by Model and Stage

Fig. 3 · Stage Performance by Difficulty

§05 · Dataset Viewer

Sample instances in the Kaiju dataset.

Instance	Difficulty	Tests	Python	GLM-5 (s3)	Nova-2-Lite (s3)

§06 · Pass Rate Distribution

Distribution of best stage-3 pass rate (highest of GLM-5 and Nova-2-Lite) across all evaluated library instances.

≥ 50% Pass: 50.0% of instances

10–49% Pass: 50.0% of instances

< 10% Pass: 0% of instances

0% Pass: 0% of instances

Min 5.2%

Max 96.6%
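
The bucketing above can be sketched as follows, assuming each instance carries one stage 3 pass rate per model expressed as a fraction in [0, 1]. The function names and the boundary handling (for example, treating anything below 50% but at least 10% as "10-49%") are assumptions, not taken from the Kaiju codebase.

```python
from collections import Counter

BUCKETS = (">= 50%", "10-49%", "< 10%", "0%")


def best_stage3(glm5_rate: float, nova_rate: float) -> float:
    """Best stage 3 pass rate for an instance: the higher of the two models."""
    return max(glm5_rate, nova_rate)


def bucket(rate: float) -> str:
    """Assign a pass rate (fraction in [0, 1]) to a distribution bucket."""
    if rate >= 0.50:
        return ">= 50%"
    if rate >= 0.10:
        return "10-49%"
    if rate > 0.0:
        return "< 10%"
    return "0%"


def distribution(pairs):
    """pairs: iterable of (glm5_rate, nova_rate); returns the share per bucket."""
    counts = Counter(bucket(best_stage3(a, b)) for a, b in pairs)
    n = sum(counts.values())
    if n == 0:
        return {}
    return {label: counts[label] / n for label in BUCKETS}


if __name__ == "__main__":
    # Made-up per-model rates, for illustration only.
    demo = [(0.80, 0.40), (0.20, 0.30), (0.05, 0.00), (0.00, 0.00)]
    for label, share in distribution(demo).items():
        print(f"{label}: {share:.1%} of instances")
```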

§07 · Resources

GitHub

Trajectories

github.com/Ethara-Ai/kaiju_ots

HuggingFace

Dataset

huggingface.co/datasets/ethara/Kaiju