Phase 01
Data Curation
- Collect real-world documents: PDFs, images, videos, web pages
- Design multi-step questions requiring cross-modal reasoning
- Establish deterministic ground truth answers
- Assign difficulty levels (Level 1, Level 2, Level 3)
Multi-Step, Multi-Category Training for General AI Assistants. GAIA Methodology.
§01 · The environment
TERRA is an RL environment that trains AI assistants on multi-modal reasoning and tool-use tasks following the GAIA methodology. Given real-world documents, images, videos, and web-accessible information, agents must decompose complex questions into multi-step reasoning chains, select appropriate tools, and produce precise answers. TERRA requires coordinating across PDF analysis, web search, image understanding, video comprehension, and mathematical reasoning, mirroring the complexity of real-world information work.
§02 · The scale
Curated instances. Five categories. Three difficulty levels.
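As a rough illustration of how one curated instance might be represented, here is a minimal sketch. The field names, the `Task` class, and the `Category` values are assumptions made for this example (the category names are inferred from the capabilities listed in §01), not the dataset's published schema.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # Hypothetical category labels, inferred from the capabilities named in §01.
    PDF_ANALYSIS = "pdf_analysis"
    WEB_SEARCH = "web_search"
    IMAGE_UNDERSTANDING = "image_understanding"
    VIDEO_COMPREHENSION = "video_comprehension"
    MATHEMATICAL_REASONING = "mathematical_reasoning"


@dataclass(frozen=True)
class Task:
    # All field names are illustrative assumptions.
    task_id: str
    question: str
    ground_truth: str
    difficulty: int  # 1, 2, or 3
    category: Category

    def __post_init__(self) -> None:
        # Reject difficulty levels outside the three documented ones.
        if self.difficulty not in (1, 2, 3):
            raise ValueError(f"difficulty must be 1, 2, or 3, got {self.difficulty}")


example = Task(
    task_id="example-001",
    question="Placeholder question text",
    ground_truth="placeholder answer",
    difficulty=2,
    category=Category.PDF_ANALYSIS,
)
```

Making the record immutable (`frozen=True`) is a design choice for this sketch: it keeps the ground truth from being altered after curation.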
§03 · The pipeline
The method
Phase 01
Phase 02
Phase 03
§04 · Baseline validation
We validated difficulty calibration by running two frontier models, Kimi K2.5 and Nova-2-Lite, on the dataset.
Results computed dynamically from the dataset below.
§05 · Model comparison
Side-by-side performance breakdown across difficulty levels and categories.
§06 · Difficulty distribution
Pass score breakdown by difficulty level across both models.
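A breakdown like this can be computed by grouping per-task outcomes by model and difficulty level. The sketch below assumes each result is a `(model, difficulty, passed)` tuple; that layout, the function name, and the placeholder model names are assumptions for illustration, and the sample rows are not real results.

```python
from collections import defaultdict


def pass_rate_by_difficulty(results):
    """Return {(model, difficulty): pass rate} from (model, difficulty, passed) tuples."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for model, difficulty, passed in results:
        key = (model, difficulty)
        totals[key] += 1
        passes[key] += int(passed)
    # Every key in totals has at least one entry, so the division is safe.
    return {key: passes[key] / totals[key] for key in totals}


# Made-up rows with placeholder model names, purely to show the call shape.
sample = [
    ("model_a", 1, True),
    ("model_a", 1, False),
    ("model_b", 1, True),
    ("model_b", 3, False),
]
rates = pass_rate_by_difficulty(sample)
```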
§07 · Dataset viewer
| Task ID | Difficulty | Category | Kimi K2.5 | Nova-2-Lite |
|---|---|---|---|---|
§08 · Methodology
Three scoring principles ensure fair and reproducible results.
- Answers must match ground truth exactly. No partial credit, no fuzzy matching.
- All tool calls are logged. Models must demonstrate appropriate tool selection.
- Fixed seeds, deterministic scoring, and published solution traces allow independent verification.
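The first two principles can be sketched in a few lines. This is a minimal illustration, not TERRA's scoring code: the `Episode` class, its method names, and the tool name in the usage example are hypothetical. Strict string equality is used here as one reading of "exactly"; whether the real scorer normalizes whitespace or case is not stated.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    # Hypothetical container for one task attempt.
    ground_truth: str
    tool_calls: list = field(default_factory=list)

    def log_tool_call(self, tool: str, arguments: dict) -> None:
        # Record every tool invocation so tool selection can be reviewed later.
        self.tool_calls.append({"tool": tool, "arguments": arguments})

    def score(self, answer: str) -> int:
        # Binary exact match: 1 if identical to the ground truth, else 0.
        return 1 if answer == self.ground_truth else 0


episode = Episode(ground_truth="42")
episode.log_tool_call("web_search", {"query": "placeholder query"})
exact = episode.score("42")
near_miss = episode.score("42 ")  # trailing space: no partial credit
```

Because the scorer is a pure function of the answer and the ground truth, repeated runs give the same score, which is in line with the deterministic-scoring principle.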
§09 · Resources