TERRA

RL ENVIRONMENT FOR GENERAL AI ASSISTANTS

Multi-step, multi-category training for general AI assistants, following the GAIA methodology.

TERRA is an RL environment that trains AI assistants on multi-modal reasoning and tool-use tasks following the GAIA methodology. Given real-world documents, images, videos, and web-accessible information, agents must decompose complex questions into multi-step reasoning chains, select appropriate tools, and produce precise answers. TERRA requires coordinating PDF analysis, web search, image understanding, video comprehension, and mathematical reasoning, mirroring the complexity of real-world information work.

  • Multi-modal tasks across 5 categories: Web Browsing, File Reading, Multi-Modality, Calculation, Self-Contained
  • Tool-use challenges: PDF reader, Web search, Calculator, Image/Video analysis
  • Three difficulty levels: Hard, Very Hard, Expert
  • Exact-match scoring with deterministic ground truth answers
  • GAIA methodology for reproducible multi-step training

Curated instances. Five categories. Three difficulty levels.

  • 20 instances curated across 5 categories
  • 5 categories: Calculation, Multi-Modality, File Reading, Self-Contained, Web Browsing
  • 3 difficulty levels: Hard · Very Hard · Expert

The method

Three phases turn raw documents into a training environment.

Phase 01

Data Curation

  • Collect real-world documents: PDFs, images, videos, web pages
  • Design multi-step questions requiring cross-modal reasoning
  • Establish deterministic ground truth answers
  • Assign one of three difficulty levels: Hard, Very Hard, or Expert (GAIA Levels 1–3)
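A curated task instance might be represented as follows. This is an illustrative sketch: the field names and values are assumptions, not TERRA's actual data format.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one curated TERRA task instance.
@dataclass
class TaskInstance:
    task_id: str
    category: str      # one of: web_browsing, file_reading, multimodality, calculation, self_contained
    difficulty: str    # one of: hard, very_hard, expert
    question: str
    ground_truth: str  # deterministic answer used for exact-match scoring
    attachments: list = field(default_factory=list)  # paths/URLs to PDFs, images, videos

task = TaskInstance(
    task_id="terra-001",
    category="calculation",
    difficulty="hard",
    question="What is the mean of the values in table 2 of the attached PDF?",
    ground_truth="42.5",
    attachments=["docs/report.pdf"],
)
```

Keeping the ground truth as a single string keeps exact-match scoring trivial and deterministic.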

Phase 02

Agent Execution

  • Deploy AI agents with access to tools (PDF reader, web search, calculator)
  • Agents decompose questions into multi-step reasoning chains
  • Track tool selection and execution traces
  • Record complete interaction traces for analysis
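The execution loop above can be sketched as a tool-calling episode. The `agent` interface (an `.act()` method returning a tool-call dict), the `final_answer` convention, and the stub agent are all assumptions for illustration, not TERRA's actual API.

```python
from types import SimpleNamespace

def run_episode(agent, task, tools, max_steps=10):
    trace = []                                 # complete interaction trace for analysis
    observation = task.question
    for _ in range(max_steps):
        action = agent.act(observation)        # e.g. {"tool": "web_search", "args": {...}}
        if action["tool"] == "final_answer":   # agent commits to an answer
            trace.append(("final_answer", action["args"]["text"]))
            return action["args"]["text"], trace
        result = tools[action["tool"]](**action["args"])
        trace.append((action["tool"], action["args"], result))
        observation = result                   # tool output becomes the next observation
    return None, trace                         # step budget exhausted, no answer

# Usage with a stub agent that answers immediately:
class StubAgent:
    def act(self, observation):
        return {"tool": "final_answer", "args": {"text": "42.5"}}

task = SimpleNamespace(question="What is the mean of column 2?")
answer, trace = run_episode(StubAgent(), task, tools={})
```

Recording every `(tool, args, result)` triple is what makes tool selection auditable after the fact.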

Phase 03

Reward Signal

  • Compare agent answers against ground truth via exact match
  • Binary reward: correct or incorrect, no partial credit
  • Compute per-model accuracy across difficulty levels
  • Generate per-category and per-difficulty breakdowns
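The binary reward above reduces to a short comparison function. The strip/lowercase normalization shown here is an assumption; the environment may compare raw strings.

```python
# Binary exact-match reward: 1 for a correct answer, 0 otherwise, no partial credit.
def exact_match_reward(prediction: str, ground_truth: str) -> int:
    norm = lambda s: s.strip().lower()
    return 1 if norm(prediction) == norm(ground_truth) else 0

# Per-model accuracy over a list of (prediction, ground_truth) pairs.
def accuracy(results) -> float:
    return sum(exact_match_reward(p, g) for p, g in results) / len(results)
```

Grouping the `(prediction, ground_truth)` pairs by category or difficulty before calling `accuracy` yields the per-category and per-difficulty breakdowns.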

We validated difficulty calibration using two frontier models on the dataset.

Results below are computed directly from the dataset.

Fig. 1. GAIA performance comparison across difficulty levels (Kimi K2.5 vs Nova-2-Lite).

Side-by-side performance breakdown across difficulty levels and categories.

Metrics reported: Kimi K2.5 accuracy, Nova-2-Lite accuracy, accuracy gap (percentage points), and mean accuracy across both models.

Pass score breakdown by difficulty level across both models.

Per-difficulty breakdown: pass rate and failure rate for Kimi K2.5, Nova-2-Lite, and combined at each level (Hard, Very Hard, Expert), followed by per-task results (Task ID, Difficulty, Category, Kimi K2.5, Nova-2-Lite).

Three scoring principles ensure fair and reproducible results.

01

Exact Match

Answers must match ground truth exactly. No partial credit, no fuzzy matching.

02

Tool Transparency

All tool calls are logged. Models must demonstrate appropriate tool selection.
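One lightweight way to guarantee that every tool call is logged is to wrap each tool in a recording decorator. This is a sketch under assumed names (`logged`, the `calculator` tool); it is not TERRA's actual logging mechanism.

```python
import functools

def logged(tool_name, log):
    """Wrap a tool so every call appends (tool, args, result) to the trace log."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            log.append({"tool": tool_name, "args": kwargs, "result": result})
            return result
        return wrapper
    return decorator

log = []

@logged("calculator", log)
def calculator(expression: str) -> str:
    # Illustrative only: a real tool should parse expressions safely, not eval them.
    return str(eval(expression, {"__builtins__": {}}, {}))

calculator(expression="2 + 3")
```

Because the wrapper sits between the agent and the tool, the trace cannot silently omit a call.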

03

Reproducibility

Fixed seeds, deterministic scoring, published solution traces for independent verification.
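Fixed-seed determinism can be kept local rather than global, so independent verifiers replay the exact same run. The function name below is illustrative, not part of the environment.

```python
import random

def seeded_task_order(task_ids, seed=0):
    """Return a deterministic shuffle of task_ids: same seed, same ordering."""
    rng = random.Random(seed)   # instance RNG avoids mutating global random state
    order = list(task_ids)
    rng.shuffle(order)
    return order
```

Using `random.Random(seed)` instead of the module-level functions means other code cannot perturb the ordering between runs.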