TERRA

RL ENVIRONMENT FOR GENERAL AI ASSISTANTS

Multi-step, multi-category training for general AI assistants, following the GAIA methodology.

TERRA is an RL environment that trains AI assistants on multi-modal reasoning and tool-use tasks following the GAIA methodology. Given real-world documents, images, videos, and web-accessible information, agents must decompose complex questions into multi-step reasoning chains, select appropriate tools, and produce precise answers. TERRA requires coordinating PDF analysis, web search, image understanding, video comprehension, and mathematical reasoning, mirroring the complexity of real-world information work.

  • Multi-modal tasks across 5 categories: Web Browsing, File Reading, Multi-Modality, Calculation, Self-Contained
  • Tool-use challenges: PDF reader, Web search, Calculator, Image/Video analysis
  • Three difficulty levels: Hard, Very Hard, Expert
  • Exact-match scoring with deterministic ground truth answers
  • GAIA methodology for reproducible multi-step training

Curated instances. Five categories. Three difficulty levels.

  • 20 instances curated across 5 categories
  • 5 categories: Calculation, Multi-Modality, File Reading, Self-Contained, Web Browsing
  • 3 difficulty levels: Hard · Very Hard · Expert

The method

Three phases turn raw documents into a training environment.

Phase 01

Data Curation

  • Collect real-world documents: PDFs, images, videos, web pages
  • Design multi-step questions requiring cross-modal reasoning
  • Establish deterministic ground truth answers
  • Assign one of three difficulty levels: Hard, Very Hard, or Expert (GAIA Levels 1–3)
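A curated task instance might be represented as follows. This is an illustrative sketch: the field names and values are assumptions, not TERRA's actual data format.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one curated TERRA task instance.
@dataclass
class TaskInstance:
    task_id: str
    category: str      # one of: web_browsing, file_reading, multimodality, calculation, self_contained
    difficulty: str    # one of: hard, very_hard, expert
    question: str
    ground_truth: str  # deterministic answer used for exact-match scoring
    attachments: list = field(default_factory=list)  # paths/URLs to PDFs, images, videos

task = TaskInstance(
    task_id="terra-001",
    category="calculation",
    difficulty="hard",
    question="What is the mean of the values in table 2 of the attached PDF?",
    ground_truth="42.5",
    attachments=["docs/report.pdf"],
)
```

Keeping the ground truth as a single string keeps exact-match scoring trivial and deterministic.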

Phase 02

Agent Execution

  • Deploy AI agents with access to tools (PDF reader, web search, calculator)
  • Agents decompose questions into multi-step reasoning chains
  • Track tool selection and execution traces
  • Record complete interaction traces for analysis
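The execution loop above can be sketched as a tool-calling episode. The `agent` interface (an `.act()` method returning a tool-call dict), the `final_answer` convention, and the stub agent are all assumptions for illustration, not TERRA's actual API.

```python
from types import SimpleNamespace

def run_episode(agent, task, tools, max_steps=10):
    trace = []                                 # complete interaction trace for analysis
    observation = task.question
    for _ in range(max_steps):
        action = agent.act(observation)        # e.g. {"tool": "web_search", "args": {...}}
        if action["tool"] == "final_answer":   # agent commits to an answer
            trace.append(("final_answer", action["args"]["text"]))
            return action["args"]["text"], trace
        result = tools[action["tool"]](**action["args"])
        trace.append((action["tool"], action["args"], result))
        observation = result                   # tool output becomes the next observation
    return None, trace                         # step budget exhausted, no answer

# Usage with a stub agent that answers immediately:
class StubAgent:
    def act(self, observation):
        return {"tool": "final_answer", "args": {"text": "42.5"}}

task = SimpleNamespace(question="What is the mean of column 2?")
answer, trace = run_episode(StubAgent(), task, tools={})
```

Recording every `(tool, args, result)` triple is what makes tool selection auditable after the fact.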

Phase 03

Reward Signal

  • Compare agent answers against ground truth via exact match
  • Binary reward: correct or incorrect, no partial credit
  • Compute per-model accuracy across difficulty levels
  • Generate per-category and per-difficulty breakdowns
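The binary reward above reduces to a short comparison function. The strip/lowercase normalization shown here is an assumption; the environment may compare raw strings.

```python
# Binary exact-match reward: 1 for a correct answer, 0 otherwise, no partial credit.
def exact_match_reward(prediction: str, ground_truth: str) -> int:
    norm = lambda s: s.strip().lower()
    return 1 if norm(prediction) == norm(ground_truth) else 0

# Per-model accuracy over a list of (prediction, ground_truth) pairs.
def accuracy(results) -> float:
    return sum(exact_match_reward(p, g) for p, g in results) / len(results)
```

Grouping the `(prediction, ground_truth)` pairs by category or difficulty before calling `accuracy` yields the per-category and per-difficulty breakdowns.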

We validated difficulty calibration using two frontier models on the dataset.

Results below are computed directly from the dataset.

Fig. 1. GAIA performance comparison across difficulty levels (Kimi K2.5 vs Nova-2-Lite).

Side-by-side performance breakdown across difficulty levels and categories.

Metrics reported: Kimi K2.5 accuracy, Nova-2-Lite accuracy, accuracy gap (percentage points), and mean accuracy across both models.

Pass score breakdown by difficulty level across both models.

Per-difficulty breakdown: pass rate and failure rate for Kimi K2.5, Nova-2-Lite, and combined at each level (Hard, Very Hard, Expert), followed by per-task results (Task ID, Difficulty, Category, Kimi K2.5, Nova-2-Lite).

Three scoring principles ensure fair and reproducible results.

01

Exact Match

Answers must match ground truth exactly. No partial credit, no fuzzy matching.

02

Tool Transparency

All tool calls are logged. Models must demonstrate appropriate tool selection.
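One lightweight way to guarantee that every tool call is logged is to wrap each tool in a recording decorator. This is a sketch under assumed names (`logged`, the `calculator` tool); it is not TERRA's actual logging mechanism.

```python
import functools

def logged(tool_name, log):
    """Wrap a tool so every call appends (tool, args, result) to the trace log."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            log.append({"tool": tool_name, "args": kwargs, "result": result})
            return result
        return wrapper
    return decorator

log = []

@logged("calculator", log)
def calculator(expression: str) -> str:
    # Illustrative only: a real tool should parse expressions safely, not eval them.
    return str(eval(expression, {"__builtins__": {}}, {}))

calculator(expression="2 + 3")
```

Because the wrapper sits between the agent and the tool, the trace cannot silently omit a call.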

03

Reproducibility

Fixed seeds, deterministic scoring, published solution traces for independent verification.
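Fixed-seed determinism can be kept local rather than global, so independent verifiers replay the exact same run. The function name below is illustrative, not part of the environment.

```python
import random

def seeded_task_order(task_ids, seed=0):
    """Return a deterministic shuffle of task_ids: same seed, same ordering."""
    rng = random.Random(seed)   # instance RNG avoids mutating global random state
    order = list(task_ids)
    rng.shuffle(order)
    return order
```

Using `random.Random(seed)` instead of the module-level functions means other code cannot perturb the ordering between runs.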