NOKOR

GENERAL AI ASSISTANT BENCHMARK

Multi-Modal Reasoning and Tool-Use Evaluation — GAIA Framework

NOKOR evaluates frontier AI assistants on multi-modal reasoning and tool-use tasks derived from the GAIA (General AI Assistant) framework. Given real-world documents, images, videos, and web-accessible information, models must decompose complex questions into multi-step reasoning chains, select appropriate tools, and synthesize precise answers. Unlike single-modality benchmarks, NOKOR requires coordinating PDF analysis, web search, image understanding, video comprehension, and mathematical reasoning, mirroring the complexity of real-world information work.

  • Multi-modal evaluation across 5 modalities: PDF, Web, Image, Video, Reasoning
  • Tool-use assessment: PDF reader, Web search, Calculator, Image/Video analysis
  • Three difficulty tiers: Hard, Very Hard, Expert
  • Exact-match scoring with deterministic ground truth answers
  • Cost and latency tracking per model per instance
  • GAIA methodology for reproducible multi-step evaluation

Ten thousand curated instances. Two frontier models. Five modalities.

10,000 instances curated across 5 modalities
2 models evaluated: Kimi 2.5 · Nova-2-Lite
20,000 evaluations completed: 2 models × 10,000 instances

The method

Three phases turn raw documents into scored evaluations.

Phase 01

Data Curation

  • Collect real-world documents: PDFs, images, videos, web pages
  • Design multi-step questions requiring cross-modal reasoning
  • Establish deterministic ground truth answers
  • Assign difficulty levels (Hard, Very Hard, Expert)
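
A minimal sketch of what a curated instance record could look like under the description above; the field names and the Difficulty enum are illustrative assumptions, not the published schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Difficulty(Enum):
    HARD = "hard"
    VERY_HARD = "very_hard"
    EXPERT = "expert"


@dataclass
class Instance:
    """One curated NOKOR task (hypothetical schema, for illustration only)."""
    instance_id: str
    question: str      # multi-step question requiring cross-modal reasoning
    modality: str      # "pdf" | "web" | "image" | "video" | "reasoning"
    difficulty: Difficulty
    ground_truth: str  # deterministic answer used for exact-match scoring
    attachments: list[str] = field(default_factory=list)  # paths or URLs to source material
```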

Phase 02

Agent Execution

  • Deploy AI agents with access to tools (PDF reader, web search, calculator)
  • Agents decompose questions into multi-step reasoning chains
  • Track tool selection, execution time, and API costs
  • Record complete interaction trajectories
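
A hedged sketch of the harness this phase describes: the agent proposes one action per step, the harness dispatches the chosen tool, and every call is appended to the trajectory with its latency and cost. The agent interface, the action dictionary, and the stub tools are all assumptions for illustration.

```python
import time

# Stub tools standing in for the real implementations (illustrative only).
def pdf_reader(args: dict) -> str: return ""
def web_search(args: dict) -> str: return ""
def calculator(args: dict) -> str: return ""

TOOLS = {"pdf_reader": pdf_reader, "web_search": web_search, "calculator": calculator}


def run_instance(agent, instance, max_steps: int = 20):
    """Drive one agent on one instance, recording the full interaction trajectory."""
    trajectory = []
    for _ in range(max_steps):
        action = agent.next_action(instance, trajectory)  # assumed agent interface
        if action["kind"] == "answer":                    # agent commits to a final answer
            return action["text"], trajectory
        start = time.monotonic()
        observation = TOOLS[action["tool"]](action["args"])
        trajectory.append({
            "tool": action["tool"],                       # which tool was selected
            "args": action["args"],
            "observation": observation,
            "latency_s": time.monotonic() - start,        # execution time per call
            "cost_usd": action.get("cost_usd", 0.0),      # API cost per call
        })
    return None, trajectory                               # step budget exhausted, no answer
```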

Phase 03

Scoring

  • Compare agent answers against ground truth (exact match)
  • Compute per-model accuracy across difficulty levels
  • Analyze cost-efficiency and latency trade-offs
  • Generate per-modality and per-difficulty breakdowns
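
A minimal scoring sketch assuming the instance schema sketched in Phase 01; treating surrounding whitespace as insignificant is an assumption, everything else follows the exact-match rule stated in the principles below.

```python
from collections import defaultdict


def exact_match(predicted: str | None, truth: str) -> bool:
    """Strict string equality (no partial credit, no fuzzy matching)."""
    return predicted is not None and predicted.strip() == truth.strip()


def accuracy_by_difficulty(results):
    """results: iterable of (instance, predicted_answer) pairs."""
    tallies = defaultdict(lambda: [0, 0])          # difficulty -> [passed, total]
    for instance, predicted in results:
        bucket = tallies[instance.difficulty]
        bucket[0] += exact_match(predicted, instance.ground_truth)
        bucket[1] += 1
    return {d: passed / total for d, (passed, total) in tallies.items()}
```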

Overall accuracy for two frontier models (Kimi 2.5 at 75%, Nova-2-Lite at 20%).

Kimi 2.5 dominates across all difficulty levels but at 12× the cost of Nova-2-Lite.

Kimi 2.5: 75% accuracy (15/20 pass) · $0.36/instance · 420 s avg latency
Nova-2-Lite: 20% accuracy (4/20 pass) · $0.03/instance · 280 s avg latency

Side-by-side performance breakdown across difficulty tiers and modalities.

Kimi 2.5 accuracy: 75% (15/20 pass)
Nova-2-Lite accuracy: 20% (4/20 pass)
Cost ratio (Kimi/Nova): 12× ($0.36 vs $0.03 per instance)
Accuracy gap: 55 percentage points (75% − 20%)
Mean accuracy (both models): 47.5%

Task breakdown by difficulty level and modality coverage.

By difficulty: Hard 25% (5 instances) · Very Hard 40% (8 instances) · Expert 35% (7 instances)
By modality: PDF 34% · Web 38% · Image 15% · Video 8% · Reasoning 5%

Total instances: 10,000
Models: 2
Modalities: 5
Kimi 2.5 accuracy: 75%

Per-instance results table (columns: Instance · Difficulty · Modality · Kimi 2.5 · Nova-2-Lite · Tools).

Four scoring principles ensure fair and reproducible evaluation.

01

Exact Match

Answers must match ground truth exactly — no partial credit, no fuzzy matching.

02

Tool Transparency

All tool calls are logged. Models must demonstrate appropriate tool selection.

03

Cost Tracking

Every API call is costed. Accuracy must be weighed against economic viability.
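
A hedged sketch of how per-call costing might be computed; the per-million-token prices below are placeholders, not the models' actual rates.

```python
# Placeholder prices in USD per 1M tokens (assumptions, not actual rates).
PRICE_PER_M_TOKENS = {
    "kimi-2.5":    {"input": 1.00, "output": 3.00},
    "nova-2-lite": {"input": 0.06, "output": 0.24},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single API call."""
    price = PRICE_PER_M_TOKENS[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
```

Summing call_cost over every call in a trajectory yields per-instance figures like the $0.36 and $0.03 reported above.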

04

Reproducibility

Fixed seeds, deterministic scoring, published trajectories for independent verification.
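
A sketch of what pinning a run could look like, assuming a simple JSON config: one fixed seed shared by all runs, temperature-zero decoding, and the complete trajectories written out for independent re-scoring. The config keys are illustrative.

```python
import json
import random

RUN_CONFIG = {
    "seed": 42,          # fixed seed shared across runs (illustrative value)
    "temperature": 0.0,  # deterministic decoding
    "max_steps": 20,
}


def pin_randomness(config: dict) -> None:
    """Seed every source of randomness before any instance executes."""
    random.seed(config["seed"])


def publish_run(config: dict, trajectories: list, path: str) -> None:
    """Write the exact config and complete trajectories for verification."""
    with open(path, "w") as f:
        json.dump({"config": config, "trajectories": trajectories}, f, indent=2)
```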