Multi-Modal Reasoning and Tool-Use Evaluation — GAIA Framework
§01 · The benchmark
NOKOR evaluates frontier AI assistants on multi-modal reasoning and tool-use tasks derived from the GAIA (General AI Assistant) framework. Given real-world documents, images, videos, and web-accessible information, models must decompose complex questions into multi-step reasoning chains, select appropriate tools, and synthesize precise answers. Unlike single-modality benchmarks, NOKOR requires coordinating across PDF analysis, web search, image understanding, video comprehension, and mathematical reasoning — mirroring the complexity of real-world information work.
§02 · The scale
Ten thousand curated instances. Two frontier models. Five modalities.
§03 · The pipeline
Phase 01 · Data Curation
- Collect real-world documents: PDFs, images, videos, and web pages
- Design multi-step questions requiring cross-modal reasoning
- Establish deterministic ground-truth answers
- Assign difficulty levels (Hard, Very Hard, Expert)
Phase 02
Phase 03
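The curation steps above imply a per-instance record combining a question, a deterministic answer, a difficulty tier, and modality/tool annotations. A minimal sketch of such a record follows; all field names and the example values are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TaskInstance:
    """Hypothetical schema for one curated NOKOR instance."""
    instance_id: str
    question: str                 # multi-step, cross-modal question
    ground_truth: str             # deterministic answer for exact-match scoring
    difficulty: str               # one of: "Hard", "Very Hard", "Expert"
    modalities: list[str] = field(default_factory=list)  # e.g. ["pdf", "web"]
    tools: list[str] = field(default_factory=list)       # tools the task expects

# Invented example instance for illustration only.
example = TaskInstance(
    instance_id="nokor-000001",
    question="Using the attached PDF and a web search, in what year was "
             "the cited report first published?",
    ground_truth="1987",
    difficulty="Hard",
    modalities=["pdf", "web"],
    tools=["pdf_reader", "web_search"],
)
```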
§04 · The results
Overall accuracy for two frontier models (Kimi 2.5 at 75%, Nova-2-Lite at 20%).
Kimi 2.5 dominates across all difficulty levels but at 12× the cost of Nova-2-Lite.
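The accuracy-versus-cost trade-off stated above (75% vs. 20% accuracy, at 12× relative cost) can be made concrete with a rough cost-effectiveness calculation; the unit cost of 1.0 is an arbitrary baseline, not a real price.

```python
# Accuracy and relative cost taken from the results above; the
# baseline cost unit of 1.0 is an arbitrary placeholder.
models = {
    "Kimi 2.5":    {"accuracy": 0.75, "relative_cost": 12.0},
    "Nova-2-Lite": {"accuracy": 0.20, "relative_cost": 1.0},
}

for name, m in models.items():
    # Expected correct answers obtained per unit of spend.
    per_cost = m["accuracy"] / m["relative_cost"]
    print(f"{name}: {per_cost:.4f} correct answers per cost unit")
```

On these numbers, Nova-2-Lite yields more correct answers per unit of spend (0.20 vs. 0.0625), while Kimi 2.5 remains the only viable option when accuracy is the binding constraint.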
§05 · Model comparison
Side-by-side performance breakdown across difficulty tiers and modalities.
§06 · Difficulty distribution
Task breakdown by difficulty level and modality coverage.
§07 · Dataset viewer
Total Instances: 10,000
Models: 2
Modalities: 5
Kimi 2.5 Accuracy: 75%
| Instance | Difficulty | Modality | Kimi 2.5 | Nova-2-Lite | Tools |
|---|---|---|---|---|---|
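Rows in the viewer follow the column schema above; a filtering pass over such records can be sketched as follows. The two rows are invented for illustration and are not real benchmark data.

```python
# Toy rows following the viewer's column schema; values are invented.
rows = [
    {"Instance": "nokor-000001", "Difficulty": "Hard",
     "Modality": "pdf", "Kimi 2.5": 1, "Nova-2-Lite": 0, "Tools": 3},
    {"Instance": "nokor-000002", "Difficulty": "Expert",
     "Modality": "video", "Kimi 2.5": 0, "Nova-2-Lite": 0, "Tools": 5},
]

# Select only the Expert-tier instances from this toy sample.
expert = [r for r in rows if r["Difficulty"] == "Expert"]
print([r["Instance"] for r in expert])
```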
§08 · Methodology
Four scoring principles ensure fair and reproducible evaluation:
- Exact match: answers must match ground truth exactly — no partial credit, no fuzzy matching.
- Tool-use auditing: all tool calls are logged, and models must demonstrate appropriate tool selection.
- Cost accounting: every API call is costed; accuracy must be weighed against economic viability.
- Reproducibility: fixed seeds, deterministic scoring, and published trajectories for independent verification.
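The exact-match principle above can be sketched as a scorer; whitespace trimming is an assumption on my part, since the text does not say whether any normalization is applied before comparison.

```python
def normalize(answer: str) -> str:
    # Trim surrounding whitespace only; any further normalization
    # (case, number formatting) would go beyond the stated principle.
    return answer.strip()

def exact_match(prediction: str, ground_truth: str) -> bool:
    # Strict equality: no partial credit, no fuzzy matching.
    return normalize(prediction) == normalize(ground_truth)

print(exact_match(" 1987 ", "1987"))   # True
print(exact_match("1987 AD", "1987"))  # False: no fuzzy matching
```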
§09 · Resources