NOKOR

GENERAL AI ASSISTANT BENCHMARK

Multi-Modal Reasoning and Tool-Use Evaluation — GAIA Framework

NOKOR evaluates frontier AI assistants on multi-modal reasoning and tool-use tasks derived from the GAIA (General AI Assistant) framework. Given real-world documents, images, videos, and web-accessible information, models must decompose complex questions into multi-step reasoning chains, select appropriate tools, and synthesize precise answers. Unlike single-modality benchmarks, NOKOR requires coordinating PDF analysis, web search, image understanding, video comprehension, and mathematical reasoning, mirroring the complexity of real-world information work.

  • Multi-modal evaluation across 5 modalities: PDF, Web, Image, Video, Reasoning
  • Tool-use assessment: PDF reader, Web search, Calculator, Image/Video analysis
  • Three difficulty tiers: Hard, Very Hard, Expert
  • Exact-match scoring with deterministic ground truth answers
  • Cost and latency tracking per model per instance
  • GAIA methodology for reproducible multi-step evaluation

Ten thousand curated instances. Two frontier models. Five modalities.

10,000 instances curated across 5 modalities
2 models evaluated: Kimi 2.5 · Nova-2-Lite
20,000 evaluations completed: 2 models × 10,000 instances

The method

Three phases turn raw documents into scored evaluations.

Phase 01

Data Curation

  • Collect real-world documents: PDFs, images, videos, web pages
  • Design multi-step questions requiring cross-modal reasoning
  • Establish deterministic ground truth answers
  • Assign difficulty levels (Hard, Very Hard, Expert)
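
A minimal sketch of what a curated instance record could look like under the description above; the field names and the Difficulty enum are illustrative assumptions, not the published schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Difficulty(Enum):
    HARD = "hard"
    VERY_HARD = "very_hard"
    EXPERT = "expert"


@dataclass
class Instance:
    """One curated NOKOR task (hypothetical schema, for illustration only)."""
    instance_id: str
    question: str      # multi-step question requiring cross-modal reasoning
    modality: str      # "pdf" | "web" | "image" | "video" | "reasoning"
    difficulty: Difficulty
    ground_truth: str  # deterministic answer used for exact-match scoring
    attachments: list[str] = field(default_factory=list)  # paths or URLs to source material
```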

Phase 02

Agent Execution

  • Deploy AI agents with access to tools (PDF reader, web search, calculator)
  • Agents decompose questions into multi-step reasoning chains
  • Track tool selection, execution time, and API costs
  • Record complete interaction trajectories
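
A hedged sketch of the harness this phase describes: the agent proposes one action per step, the harness dispatches the chosen tool, and every call is appended to the trajectory with its latency and cost. The agent interface, the action dictionary, and the stub tools are all assumptions for illustration.

```python
import time

# Stub tools standing in for the real implementations (illustrative only).
def pdf_reader(args: dict) -> str: return ""
def web_search(args: dict) -> str: return ""
def calculator(args: dict) -> str: return ""

TOOLS = {"pdf_reader": pdf_reader, "web_search": web_search, "calculator": calculator}


def run_instance(agent, instance, max_steps: int = 20):
    """Drive one agent on one instance, recording the full interaction trajectory."""
    trajectory = []
    for _ in range(max_steps):
        action = agent.next_action(instance, trajectory)  # assumed agent interface
        if action["kind"] == "answer":                    # agent commits to a final answer
            return action["text"], trajectory
        start = time.monotonic()
        observation = TOOLS[action["tool"]](action["args"])
        trajectory.append({
            "tool": action["tool"],                       # which tool was selected
            "args": action["args"],
            "observation": observation,
            "latency_s": time.monotonic() - start,        # execution time per call
            "cost_usd": action.get("cost_usd", 0.0),      # API cost per call
        })
    return None, trajectory                               # step budget exhausted, no answer
```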

Phase 03

Scoring

  • Compare agent answers against ground truth (exact match)
  • Compute per-model accuracy across difficulty levels
  • Analyze cost-efficiency and latency trade-offs
  • Generate per-modality and per-difficulty breakdowns
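
A minimal scoring sketch assuming the instance schema sketched in Phase 01; treating surrounding whitespace as insignificant is an assumption, everything else follows the exact-match rule stated in the principles below.

```python
from collections import defaultdict


def exact_match(predicted: str | None, truth: str) -> bool:
    """Strict string equality (no partial credit, no fuzzy matching)."""
    return predicted is not None and predicted.strip() == truth.strip()


def accuracy_by_difficulty(results):
    """results: iterable of (instance, predicted_answer) pairs."""
    tallies = defaultdict(lambda: [0, 0])          # difficulty -> [passed, total]
    for instance, predicted in results:
        bucket = tallies[instance.difficulty]
        bucket[0] += exact_match(predicted, instance.ground_truth)
        bucket[1] += 1
    return {d: passed / total for d, (passed, total) in tallies.items()}
```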

Overall accuracy for two frontier models (Kimi 2.5 at 75%, Nova-2-Lite at 20%).

Kimi 2.5 dominates across all difficulty levels but at 12× the cost of Nova-2-Lite.

Kimi 2.5: 75% accuracy (15/20 pass) · $0.36/instance · 420 s avg latency
Nova-2-Lite: 20% accuracy (4/20 pass) · $0.03/instance · 280 s avg latency

Side-by-side performance breakdown across difficulty tiers and modalities.

Kimi 2.5 accuracy: 75% (15/20 pass)
Nova-2-Lite accuracy: 20% (4/20 pass)
Cost ratio (Kimi/Nova): 12× ($0.36 vs $0.03 per instance)
Accuracy gap: 55 percentage points (75% − 20%)
Mean accuracy (both models): 47.5%

Task breakdown by difficulty level and modality coverage.

By difficulty: Hard 25% (5 instances) · Very Hard 40% (8 instances) · Expert 35% (7 instances)
By modality: PDF 34% · Web 38% · Image 15% · Video 8% · Reasoning 5%

Total instances: 10,000
Models: 2
Modalities: 5
Kimi 2.5 accuracy: 75%

Per-instance results table (columns: Instance · Difficulty · Modality · Kimi 2.5 · Nova-2-Lite · Tools).

Four scoring principles ensure fair and reproducible evaluation.

01

Exact Match

Answers must match ground truth exactly — no partial credit, no fuzzy matching.

02

Tool Transparency

All tool calls are logged. Models must demonstrate appropriate tool selection.

03

Cost Tracking

Every API call is costed. Accuracy must be weighed against economic viability.
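
A hedged sketch of how per-call costing might be computed; the per-million-token prices below are placeholders, not the models' actual rates.

```python
# Placeholder prices in USD per 1M tokens (assumptions, not actual rates).
PRICE_PER_M_TOKENS = {
    "kimi-2.5":    {"input": 1.00, "output": 3.00},
    "nova-2-lite": {"input": 0.06, "output": 0.24},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single API call."""
    price = PRICE_PER_M_TOKENS[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
```

Summing call_cost over every call in a trajectory yields per-instance figures like the $0.36 and $0.03 reported above.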

04

Reproducibility

Fixed seeds, deterministic scoring, published trajectories for independent verification.
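
A sketch of what pinning a run could look like, assuming a simple JSON config: one fixed seed shared by all runs, temperature-zero decoding, and the complete trajectories written out for independent re-scoring. The config keys are illustrative.

```python
import json
import random

RUN_CONFIG = {
    "seed": 42,          # fixed seed shared across runs (illustrative value)
    "temperature": 0.0,  # deterministic decoding
    "max_steps": 20,
}


def pin_randomness(config: dict) -> None:
    """Seed every source of randomness before any instance executes."""
    random.seed(config["seed"])


def publish_run(config: dict, trajectories: list, path: str) -> None:
    """Write the exact config and complete trajectories for verification."""
    with open(path, "w") as f:
        json.dump({"config": config, "trajectories": trajectories}, f, indent=2)
```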