§01 · Overview
Janus is a reinforcement learning environment for training multimodal agents to solve visual reasoning tasks through tool use. Each instance includes a high-resolution image, a question that cannot be answered without acting on the visual input, and a set of 17 tools spanning image manipulation and web search. Reward signals are generated from 2,000+ stepwise checkpoints that decompose performance into search correctness, visual-operation accuracy, and efficiency. The benchmark comprises 10,000 instances requiring interleaved visual transformation and information retrieval; two frontier models were evaluated on a 20-instance subset (§05–§07).
Keep scrolling. The answer is in §05.
§02 · Key metrics
Core statistics of the Janus benchmark at scale.
Instances: 10,000 visual reasoning tasks
Models Evaluated: 2 frontier multimodal agents
Tools Available: 17 (image manipulation + web search)
Checkpoints: 2,000+ stepwise verification points
§03 · Methodology
How the metrics and scores are computed. Four principles govern the Janus scoring system.
Principle 01
Dual-Axis Process Verification
S-axis (Strategy): Audits Knowledge Expansion — search keywords, reference URLs, expected intermediate answers
V-axis (Visual Evidence): Audits Visual Expansion — tool intent (V_tool) and artifact faithfulness (V_true)
Each axis score is the fraction of that axis's checkpoints that pass
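A minimal sketch of the axis scoring, assuming a flat checkpoint list tagged by axis (the schema below is illustrative, not Janus's actual data model):

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    axis: str     # "S" (strategy) or "V" (visual evidence)
    passed: bool  # set by the checkpoint's verifier

def axis_score(checkpoints: list[Checkpoint], axis: str) -> float:
    """Fraction of passed checkpoints on one axis; 0.0 if the axis has none."""
    on_axis = [c for c in checkpoints if c.axis == axis]
    return sum(c.passed for c in on_axis) / len(on_axis) if on_axis else 0.0
```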
Principle 02
Final Answer Accuracy
Computed via normalized exact match against golden answers
Supported match types: exact, contains, numeric (with tolerance)
Answer standardization eliminates evaluation ambiguity
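A hedged sketch of the three match types; the normalization rules (lowercasing, whitespace collapsing) and the numeric tolerance are assumptions, not the benchmark's published settings:

```python
import math
import re

def normalize(s: str) -> str:
    """Assumed standardization: lowercase, trim, collapse whitespace."""
    return re.sub(r"\s+", " ", s.strip().lower())

def answer_match(pred: str, gold: str, match_type: str, tol: float = 1e-2) -> bool:
    if match_type == "exact":
        return normalize(pred) == normalize(gold)
    if match_type == "contains":
        return normalize(gold) in normalize(pred)
    if match_type == "numeric":
        try:
            return math.isclose(float(pred), float(gold), rel_tol=tol)
        except ValueError:
            return False
    raise ValueError(f"unknown match type: {match_type!r}")
```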
Principle 03
Efficiency (Overthink Metric)
Overthink = max(0, C_agent − C_human) / (C_human + 1)
Penalizes redundant tool calls relative to human reference trajectory
Reference tool calls per task: 2–3 (human baseline)
Max tool calls allowed: 5–7 per instance
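The Overthink formula translates directly into code (variable names are mine):

```python
def overthink(agent_calls: int, human_calls: int) -> float:
    """Overthink = max(0, C_agent - C_human) / (C_human + 1).

    Zero when the agent stays at or under the human reference; the +1
    keeps the denominator finite even for a zero-call reference.
    """
    return max(0, agent_calls - human_calls) / (human_calls + 1)

# e.g. against a 2-call human baseline: overthink(5, 2) == 1.0, overthink(2, 2) == 0.0
```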
Principle 04
Correctness Gating
Processed images must pass VLM judge verification
Any-pass mechanism: checkpoint passes if any artifact contains required evidence
Incorrect visual manipulations are penalized through V_true scoring
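The any-pass mechanism reduces to a disjunction over artifacts; verify_artifact below is a stand-in for the VLM judge, not Janus's actual API:

```python
def verify_artifact(artifact: dict, required_evidence: str) -> bool:
    """Stand-in for the VLM judge: checks whether the judge labeled this
    artifact as containing the required evidence."""
    return required_evidence in artifact.get("judge_labels", [])

def checkpoint_passes(artifacts: list[dict], required_evidence: str) -> bool:
    """Any-pass gating: the checkpoint passes if ANY artifact carries the
    required evidence; incorrect manipulations still lower V_true."""
    return any(verify_artifact(a, required_evidence) for a in artifacts)
```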
§04 · Evaluation pipeline
Three steps from image to scored result. Each instance requires active visual manipulation, external knowledge retrieval, and dual-axis process verification.
Step 01
Investigate & Manipulate
Agent receives an image with a task requiring active visual manipulation
Must identify the correct region, orientation, or enhancement needed
Localizes the visual evidence through crop, rotate, flip, or enhance operations
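For illustration only (Janus's tool interface is not shown here), the four named operations map onto standard Pillow calls; the filename is hypothetical:

```python
from PIL import Image, ImageEnhance, ImageOps

img = Image.open("instance_0001.jpg")               # hypothetical instance image

cropped  = img.crop((100, 50, 400, 300))            # crop: (left, top, right, bottom)
rotated  = img.rotate(90, expand=True)              # rotate: degrees counter-clockwise
flipped  = ImageOps.mirror(img)                     # flip: horizontal mirror
enhanced = ImageEnhance.Contrast(img).enhance(2.0)  # enhance: double the contrast
```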
Step 02
Expand & Retrieve
Coordinate visual cues with open-web search when external knowledge is needed
Extract information from processed images (brand names, years, text)
Query web for factual knowledge (founding years, headquarters, historical events)
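A sketch of the coordination pattern; extract_text and web_search are hypothetical stubs standing in for Janus's image read-out and open-web search tools:

```python
def extract_text(image_path: str) -> str:
    """Hypothetical stub: reads a cue (brand name, year, text) off the processed image."""
    return "ACME Corp"

def web_search(query: str) -> list[str]:
    """Hypothetical stub for the open-web search tool; returns result snippets."""
    return [f"ACME Corp was founded in 1952. (query: {query!r})"]

def expand_and_retrieve(image_path: str, question: str) -> str:
    cue = extract_text(image_path)              # 1. pull the visual cue
    snippets = web_search(f"{cue} {question}")  # 2. ground the web query in the cue
    return snippets[0]                          # 3. hand the evidence to the agent
```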
Step 03
Verify Correctness
Process-level verification via dual-axis checkpoints
V-axis: Verifies visual tool intent and intermediate artifact faithfulness
S-axis: Audits search strategy, keywords, and retrieved information correctness
Efficiency measured via Overthink metric against human reference trajectories
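Taken together, a scored instance carries the four quantities named above; how Janus weights them into a single reward is not specified here, so this record is only a shape sketch:

```python
from dataclasses import dataclass

@dataclass
class ScoredResult:
    """Per-instance pipeline output; fields mirror the metrics reported in §07."""
    answer_correct: bool  # normalized match against the golden answer (Principle 02)
    v_axis: float         # fraction of V-axis checkpoints passed
    s_axis: float         # fraction of S-axis checkpoints passed
    overthink: float      # efficiency penalty (Principle 03)
```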
§05 · Results
Kimi K2.5 resolves 14 of 20.
Nova-2-Lite resolves 5 of 20.
Kimi is 2.8× more accurate, yet Nova makes roughly twice as many tool calls per instance (13.1 vs 6.2 on average).
Fig. 1 — Dual-Axis Process Scores (V-axis & S-axis) by Domain
Fig. 2 — Failure Mode Distribution (% of instances)
§06 · Dataset Viewer
20 instances across 3 difficulty levels (L1, L2, L3), with full run metadata recorded per row.
Domains: Aviation, Automotive, Document, Electronics, Entertainment, General, Numismatics, Retail, Sports, Technology, Telecom, Urban, Visual Puzzle.
Each of the 20 rows records: Instance · Domain · Task · Golden answer · Nova-2-Lite result · Kimi K2.5 result.
§07 · Model comparison
Head-to-head breakdown of both evaluated frontier models.
Metric               Nova-2-Lite    Kimi K2.5
Overall Accuracy     25% (5/20)     70% (14/20)
V-axis Score         0.244          0.572
S-axis Score         0.417          0.694
Overthink Ratio      4.79           2.23
Avg Tool Calls       13.1           6.2
Avg Cost/Instance    ~$0.12         ~$0.06