JANUS

MULTIMODAL AGENTIC CAPABILITIES

Can AI agents actively manipulate images and search the web to solve complex visual tasks?

Janus is a reinforcement learning environment for training multimodal agents to solve visual reasoning tasks through tool use. Each instance includes a high-resolution image, a question that cannot be answered without acting on the visual input, and a set of 17 tools spanning image manipulation and web search. Reward signals are generated from 2,000+ stepwise checkpoints that decompose performance into search correctness, visual operation accuracy, and efficiency. Two frontier models were evaluated on a 20-instance subset requiring interleaved visual transformation and information retrieval.


Core statistics of the Janus benchmark at scale.

  • Instances: 10,000 visual reasoning tasks
  • Models Evaluated: 2 frontier multimodal agents
  • Tools Available: 17 (image manipulation + web search)
  • Checkpoints: 2,000+ stepwise verification points

How the metrics and scores are computed. Four principles govern the Janus scoring system.

Principle 01

Dual-Axis Process Verification

  • S-axis (Strategy): Audits Knowledge Expansion — search keywords, reference URLs, expected intermediate answers
  • V-axis (Visual Evidence): Audits Visual Expansion — tool intent (V_tool) and artifact faithfulness (V_true)
  • Scores = fraction of passed checkpoints on each axis
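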
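
A minimal sketch of the per-axis scoring described above: each axis score is the fraction of its checkpoints that pass. The checkpoint record shape (an axis label plus a pass flag) is an assumption for illustration, not Janus's actual schema.

    # Minimal sketch: per-axis score = fraction of passed checkpoints.
    # The checkpoint structure (axis label + pass flag) is an assumed schema,
    # not Janus's actual data format.
    def axis_scores(checkpoints):
        """checkpoints: list of dicts like {"axis": "S" or "V", "passed": bool}."""
        scores = {}
        for axis in ("S", "V"):
            on_axis = [c for c in checkpoints if c["axis"] == axis]
            scores[axis] = (
                sum(c["passed"] for c in on_axis) / len(on_axis) if on_axis else 0.0
            )
        return scores

    # Example: 2 of 3 S-axis checkpoints pass, 1 of 2 V-axis checkpoints pass.
    print(axis_scores([
        {"axis": "S", "passed": True},
        {"axis": "S", "passed": True},
        {"axis": "S", "passed": False},
        {"axis": "V", "passed": True},
        {"axis": "V", "passed": False},
    ]))  # -> {'S': 0.666..., 'V': 0.5}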

Principle 02

Final Answer Accuracy

  • Computed via normalized exact match against golden answers
  • Supported match types: exact, contains, and numeric (with tolerance); see the sketch after this list
  • Answer standardization reduces ambiguity from formatting differences
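
A hedged sketch of answer matching under the three match types above; the normalization rules and the numeric tolerance value are illustrative assumptions, not the benchmark's exact settings.

    # Sketch of normalized answer matching (exact / contains / numeric).
    # Normalization rules and the numeric tolerance are illustrative assumptions.
    def normalize(s):
        return " ".join(s.lower().strip().split())

    def answer_match(pred, gold, match_type="exact", tol=1e-2):
        pred_n, gold_n = normalize(pred), normalize(gold)
        if match_type == "exact":
            return pred_n == gold_n
        if match_type == "contains":
            return gold_n in pred_n
        if match_type == "numeric":
            try:
                return abs(float(pred_n) - float(gold_n)) <= tol
            except ValueError:
                return False
        raise ValueError(f"unknown match type: {match_type}")

    print(answer_match("  1898 ", "1898"))                       # True (exact)
    print(answer_match("Founded in 1898.", "1898", "contains"))  # True
    print(answer_match("3.141", "3.14159", "numeric"))           # True within tolerance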

Principle 03

Efficiency (Overthink Metric)

  • Overthink = max(0, C_agent − C_human) / (C_human + 1)
  • Penalizes redundant tool calls relative to the human reference trajectory (see the sketch after this list)
  • Reference tool calls per task: 2–3 (human baseline)
  • Max tool calls allowed: 5–7 per instance
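
The Overthink formula above maps directly to code; the tool-call counts in the example are placeholders, not benchmark data.

    # Overthink = max(0, C_agent - C_human) / (C_human + 1)
    # Penalizes tool calls beyond the human reference trajectory.
    def overthink(agent_calls, human_calls):
        return max(0, agent_calls - human_calls) / (human_calls + 1)

    # Example values only: an agent using 13 calls against a 3-call human reference.
    print(overthink(13, 3))  # 2.5
    print(overthink(2, 3))   # 0.0 (no penalty for beating the reference)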

Principle 04

Correctness Gating

  • Processed images must pass VLM judge verification
  • Any-pass mechanism: a checkpoint passes if any produced artifact contains the required evidence (sketched after this list)
  • Incorrect visual manipulations are penalized through V_true scoring
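
A minimal sketch of the any-pass rule: a checkpoint is credited if any artifact produced during the run is judged to contain the required evidence. The judge used here is a trivial stand-in, not the benchmark's VLM judge interface.

    # Any-pass gating sketch: a checkpoint passes if ANY artifact produced during
    # the run is judged to contain the required evidence.
    # `vlm_judge` is a placeholder for the VLM-based verifier, not a real API.
    def checkpoint_passes(artifacts, required_evidence, vlm_judge):
        return any(vlm_judge(artifact, required_evidence) for artifact in artifacts)

    # Usage with a trivial stand-in judge that checks extracted text:
    fake_judge = lambda artifact, evidence: evidence in artifact.get("ocr_text", "")
    artifacts = [{"ocr_text": "EST. 1898 ACME BREWING"}, {"ocr_text": ""}]
    print(checkpoint_passes(artifacts, "1898", fake_judge))  # True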

The method

Three steps from image to scored result.

Each instance requires active visual manipulation, external knowledge retrieval, and dual-axis process verification.

Step 01

Investigate & Manipulate

  • Agent receives an image with a task requiring active visual manipulation
  • Must identify the correct region, orientation, or enhancement needed
  • Localizes the visual evidence through crop, rotate, flip, or enhance operations
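
A sketch of what the visual-manipulation operations above might look like, using Pillow; the tool names and argument shapes are assumptions about the interface, not Janus's actual tool schema.

    # Sketch of the visual-manipulation step using Pillow.
    # Tool names and arguments are assumed for illustration, not Janus's actual schema.
    from PIL import Image, ImageEnhance, ImageOps

    def apply_tool(img: Image.Image, name: str, **args) -> Image.Image:
        if name == "crop":      # args: left, upper, right, lower (pixels)
            return img.crop((args["left"], args["upper"], args["right"], args["lower"]))
        if name == "rotate":    # args: degrees (counter-clockwise)
            return img.rotate(args["degrees"], expand=True)
        if name == "flip":      # args: axis = "horizontal" | "vertical"
            return ImageOps.mirror(img) if args["axis"] == "horizontal" else ImageOps.flip(img)
        if name == "enhance":   # args: contrast multiplier
            return ImageEnhance.Contrast(img).enhance(args["factor"])
        raise ValueError(f"unknown tool: {name}")

    # e.g. zoom into a region and boost contrast before reading small text:
    # region = apply_tool(Image.open("instance.jpg"), "crop", left=120, upper=40, right=620, lower=380)
    # readable = apply_tool(region, "enhance", factor=2.0)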

Step 02

Expand & Retrieve

  • Coordinate visual cues with open-web search when external knowledge is needed
  • Extract information from processed images (brand names, years, text)
  • Query web for factual knowledge (founding years, headquarters, historical events)
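
A sketch of the retrieval step under the same caveat: `web_search` is a hypothetical tool stub with an assumed signature, and the cue fields are invented for illustration.

    # Sketch of the retrieval step: cues extracted from the processed image are
    # turned into a web query. `web_search` is a hypothetical tool stub passed in
    # by the caller, not a real API.
    def build_query(cues: dict) -> str:
        # e.g. {"entity": "ACME Brewing", "attribute": "founding year"}
        return f'{cues["entity"]} {cues["attribute"]}'

    def retrieve_fact(cues: dict, web_search) -> str:
        results = web_search(build_query(cues), top_k=3)  # assumed tool signature
        return results[0]["snippet"] if results else ""

    # Usage with a canned stub standing in for the real search tool:
    stub = lambda query, top_k: [{"snippet": f"stub result for: {query}"}]
    print(retrieve_fact({"entity": "ACME Brewing", "attribute": "founding year"}, stub))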

Step 03

Verify Correctness

  • Process-level verification via dual-axis checkpoints
  • V-axis: Verifies visual tool intent and intermediate artifact faithfulness
  • S-axis: Audits search strategy, keywords, and retrieved information correctness
  • Efficiency measured via Overthink metric against human reference trajectories
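
Putting the preceding pieces together, a per-instance result could be collected into a record like the one below; how Janus weights these components into a single reward is not specified here, so the sketch only gathers them.

    # Collects the per-instance metrics described above into one record.
    # How (or whether) these are combined into a single scalar reward is not
    # specified here, so this sketch only gathers them.
    from dataclasses import dataclass

    @dataclass
    class InstanceScore:
        answer_correct: bool   # normalized match against the golden answer
        v_axis: float          # fraction of V-axis checkpoints passed
        s_axis: float          # fraction of S-axis checkpoints passed
        overthink: float       # max(0, C_agent - C_human) / (C_human + 1)

    print(InstanceScore(answer_correct=True, v_axis=0.5, s_axis=0.67, overthink=2.5))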

Kimi K2.5 resolves 14 of 20. Nova-2-Lite resolves 5 of 20.

Kimi K2.5 is 2.8× more accurate while making roughly half as many tool calls per instance (6.2 vs 13.1 on average).

Fig. 1 — Dual-Axis Process Scores (V-axis & S-axis) by Domain
Fig. 2 — Failure Mode Distribution (% of instances)

20 instances across 3 difficulty levels (L1, L2, L3).

Dataset viewer for the 20 instances evaluated against both models. Columns: Instance, Domain, Task, Golden Answer, Nova, Kimi.

Head-to-head breakdown of both evaluated frontier models.

Nova-2-Lite

  • Overall Accuracy: 25% (5/20)
  • V-axis Score: 0.244
  • S-axis Score: 0.417
  • Overthink Ratio: 4.79
  • Avg Tool Calls: 13.1
  • Avg Cost/Instance: ~$0.12

Kimi K2.5

  • Overall Accuracy: 70% (14/20)
  • V-axis Score: 0.572
  • S-axis Score: 0.694
  • Overthink Ratio: 2.23
  • Avg Tool Calls: 6.2
  • Avg Cost/Instance: ~$0.06