JANUS

MULTIMODAL AGENTIC CAPABILITIES

Can AI agents actively manipulate images and search the web to solve complex visual tasks?

Janus is a reinforcement learning environment for training multimodal agents to solve visual reasoning tasks through tool use. Each instance includes a high-resolution image, a question that cannot be answered without acting on the visual input, and a set of 17 tools spanning image manipulation and web search. Reward signals are generated from 2,000+ stepwise checkpoints that decompose performance into search correctness, visual operation accuracy, and efficiency. Two frontier models were evaluated on a 20-instance subset requiring interleaved visual transformation and information retrieval.


Core statistics of the Janus benchmark at scale.

  • Instances: 10,000 visual reasoning tasks
  • Models Evaluated: 2 frontier multimodal agents
  • Tools Available: 17 (image manipulation + web search)
  • Checkpoints: 2,000+ stepwise verification points

How the metrics and scores are computed. Four principles govern the Janus scoring system.

Principle 01

Dual-Axis Process Verification

  • S-axis (Strategy): Audits Knowledge Expansion — search keywords, reference URLs, expected intermediate answers
  • V-axis (Visual Evidence): Audits Visual Expansion — tool intent (V_tool) and artifact faithfulness (V_true)
  • Scores = fraction of passed checkpoints on each axis
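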
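
A minimal sketch of the per-axis scoring described above: each axis score is the fraction of its checkpoints that pass. The checkpoint record shape (an axis label plus a pass flag) is an assumption for illustration, not Janus's actual schema.

    # Minimal sketch: per-axis score = fraction of passed checkpoints.
    # The checkpoint structure (axis label + pass flag) is an assumed schema,
    # not Janus's actual data format.
    def axis_scores(checkpoints):
        """checkpoints: list of dicts like {"axis": "S" or "V", "passed": bool}."""
        scores = {}
        for axis in ("S", "V"):
            on_axis = [c for c in checkpoints if c["axis"] == axis]
            scores[axis] = (
                sum(c["passed"] for c in on_axis) / len(on_axis) if on_axis else 0.0
            )
        return scores

    # Example: 2 of 3 S-axis checkpoints pass, 1 of 2 V-axis checkpoints pass.
    print(axis_scores([
        {"axis": "S", "passed": True},
        {"axis": "S", "passed": True},
        {"axis": "S", "passed": False},
        {"axis": "V", "passed": True},
        {"axis": "V", "passed": False},
    ]))  # -> {'S': 0.666..., 'V': 0.5}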

Principle 02

Final Answer Accuracy

  • Computed via normalized exact match against golden answers
  • Supported match types: exact, contains, and numeric (with tolerance); see the sketch after this list
  • Answer standardization reduces ambiguity from formatting differences
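
A hedged sketch of answer matching under the three match types above; the normalization rules and the numeric tolerance value are illustrative assumptions, not the benchmark's exact settings.

    # Sketch of normalized answer matching (exact / contains / numeric).
    # Normalization rules and the numeric tolerance are illustrative assumptions.
    def normalize(s):
        return " ".join(s.lower().strip().split())

    def answer_match(pred, gold, match_type="exact", tol=1e-2):
        pred_n, gold_n = normalize(pred), normalize(gold)
        if match_type == "exact":
            return pred_n == gold_n
        if match_type == "contains":
            return gold_n in pred_n
        if match_type == "numeric":
            try:
                return abs(float(pred_n) - float(gold_n)) <= tol
            except ValueError:
                return False
        raise ValueError(f"unknown match type: {match_type}")

    print(answer_match("  1898 ", "1898"))                       # True (exact)
    print(answer_match("Founded in 1898.", "1898", "contains"))  # True
    print(answer_match("3.141", "3.14159", "numeric"))           # True within tolerance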

Principle 03

Efficiency (Overthink Metric)

  • Overthink = max(0, C_agent − C_human) / (C_human + 1)
  • Penalizes redundant tool calls relative to the human reference trajectory (see the sketch after this list)
  • Reference tool calls per task: 2–3 (human baseline)
  • Max tool calls allowed: 5–7 per instance
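
The Overthink formula above maps directly to code; the tool-call counts in the example are placeholders, not benchmark data.

    # Overthink = max(0, C_agent - C_human) / (C_human + 1)
    # Penalizes tool calls beyond the human reference trajectory.
    def overthink(agent_calls, human_calls):
        return max(0, agent_calls - human_calls) / (human_calls + 1)

    # Example values only: an agent using 13 calls against a 3-call human reference.
    print(overthink(13, 3))  # 2.5
    print(overthink(2, 3))   # 0.0 (no penalty for beating the reference)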

Principle 04

Correctness Gating

  • Processed images must pass VLM judge verification
  • Any-pass mechanism: a checkpoint passes if any produced artifact contains the required evidence (sketched after this list)
  • Incorrect visual manipulations are penalized through V_true scoring
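
A minimal sketch of the any-pass rule: a checkpoint is credited if any artifact produced during the run is judged to contain the required evidence. The judge used here is a trivial stand-in, not the benchmark's VLM judge interface.

    # Any-pass gating sketch: a checkpoint passes if ANY artifact produced during
    # the run is judged to contain the required evidence.
    # `vlm_judge` is a placeholder for the VLM-based verifier, not a real API.
    def checkpoint_passes(artifacts, required_evidence, vlm_judge):
        return any(vlm_judge(artifact, required_evidence) for artifact in artifacts)

    # Usage with a trivial stand-in judge that checks extracted text:
    fake_judge = lambda artifact, evidence: evidence in artifact.get("ocr_text", "")
    artifacts = [{"ocr_text": "EST. 1898 ACME BREWING"}, {"ocr_text": ""}]
    print(checkpoint_passes(artifacts, "1898", fake_judge))  # True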

The method

Three steps from image to scored result.

Each instance requires active visual manipulation, external knowledge retrieval, and dual-axis process verification.

Step 01

Investigate & Manipulate

  • Agent receives an image with a task requiring active visual manipulation
  • Must identify the correct region, orientation, or enhancement needed
  • Localizes the visual evidence through crop, rotate, flip, or enhance operations
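
A sketch of what the visual-manipulation operations above might look like, using Pillow; the tool names and argument shapes are assumptions about the interface, not Janus's actual tool schema.

    # Sketch of the visual-manipulation step using Pillow.
    # Tool names and arguments are assumed for illustration, not Janus's actual schema.
    from PIL import Image, ImageEnhance, ImageOps

    def apply_tool(img: Image.Image, name: str, **args) -> Image.Image:
        if name == "crop":      # args: left, upper, right, lower (pixels)
            return img.crop((args["left"], args["upper"], args["right"], args["lower"]))
        if name == "rotate":    # args: degrees (counter-clockwise)
            return img.rotate(args["degrees"], expand=True)
        if name == "flip":      # args: axis = "horizontal" | "vertical"
            return ImageOps.mirror(img) if args["axis"] == "horizontal" else ImageOps.flip(img)
        if name == "enhance":   # args: contrast multiplier
            return ImageEnhance.Contrast(img).enhance(args["factor"])
        raise ValueError(f"unknown tool: {name}")

    # e.g. zoom into a region and boost contrast before reading small text:
    # region = apply_tool(Image.open("instance.jpg"), "crop", left=120, upper=40, right=620, lower=380)
    # readable = apply_tool(region, "enhance", factor=2.0)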

Step 02

Expand & Retrieve

  • Coordinate visual cues with open-web search when external knowledge is needed
  • Extract information from processed images (brand names, years, text)
  • Query web for factual knowledge (founding years, headquarters, historical events)
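
A sketch of the retrieval step under the same caveat: `web_search` is a hypothetical tool stub with an assumed signature, and the cue fields are invented for illustration.

    # Sketch of the retrieval step: cues extracted from the processed image are
    # turned into a web query. `web_search` is a hypothetical tool stub passed in
    # by the caller, not a real API.
    def build_query(cues: dict) -> str:
        # e.g. {"entity": "ACME Brewing", "attribute": "founding year"}
        return f'{cues["entity"]} {cues["attribute"]}'

    def retrieve_fact(cues: dict, web_search) -> str:
        results = web_search(build_query(cues), top_k=3)  # assumed tool signature
        return results[0]["snippet"] if results else ""

    # Usage with a canned stub standing in for the real search tool:
    stub = lambda query, top_k: [{"snippet": f"stub result for: {query}"}]
    print(retrieve_fact({"entity": "ACME Brewing", "attribute": "founding year"}, stub))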

Step 03

Verify Correctness

  • Process-level verification via dual-axis checkpoints
  • V-axis: Verifies visual tool intent and intermediate artifact faithfulness
  • S-axis: Audits search strategy, keywords, and retrieved information correctness
  • Efficiency measured via Overthink metric against human reference trajectories
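
Putting the preceding pieces together, a per-instance result could be collected into a record like the one below; how Janus weights these components into a single reward is not specified here, so the sketch only gathers them.

    # Collects the per-instance metrics described above into one record.
    # How (or whether) these are combined into a single scalar reward is not
    # specified here, so this sketch only gathers them.
    from dataclasses import dataclass

    @dataclass
    class InstanceScore:
        answer_correct: bool   # normalized match against the golden answer
        v_axis: float          # fraction of V-axis checkpoints passed
        s_axis: float          # fraction of S-axis checkpoints passed
        overthink: float       # max(0, C_agent - C_human) / (C_human + 1)

    print(InstanceScore(answer_correct=True, v_axis=0.5, s_axis=0.67, overthink=2.5))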

Kimi K2.5 resolves 14 of 20. Nova-2-Lite resolves 5 of 20.

Kimi K2.5 is 2.8× more accurate while making roughly half as many tool calls per instance (6.2 vs 13.1 on average).

Fig. 1 — Dual-Axis Process Scores (V-axis & S-axis) by Domain
Fig. 2 — Failure Mode Distribution (% of instances)

20 instances across 3 difficulty levels (L1, L2, L3).

Dataset viewer for the 20 instances evaluated against both models. Columns: Instance, Domain, Task, Golden Answer, Nova, Kimi.

Head-to-head breakdown of both evaluated frontier models.

Nova-2-Lite

  • Overall Accuracy: 25% (5/20)
  • V-axis Score: 0.244
  • S-axis Score: 0.417
  • Overthink Ratio: 4.79
  • Avg Tool Calls: 13.1
  • Avg Cost/Instance: ~$0.12

Kimi K2.5

  • Overall Accuracy: 70% (14/20)
  • V-axis Score: 0.572
  • S-axis Score: 0.694
  • Overthink Ratio: 2.23
  • Avg Tool Calls: 6.2
  • Avg Cost/Instance: ~$0.06