MARS

REINFORCEMENT LEARNING ENVIRONMENT

How well do AI agents solve hard terminal tasks — from file manipulation to system administration — in real sandboxed environments?

MARS is a reinforcement learning environment for training autonomous agents to resolve complex CLI (command-line interface) tasks. Each task pairs a natural-language problem description, human-curated from real-world operational scenarios, with a Docker container for sandboxed execution. Reward signals are generated by test suites that verify the final container state against expected outcomes. Two frontier models were tested on 10,000 instances covering 6 task categories across 3 difficulty tiers.


Four numbers that define the scope of MARS.

  • Instances: 10,000 (100% reward signal coverage)
  • Models used: 2 (GLM-5 & Nova-2-Lite)
  • Categories: 6
  • Difficulty tiers: 3 (Easy, Medium, Hard)

How the evaluation scores are computed.

Evaluation framework

Four principles govern the MARS evaluation system.

Principle 01

Sandboxed Execution

  • Every trial runs in an isolated container
  • No internet access, no host filesystem leakage
  • Reproducible environment via Docker images
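The isolation rules above can be sketched as a `docker run` invocation. The helper below only assembles the command line (`--rm` and `--network none` are standard Docker flags; the image name and agent command are hypothetical):

```python
from typing import List

def trial_command(image: str, agent_cmd: str) -> List[str]:
    """Assemble a `docker run` invocation enforcing the sandbox rules:
    no network access, no host filesystem mounts, container removed
    automatically once the trial ends."""
    return [
        "docker", "run",
        "--rm",                # discard the container after the trial
        "--network", "none",   # no internet access
        image,
        "sh", "-c", agent_cmd,
    ]

cmd = trial_command("mars/task-042", "ls /workspace")
```

Because no volumes are mounted, nothing the agent writes can leak back to the host; reproducibility comes from pinning the image, not the machine it runs on.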

Principle 02

Automated Verification

  • Test scripts execute automatically post-agent
  • Binary reward: pass (1) or fail (0)
  • No human evaluation in the loop
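A minimal sketch of the pass/fail mapping: the verification command (e.g. a `docker exec` of the task's test script, names hypothetical) runs after the agent finishes, and its exit code alone determines the reward.

```python
import subprocess
from typing import List

def binary_reward(test_cmd: List[str]) -> int:
    """Run the verification command post-agent; exit code 0 means every
    check passed. Reward is strictly 1 (pass) or 0 (fail), nothing in between."""
    proc = subprocess.run(test_cmd, capture_output=True)
    return 1 if proc.returncode == 0 else 0
```

A typical call might look like `binary_reward(["docker", "exec", container_id, "bash", "/tests/run.sh"])`, where the script path is an assumption about the harness layout.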

Principle 03

Single Attempt

  • Each agent gets exactly one attempt per task
  • Mirrors real-world single-submission workflow
  • No retry, no cherry-picking best runs

Principle 04

Cost Tracking

  • Full token usage (input + output) recorded
  • USD cost computed per trial via LLM pricing
  • Enables cost-efficiency comparisons
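The token-to-USD conversion is simple arithmetic over per-million-token prices; the rates in the example are placeholders, not the models' actual pricing.

```python
def trial_cost_usd(input_tokens: int, output_tokens: int,
                   usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Convert recorded token counts to USD using per-million-token pricing.
    Prices are supplied per model; the figures below are placeholders."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# e.g. 120k input + 30k output tokens at $0.50 / $1.50 per million tokens
cost = trial_cost_usd(120_000, 30_000, 0.50, 1.50)  # 0.105
```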

Three stages turn a raw task into a scored RL environment.

Phase 01

Environment Construction

  • Identify target real-world programming problems
  • Write human curated reward signals (deterministic test harnesses)
  • Package sandboxed runtimes with graded difficulty metadata
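A deterministic test harness of the kind written in this phase might simply inspect the final filesystem state and report a pass/fail exit code; the path and expected contents here are hypothetical stand-ins for a real task's spec.

```python
import os
import sys

def check_final_state(path: str = "/workspace/output.txt",
                      expected: str = "done") -> int:
    """Return a process exit code: 0 iff the file the task asked for exists
    with the expected contents. The reward wrapper maps exit 0 -> pass."""
    if not os.path.exists(path):
        return 1
    with open(path) as f:
        return 0 if f.read().strip() == expected else 1

if __name__ == "__main__":
    sys.exit(check_final_state())
```

Keeping the check a pure function of the container's end state is what makes the reward signal deterministic across reruns.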

Phase 02

Agent Execution

  • Deploy frontier models into isolated environments
  • Agents produce solutions under resource constraints
  • Capture per-environment outputs

Phase 03

Reward Signal Evaluation

  • Execute reward signal harnesses against agent outputs
  • Score binary pass/fail per test case
  • Aggregate to environment-level and category-level pass rates
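The aggregation step above can be sketched as a fold over (category, passed) records; the record shape is an assumption, not the harness's actual schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def pass_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate per-environment pass/fail records into category-level rates."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, ok in results:
        passed[category] += int(ok)
        total[category] += 1
    return {c: passed[c] / total[c] for c in total}

rates = pass_rates([("debugging", True), ("debugging", False),
                    ("algorithms", True)])
# rates["debugging"] == 0.5, rates["algorithms"] == 1.0
```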

GLM-5 achieves an overall pass rate of 65%; Nova-2-Lite achieves 45%.

Both models degrade sharply as difficulty increases: GLM-5 falls from 85.7% at Easy to 33.3% at Hard, and Nova-2-Lite from 71.4% to 0.0%.

GLM-5 achieves higher pass rates at ~48× lower total cost ($0.72 vs. $34.89).

Success rate by difficulty tier for two frontier models.

Fig. 1 — Success Rate by Difficulty
Fig. 2 — Success Rate by Category
Fig. 3 — Success Rate by Reward Signal Count

Difficulty breakdown

GLM-5 maintains a consistent lead across all tiers. Nova-2-Lite collapses entirely on Hard environments.

Difficulty   GLM-5    Nova-2-Lite
Easy         85.7%    71.4%
Medium       71.4%    57.1%
Hard         33.3%     0.0%
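The per-tier percentages are consistent with a 7/7/6 split of the 20 sampled instances across Easy/Medium/Hard; that split is an assumption (it is not stated explicitly), but under it the overall rate falls out as simple arithmetic:

```python
# Assumed tier sizes (7 Easy, 7 Medium, 6 Hard) reproduce the reported
# GLM-5 per-tier rates: 6/7 = 85.7%, 5/7 = 71.4%, 2/6 = 33.3%.
glm5 = {"Easy": (6, 7), "Medium": (5, 7), "Hard": (2, 6)}
overall = (sum(p for p, _ in glm5.values())
           / sum(n for _, n in glm5.values()))
# 13 / 20 = 0.65, matching the reported 65% overall pass rate
```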

Category breakdown

Performance varies sharply by category. Algorithms and file-operations show the highest GLM-5 pass rates (100%), while Nova-2-Lite fails every data-querying and optimization instance; file-operations and optimization show the widest model gap (50 points each).

Category               GLM-5    Nova-2-Lite
algorithms             100.0%   100.0%
data-querying           33.3%     0.0%
debugging               66.7%    66.7%
file-operations        100.0%    50.0%
optimization            50.0%     0.0%
software-engineering    66.7%    50.0%

Sample instances evaluated on GLM-5 and Nova-2-Lite.

Dataset viewer: 20 sample instances, listing for each instance its category, difficulty, per-model pass/fail result and wall time, and task language.

GLM-5

  • Overall pass rate: 65.0%
  • Easy: 85.7% / Medium: 71.4% / Hard: 33.3%
  • Total cost: $0.72

Nova-2-Lite

  • Overall pass rate: 45.0%
  • Easy: 71.4% / Medium: 57.1% / Hard: 0.0%
  • Total cost: $34.89
