MARS

REINFORCEMENT LEARNING ENVIRONMENT

How well do AI agents solve hard terminal tasks — from file manipulation to system administration — in real sandboxed environments?

MARS is a reinforcement learning environment for training autonomous agents to resolve complex CLI (command-line interface) tasks. Each task pairs a natural-language problem description, human-curated from real-world operational scenarios, with a Docker container for sandboxed execution. Reward signals are generated by test suites that verify the final container state against expected outcomes. Two frontier models were tested on 10,000 instances covering 6 task categories across 3 difficulty tiers.


Four numbers that define the scope of MARS.

  • Instances: 10,000 (100% reward signal coverage)
  • Models used: 2 (GLM-5 & Nova-2-Lite)
  • Categories: 6
  • Difficulty tiers: 3 (Easy, Medium, Hard)

How the evaluation scores are computed.

Evaluation framework

Four principles govern the MARS evaluation system.

Principle 01

Sandboxed Execution

  • Every trial runs in an isolated container
  • No internet access, no host filesystem leakage
  • Reproducible environment via Docker images
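The isolation rules above can be sketched as a `docker run` invocation. The helper below only assembles the command line (`--rm` and `--network none` are standard Docker flags; the image name and agent command are hypothetical):

```python
from typing import List

def trial_command(image: str, agent_cmd: str) -> List[str]:
    """Assemble a `docker run` invocation enforcing the sandbox rules:
    no network access, no host filesystem mounts, container removed
    automatically once the trial ends."""
    return [
        "docker", "run",
        "--rm",                # discard the container after the trial
        "--network", "none",   # no internet access
        image,
        "sh", "-c", agent_cmd,
    ]

cmd = trial_command("mars/task-042", "ls /workspace")
```

Because no volumes are mounted, nothing the agent writes can leak back to the host; reproducibility comes from pinning the image, not the machine it runs on.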

Principle 02

Automated Verification

  • Test scripts execute automatically post-agent
  • Binary reward: pass (1) or fail (0)
  • No human evaluation in the loop
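A minimal sketch of the pass/fail mapping: the verification command (e.g. a `docker exec` of the task's test script, names hypothetical) runs after the agent finishes, and its exit code alone determines the reward.

```python
import subprocess
from typing import List

def binary_reward(test_cmd: List[str]) -> int:
    """Run the verification command post-agent; exit code 0 means every
    check passed. Reward is strictly 1 (pass) or 0 (fail), nothing in between."""
    proc = subprocess.run(test_cmd, capture_output=True)
    return 1 if proc.returncode == 0 else 0
```

A typical call might look like `binary_reward(["docker", "exec", container_id, "bash", "/tests/run.sh"])`, where the script path is an assumption about the harness layout.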

Principle 03

Single Attempt

  • Each agent gets exactly one attempt per task
  • Mirrors real-world single-submission workflow
  • No retry, no cherry-picking best runs

Principle 04

Cost Tracking

  • Full token usage (input + output) recorded
  • USD cost computed per trial via LLM pricing
  • Enables cost-efficiency comparisons
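The token-to-USD conversion is simple arithmetic over per-million-token prices; the rates in the example are placeholders, not the models' actual pricing.

```python
def trial_cost_usd(input_tokens: int, output_tokens: int,
                   usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Convert recorded token counts to USD using per-million-token pricing.
    Prices are supplied per model; the figures below are placeholders."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# e.g. 120k input + 30k output tokens at $0.50 / $1.50 per million tokens
cost = trial_cost_usd(120_000, 30_000, 0.50, 1.50)  # 0.105
```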

Three stages turn a raw task into a scored RL environment.

Phase 01

Environment Construction

  • Identify target real-world programming problems
  • Write human curated reward signals (deterministic test harnesses)
  • Package sandboxed runtimes with graded difficulty metadata
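A deterministic test harness of the kind written in this phase might simply inspect the final filesystem state and report a pass/fail exit code; the path and expected contents here are hypothetical stand-ins for a real task's spec.

```python
import os
import sys

def check_final_state(path: str = "/workspace/output.txt",
                      expected: str = "done") -> int:
    """Return a process exit code: 0 iff the file the task asked for exists
    with the expected contents. The reward wrapper maps exit 0 -> pass."""
    if not os.path.exists(path):
        return 1
    with open(path) as f:
        return 0 if f.read().strip() == expected else 1

if __name__ == "__main__":
    sys.exit(check_final_state())
```

Keeping the check a pure function of the container's end state is what makes the reward signal deterministic across reruns.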

Phase 02

Agent Execution

  • Deploy frontier models into isolated environments
  • Agents produce solutions under resource constraints
  • Capture per-environment outputs

Phase 03

Reward Signal Evaluation

  • Execute reward signal harnesses against agent outputs
  • Score binary pass/fail per test case
  • Aggregate to environment-level and category-level pass rates
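The aggregation step above can be sketched as a fold over (category, passed) records; the record shape is an assumption, not the harness's actual schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def pass_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate per-environment pass/fail records into category-level rates."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, ok in results:
        passed[category] += int(ok)
        total[category] += 1
    return {c: passed[c] / total[c] for c in total}

rates = pass_rates([("debugging", True), ("debugging", False),
                    ("algorithms", True)])
# rates["debugging"] == 0.5, rates["algorithms"] == 1.0
```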

GLM-5 achieves an overall pass rate of 65%; Nova-2-Lite achieves 45%.

Both models degrade sharply as difficulty increases: GLM-5 falls from 85.7% at Easy to 33.3% at Hard, and Nova-2-Lite from 71.4% to 0.0%.

GLM-5 achieves higher pass rates at ~48× lower total cost ($0.72 vs. $34.89).

Success rate by difficulty tier for two frontier models.

Fig. 1 — Success Rate by Difficulty
Fig. 2 — Success Rate by Category
Fig. 3 — Success Rate by Reward Signal Count

Difficulty breakdown

GLM-5 maintains a consistent lead across all tiers. Nova-2-Lite collapses entirely on Hard environments.

Difficulty   GLM-5    Nova-2-Lite
Easy         85.7%    71.4%
Medium       71.4%    57.1%
Hard         33.3%     0.0%
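The per-tier percentages are consistent with a 7/7/6 split of the 20 sampled instances across Easy/Medium/Hard; that split is an assumption (it is not stated explicitly), but under it the overall rate falls out as simple arithmetic:

```python
# Assumed tier sizes (7 Easy, 7 Medium, 6 Hard) reproduce the reported
# GLM-5 per-tier rates: 6/7 = 85.7%, 5/7 = 71.4%, 2/6 = 33.3%.
glm5 = {"Easy": (6, 7), "Medium": (5, 7), "Hard": (2, 6)}
overall = (sum(p for p, _ in glm5.values())
           / sum(n for _, n in glm5.values()))
# 13 / 20 = 0.65, matching the reported 65% overall pass rate
```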

Category breakdown

Performance varies sharply by category. Algorithms and file-operations show the highest GLM-5 pass rates (100%), while Nova-2-Lite fails every data-querying and optimization instance; file-operations and optimization show the widest model gap (50 points each).

Category               GLM-5    Nova-2-Lite
algorithms             100.0%   100.0%
data-querying           33.3%     0.0%
debugging               66.7%    66.7%
file-operations        100.0%    50.0%
optimization            50.0%     0.0%
software-engineering    66.7%    50.0%

Sample instances evaluated on GLM-5 and Nova-2-Lite.

Dataset viewer: 20 sample instances, listing for each instance its category, difficulty, per-model pass/fail result and wall time, and task language.

GLM-5

  • Overall pass rate: 65.0%
  • Easy: 85.7% / Medium: 71.4% / Hard: 33.3%
  • Total cost: $0.72

Nova-2-Lite

  • Overall pass rate: 45.0%
  • Easy: 71.4% / Medium: 57.1% / Hard: 0.0%
  • Total cost: $34.89
