Principle 01 · Sandboxed Execution
- Every trial runs in an isolated container
- No internet access, no host filesystem leakage
- Reproducible environment via Docker images
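A minimal sketch of how the isolation properties above might map onto a `docker run` invocation; the image tag is illustrative, and the actual flags MARS uses are not published here:

```python
# Assemble a docker command enforcing the isolation properties above.
# "mars/task-env:latest" is an illustrative image tag, not a real one.
image = "mars/task-env:latest"
cmd = [
    "docker", "run", "--rm",   # discard the container after the trial
    "--network", "none",       # no internet access
    "--read-only",             # root filesystem is immutable
    "--tmpfs", "/tmp",         # ephemeral scratch space only
    image,
]
print(" ".join(cmd))
```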
How well do AI agents solve hard terminal tasks — from file manipulation to system administration — in real sandboxed environments?
§01 · Overview
MARS is a reinforcement learning environment for training autonomous agents to solve complex command-line interface (CLI) tasks. Each task comprises a human-curated problem based on a real-world operational scenario, a natural-language problem description, and a Docker container for sandboxed execution. Reward signals are generated by test suites that verify the final container state against expected outcomes. Two frontier models were evaluated on 10,000 instances spanning 6 task categories and 3 difficulty tiers.
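The reward mechanism described above, a test suite checking the final container state, can be sketched roughly as follows. The schema and names here are assumptions for illustration, not the actual MARS API; the container state is modeled as a simple path-to-contents dict.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical sketch of a MARS-style task; the real schema is not
# published in this document, so all names here are assumptions.
@dataclass
class Task:
    description: str                      # natural-language problem statement
    image: str                            # Docker image for sandboxed execution
    checks: list[Callable[[dict], bool]] = field(default_factory=list)

def reward(task: Task, final_state: dict) -> float:
    """Fraction of checks the final container state passes (0.0 to 1.0)."""
    if not task.checks:
        return 0.0
    passed = sum(1 for check in task.checks if check(final_state))
    return passed / len(task.checks)

# Toy usage: verify a file exists and has the expected contents.
task = Task(
    description="Write 'hello' to /tmp/out.txt",
    image="mars/base:latest",  # illustrative image name
    checks=[
        lambda s: "/tmp/out.txt" in s,
        lambda s: s.get("/tmp/out.txt") == "hello",
    ],
)
print(reward(task, {"/tmp/out.txt": "hello"}))  # 1.0
print(reward(task, {"/tmp/out.txt": "helo"}))   # 0.5
```

A graded (fraction-passed) reward is one plausible design; a strict all-or-nothing pass signal would simply replace the ratio with `passed == len(task.checks)`.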
§02 · Key metrics
Four numbers that define the scope of MARS.

| Metric | Value | Notes |
|---|---|---|
| Instances | 10,000 | 100% reward signal coverage |
| Models used | 2 | GLM-5 & Nova-2-Lite |
| Categories | 6 | algorithms, data-querying, debugging, file-operations, optimization, software-engineering |
| Difficulty tiers | 3 | Easy, Medium, Hard |
§03 · Methodology
How the evaluation scores are computed.
Evaluation framework
- Principle 01 · Sandboxed Execution
- Principle 02
- Principle 03
- Principle 04
§04 · Evaluation pipeline
Three stages turn a raw task into a scored RL environment.
- Phase 01
- Phase 02
- Phase 03
§05 · Results
GLM-5 achieves a 65% overall pass rate versus 45% for Nova-2-Lite.
Both models degrade as difficulty increases: GLM-5 falls from 85.7% at Easy to 33.3% at Hard, while Nova-2-Lite falls from 71.4% to 0.0%.
GLM-5 also reaches its higher pass rates at roughly 48× lower cost ($0.72 vs. $34.89).
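The cost multiple follows directly from the two cost figures quoted above:

```python
# Dollar costs from the results above.
glm5_cost, nova_cost = 0.72, 34.89
ratio = nova_cost / glm5_cost
print(f"{ratio:.1f}x")  # 48.5x, i.e. the "~48×" quoted in the text
```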
Difficulty breakdown
Success rate by difficulty tier for the two frontier models: GLM-5 maintains a consistent lead across all tiers, while Nova-2-Lite collapses entirely on Hard environments.
| Difficulty | GLM-5 | Nova-2-Lite |
|---|---|---|
| Easy | 85.7% | 71.4% |
| Medium | 71.4% | 57.1% |
| Hard | 33.3% | 0.0% |
Category breakdown
Performance varies sharply by category. File-operations and algorithms show the highest GLM-5 pass rates, while Nova-2-Lite fails data-querying and optimization outright (0.0%).
| Category | GLM-5 | Nova-2-Lite |
|---|---|---|
| algorithms | 100.0% | 100.0% |
| data-querying | 33.3% | 0.0% |
| debugging | 66.7% | 66.7% |
| file-operations | 100.0% | 50.0% |
| optimization | 50.0% | 0.0% |
| software-engineering | 66.7% | 50.0% |
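A quick sketch over the table above, computing each category's absolute GLM-5 vs. Nova-2-Lite gap and the categories where Nova-2-Lite fails entirely:

```python
# Pass rates (%) copied from the category breakdown table above:
# (GLM-5, Nova-2-Lite) per category.
results = {
    "algorithms":           (100.0, 100.0),
    "data-querying":        (33.3, 0.0),
    "debugging":            (66.7, 66.7),
    "file-operations":      (100.0, 50.0),
    "optimization":         (50.0, 0.0),
    "software-engineering": (66.7, 50.0),
}
gaps = {cat: glm - nova for cat, (glm, nova) in results.items()}
widest = [cat for cat, g in gaps.items() if g == max(gaps.values())]
zero_nova = sorted(cat for cat, (_, nova) in results.items() if nova == 0.0)
print(widest)     # categories with the largest absolute gap (50 points each)
print(zero_nova)  # categories Nova-2-Lite fails entirely
```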
§06 · Dataset Viewer
Sample instances evaluated on GLM-5 and Nova-2-Lite.
| # | Instance | Category | Difficulty | GLM-5 | GLM-5 Time | Nova-2-Lite | Nova-2-Lite Time | Language |
|---|---|---|---|---|---|---|---|---|
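The viewer's columns suggest a per-instance record along these lines; the field names are assumptions inferred from the header, and the sample values are placeholders, not real MARS data:

```python
from dataclasses import dataclass

# Hypothetical record mirroring the viewer's columns; values are placeholders.
@dataclass
class Instance:
    instance_id: str
    category: str        # one of the 6 task categories
    difficulty: str      # "Easy" | "Medium" | "Hard"
    glm5_passed: bool
    glm5_time_s: float
    nova_passed: bool
    nova_time_s: float
    language: str        # implementation language of the task

# Placeholder row, not a real MARS instance.
sample = Instance("example-000", "file-operations", "Easy",
                  True, 12.3, False, 40.1, "Python")
print(sample.category)  # file-operations
```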
§07 · Model comparison
GLM-5 achieves higher pass rates than Nova-2-Lite at roughly 48× lower cost.
§08 · Resources