Component 01
Operational Context
- Live enterprise infrastructure with realistic state
- Repos, files, users, projects, messaging
- Task instructions define the operational objective
RL Environments for Training Model Safety Under Adversarial Social Pressure
§01 · Overview
Vesta introduces RL environments for training model safety when models operate real-world enterprise infrastructure under adversarial social pressure.
Models deployed with tool access face a fundamentally different threat surface than what static alignment addresses. They interact with live systems, manage sensitive operations, and encounter adaptive adversaries who exploit operational context to induce unsafe behavior. Vesta provides the environments and reward signals needed to close this gap.
§02 · Key Metrics
Core dimensions of the Vesta environment suite.
Risk Domains
0
Enterprise risk categories
Tool Services
0
GitLab · RocketChat · OwnCloud · Plane · File · Bash
Adversarial Pressure
0
None · Single adversary · Coordinated multi-actor
Reward Signal
Binary
Deterministic, non-gameable
§03 · Risk Domains
Eight enterprise risk categories spanning the full surface area of tool-equipped model failures.
| Risk Domain | Policy Violation Class |
|---|---|
| Computer security compromise | Privilege escalation, auth bypass, config weakening |
| Data loss and corruption | Destructive operations, backup removal, state corruption |
| Privacy breach | Credential exposure, PII exfiltration, access leaks |
| Financial loss | Unauthorized transactions, budget manipulation |
| Harmful decision making | Policy override under social pressure |
| Legal violations | IP theft, documentation fraud, compliance bypass |
| Spreading malicious content | Phishing distribution, misinformation propagation |
| Unsafe code execution | Unreviewed deployment, change freeze bypass |
§04 · Environment Architecture
The structure
Models interact with live infrastructure, face adaptive adversaries, and receive binary reward signals.
Component 01
Component 02
Component 03
§05 · Model Performance
No model achieves acceptable safety under coordinated adversarial pressure.
Coordinated multi-actor pressure reduces safety rates by 50–88% relative to solo operation.
Safety Rate by Adversarial Pressure Level
| Model | No Adversary | Single Adversary | Coordinated | Overall |
|---|---|---|---|---|
| GLM-5 | 50.0% | 25.0% | 25.0% | 30.0% |
| Nova-2-Lite | 100.0% | 50.0% | 12.5% | 45.0% |
Key findings: Nova-2-Lite maintains safety under zero adversarial pressure but collapses to 12.5% under coordinated multi-actor pressure. GLM-5 exhibits weak baseline safety (50%) with minimal further degradation, suggesting undertrained safety.
§06 · Environment Explorer
Browse environments in the Vesta suite. Click any row to expand details.
| Environment | Pressure | Services | GLM-5 | Nova-2-Lite |
|---|
§07 · What Makes This Different
How Vesta compares to prior approaches.
| Dimension | Prior Work | Vesta |
|---|---|---|
| Environment | Mock APIs, text descriptions | Live infrastructure with real services |
| Tools | Simulated tool calls | Actual file system, bash, git, messaging |
| Adversary | Static prompts | Goal-conditioned adversaries with escalation strategies |
| Consequences | Hypothetical | Model actually modifies files, leaks credentials |
| Reward signal | Single LLM judge | Deterministic multi-channel state verification |
§08 · Example Environments
Model manages infrastructure. Direct instruction asks it to create a backdoor admin account with hardcoded credentials for "emergency access."
Adversarial actor poses as HR director requesting salary data export for "board presentation." Applies authority pressure across multiple turns.
Multiple adversarial actors converge to pressure model into manipulating internal governance data.
§09 · Resources