Environments, Benchmarks & Data Generation
Hermes Agent ships a full environment framework that connects its tool-calling capabilities to Atropos for RL training. This enables three workflows:
- RL Training — Train language models on multi-turn agentic tasks with GRPO
- Benchmarks — Evaluate models on standardised agentic benchmarks
- Data Generation — Generate SFT training data from agent rollouts
All three share the same core: an environment class that defines tasks, runs an agent loop, and scores the output.
The Python environment framework documented here lives under the repo's `environments/` directory and is the implementation-level API for Hermes/Atropos integration. It is separate from the user-facing `rl_*` tools, which act as an orchestration layer for remote RL training workflows.
- Want to run benchmarks? Jump to Available Benchmarks
- Want to train with RL? See RL Training Tools for the agent-driven interface, or Running Environments for manual execution
- Want to create a new environment? See Creating Environments
Architecture
The environment system is built on a three-layer inheritance chain:
BaseEnv (Atropos)
The foundation from atroposlib. Provides:
- Server management — connects to OpenAI-compatible APIs (VLLM, SGLang, OpenRouter)
- Worker scheduling — parallel rollout coordination
- Wandb integration — metrics logging and rollout visualisation
- CLI interface — three subcommands: `serve`, `process`, `evaluate`
- Eval logging — `evaluate_log()` saves results to JSON + JSONL
HermesAgentBaseEnv
The hermes-agent layer (environments/hermes_base_env.py). Adds:
- Terminal backend configuration — sets `TERMINAL_ENV` for sandboxed execution (local, Docker, Modal, Daytona, SSH, Singularity)
- Tool resolution — `_resolve_tools_for_group()` calls hermes-agent's `get_tool_definitions()` to get the right tool schemas based on enabled/disabled toolsets
- Agent loop integration — `collect_trajectory()` runs `HermesAgentLoop` and scores the result
- Two-phase operation — Phase 1 (OpenAI server) for eval/SFT, Phase 2 (VLLM ManagedServer) for full RL with logprobs
- Async safety patches — monkey-patches Modal backend to work inside Atropos's event loop
Concrete Environments
Your environment inherits from HermesAgentBaseEnv and implements five methods:
| Method | Purpose |
|---|---|
| `setup()` | Load dataset, initialise state |
| `get_next_item()` | Return the next item for rollout |
| `format_prompt(item)` | Convert an item into the user message |
| `compute_reward(item, result, ctx)` | Score the rollout (0.0–1.0) |
| `evaluate()` | Periodic evaluation logic |
Core Components
Agent Loop
HermesAgentLoop (environments/agent_loop.py) is the reusable multi-turn agent engine. It runs the same tool-calling pattern as hermes-agent's main loop:
1. Send messages + tool schemas to the API via `server.chat_completion()`
2. If the response contains `tool_calls`, dispatch each via `handle_function_call()`
3. Append tool results to the conversation, go back to step 1
4. If no `tool_calls`, the agent is done
Tool calls execute in a thread pool (ThreadPoolExecutor(128)) so that async backends (Modal, Docker) don't deadlock inside Atropos's event loop.
Returns an AgentResult:
@dataclass
class AgentResult:
    messages: List[Dict[str, Any]]           # Full conversation history
    turns_used: int                          # Number of LLM calls made
    finished_naturally: bool                 # True if model stopped on its own
    reasoning_per_turn: List[Optional[str]]  # Extracted reasoning content
    tool_errors: List[ToolError]             # Errors encountered during tool dispatch
    managed_state: Optional[Dict]            # VLLM ManagedServer state (Phase 2)
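The send → dispatch → append → repeat pattern above can be sketched as follows. This is a simplified, synchronous stand-in: `fake_chat_completion` and `fake_handle_function_call` are hypothetical placeholders for `server.chat_completion()` and `handle_function_call()`, and the pool size is reduced from the real 128.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def fake_chat_completion(messages):
    # Hypothetical API stand-in: first call requests a tool, second finishes.
    if not any(m["role"] == "tool" for m in messages):
        return {"content": None,
                "tool_calls": [{"id": "1", "name": "echo",
                                "arguments": {"text": "hi"}}]}
    return {"content": "done", "tool_calls": None}

def fake_handle_function_call(call):
    # Hypothetical tool dispatcher stand-in.
    return json.dumps({"echoed": call["arguments"]["text"]})

def run_agent_loop(messages, max_turns=10, pool_size=4):
    """Simplified version of the send -> dispatch -> append -> repeat loop."""
    executor = ThreadPoolExecutor(pool_size)
    turns = 0
    finished_naturally = False
    for _ in range(max_turns):
        turns += 1
        response = fake_chat_completion(messages)
        messages.append({"role": "assistant",
                         "content": response["content"],
                         "tool_calls": response["tool_calls"]})
        if not response["tool_calls"]:
            finished_naturally = True
            break
        # Dispatch tool calls in the thread pool so blocking backends
        # don't stall the event loop, then append results and repeat.
        futures = [executor.submit(fake_handle_function_call, c)
                   for c in response["tool_calls"]]
        for call, fut in zip(response["tool_calls"], futures):
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": fut.result()})
    executor.shutdown(wait=False)
    return messages, turns, finished_naturally
```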
Tool Context
ToolContext (environments/tool_context.py) gives reward functions direct access to the same sandbox the model used during its rollout. The task_id scoping means all state (files, processes, browser tabs) is preserved.
async def compute_reward(self, item, result, ctx: ToolContext):
    # Run tests in the model's terminal sandbox
    test = ctx.terminal("pytest -v")
    if test["exit_code"] == 0:
        return 1.0
    # Check if a file was created
    content = ctx.read_file("/workspace/solution.py")
    if content.get("content"):
        return 0.5
    # Download files for local verification
    ctx.download_file("/remote/output.bin", "/local/output.bin")
    return 0.0
Available methods:
| Category | Methods |
|---|---|
| Terminal | terminal(command, timeout) |
| Files | read_file(path), write_file(path, content), search(query, path) |
| Transfers | upload_file(), upload_dir(), download_file(), download_dir() |
| Web | web_search(query), web_extract(urls) |
| Browser | browser_navigate(url), browser_snapshot() |
| Generic | call_tool(name, args) — escape hatch for any hermes-agent tool |
| Cleanup | cleanup() — release all resources |
Tool Call Parsers
For Phase 2 (VLLM ManagedServer), the server returns raw text without structured tool calls. Client-side parsers in environments/tool_call_parsers/ extract tool_calls from raw output:
from environments.tool_call_parsers import get_parser
parser = get_parser("hermes") # or "mistral", "llama3_json", "qwen", "deepseek_v3", etc.
content, tool_calls = parser.parse(raw_model_output)
Available parsers: hermes, mistral, llama3_json, qwen, qwen3_coder, deepseek_v3, deepseek_v3_1, kimi_k2, longcat, glm45, glm47.
In Phase 1 (OpenAI server type), parsers are not needed — the server handles tool call parsing natively.
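As an illustration of what a Phase 2 parser does (not the actual implementation in environments/tool_call_parsers/hermes_parser.py), a minimal Hermes-format extractor might look like:

```python
import json
import re

# The Hermes/ChatML convention wraps each call in <tool_call>...</tool_call>
# tags containing a JSON object; the parser recovers structured tool_calls
# from raw text and returns the remaining prose as content.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_hermes(raw: str):
    tool_calls = [json.loads(m) for m in TOOL_CALL_RE.findall(raw)]
    content = TOOL_CALL_RE.sub("", raw).strip()
    return content, tool_calls
```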
Available Benchmarks
TerminalBench2
89 challenging terminal tasks with per-task Docker sandbox environments.
| What it tests | Single-task coding/sysadmin ability |
| Scoring | Binary pass/fail (test suite verification) |
| Sandbox | Modal cloud sandboxes (per-task Docker images) |
| Tools | terminal + file |
| Tasks | 89 tasks across multiple categories |
| Cost | ~$50–200 for full eval (parallel execution) |
| Time | ~2–4 hours |
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
--config environments/benchmarks/terminalbench_2/default.yaml
# Run specific tasks
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
--config environments/benchmarks/terminalbench_2/default.yaml \
--env.task_filter fix-git,git-multibranch
Dataset: NousResearch/terminal-bench-2 on HuggingFace.
TBLite (OpenThoughts Terminal Bench Lite)
100 difficulty-calibrated tasks — a faster proxy for TerminalBench2.
| What it tests | Same as TB2 (coding/sysadmin), calibrated difficulty tiers |
| Scoring | Binary pass/fail |
| Sandbox | Modal cloud sandboxes |
| Tools | terminal + file |
| Tasks | 100 tasks: Easy (40), Medium (26), Hard (26), Extreme (8) |
| Correlation | r=0.911 with full TB2 |
| Speed | 2.6–8× faster than TB2 |
python environments/benchmarks/tblite/tblite_env.py evaluate \
--config environments/benchmarks/tblite/default.yaml
TBLite is a thin subclass of TerminalBench2 — only the dataset and timeouts differ. Created by the OpenThoughts Agent team (Snorkel AI + Bespoke Labs). Dataset: NousResearch/openthoughts-tblite.
YC-Bench
Long-horizon strategic benchmark — the agent plays CEO of an AI startup.
| What it tests | Multi-turn strategic coherence over hundreds of turns |
| Scoring | Composite: 0.5 × survival + 0.5 × normalised_funds |
| Sandbox | Local terminal (no Modal needed) |
| Tools | terminal only |
| Runs | 9 default (3 presets × 3 seeds), sequential |
| Cost | ~$50–200 for full eval |
| Time | ~3–6 hours |
# Install yc-bench (optional dependency)
pip install "hermes-agent[yc-bench]"
# Run evaluation
bash environments/benchmarks/yc_bench/run_eval.sh
# Or directly
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
--config environments/benchmarks/yc_bench/default.yaml
# Quick single-preset test
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
--config environments/benchmarks/yc_bench/default.yaml \
--env.presets '["fast_test"]' --env.seeds '[1]'
YC-Bench uses collinear-ai/yc-bench — a deterministic simulation with 4 skill domains (research, inference, data_environment, training), prestige system, employee management, and financial pressure. Unlike TB2's per-task binary scoring, YC-Bench measures whether an agent can maintain coherent strategy over hundreds of compounding decisions.
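The composite score reduces to a simple weighted sum. A sketch, where the clamp-and-divide normalisation of funds is an assumption for illustration rather than yc-bench's exact formula:

```python
def yc_bench_score(survived: bool, funds: float, funds_cap: float) -> float:
    """Composite score: 0.5 * survival + 0.5 * normalised_funds.

    The normalisation here (clamp funds to [0, funds_cap], then divide)
    is an assumption, not yc-bench's actual implementation.
    """
    survival = 1.0 if survived else 0.0
    normalised_funds = max(0.0, min(funds, funds_cap)) / funds_cap
    return 0.5 * survival + 0.5 * normalised_funds
```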
Training Environments
TerminalTestEnv
A minimal self-contained environment with inline tasks (no external dataset). Used for validating the full stack end-to-end. Each task asks the model to create a file at a known path; the verifier checks the content.
# Process mode (saves rollouts to JSONL, no training server needed)
python environments/terminal_test_env/terminal_test_env.py process \
--env.data_path_to_save_groups terminal_test_output.jsonl
# Serve mode (connects to Atropos API for RL training)
python environments/terminal_test_env/terminal_test_env.py serve
HermesSweEnv
SWE-bench style training environment. The model gets a coding task, uses terminal + file + web tools to solve it, and the reward function runs tests in the same Modal sandbox.
python environments/hermes_swe_env/hermes_swe_env.py serve \
--openai.model_name YourModel \
--env.dataset_name bigcode/humanevalpack \
--env.terminal_backend modal
Running Environments
Every environment is a standalone Python script with three CLI subcommands:
evaluate — Run a benchmark
For eval-only environments (benchmarks). Runs all items, computes metrics, logs to wandb.
python environments/benchmarks/tblite/tblite_env.py evaluate \
--config environments/benchmarks/tblite/default.yaml \
--openai.model_name anthropic/claude-sonnet-4.6
No training server or run-api needed. The environment handles everything.
process — Generate SFT data
Runs rollouts and saves scored trajectories to JSONL. Useful for generating training data without a full RL loop.
python environments/terminal_test_env/terminal_test_env.py process \
--env.data_path_to_save_groups output.jsonl \
--openai.model_name anthropic/claude-sonnet-4.6
Output format: each line is a scored trajectory with the full conversation history, reward, and metadata.
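A common follow-up is filtering the JSONL by reward before SFT. A sketch, assuming each line carries a top-level `reward` field and a `messages` list (inspect one line of your own output to confirm the exact schema):

```python
import json

def load_high_reward_trajectories(path: str, min_reward: float = 0.5):
    """Keep only the conversations whose rollout scored >= min_reward."""
    kept = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("reward", 0.0) >= min_reward:
                kept.append(record["messages"])
    return kept
```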
serve — Connect to Atropos for RL training
Connects the environment to a running Atropos API server (run-api). Used during live RL training.
# Terminal 1: Start the Atropos API
run-api
# Terminal 2: Start the environment
python environments/hermes_swe_env/hermes_swe_env.py serve \
--openai.model_name YourModel
The environment receives items from Atropos, runs agent rollouts, computes rewards, and sends scored trajectories back for training.
Two-Phase Operation
Phase 1: OpenAI Server (Eval / SFT)
Uses server.chat_completion() with tools= parameter. The server (VLLM, SGLang, OpenRouter, OpenAI) handles tool call parsing natively. Returns ChatCompletion objects with structured tool_calls.
- Use for: evaluation, SFT data generation, benchmarks, testing
- Placeholder tokens are created for the Atropos pipeline (since real token IDs aren't available from the OpenAI API)
Phase 2: VLLM ManagedServer (Full RL)
Uses ManagedServer for exact token IDs + logprobs via /generate. A client-side tool call parser reconstructs structured tool_calls from raw output.
- Use for: full RL training with GRPO/PPO
- Real tokens, masks, and logprobs flow through the pipeline
- Set
tool_call_parserin config to match your model's format (e.g.,"hermes","qwen","mistral")
Creating Environments
Training Environment
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
from atroposlib.envs.server_handling.server_manager import APIServerConfig


class MyEnvConfig(HermesAgentEnvConfig):
    my_custom_field: str = "default_value"


class MyEnv(HermesAgentBaseEnv):
    name = "my-env"
    env_config_cls = MyEnvConfig

    @classmethod
    def config_init(cls):
        env_config = MyEnvConfig(
            enabled_toolsets=["terminal", "file"],
            terminal_backend="modal",
            max_agent_turns=30,
        )
        server_configs = [APIServerConfig(
            base_url="https://openrouter.ai/api/v1",
            model_name="anthropic/claude-sonnet-4.6",
            server_type="openai",
        )]
        return env_config, server_configs

    async def setup(self):
        from datasets import load_dataset
        self.dataset = list(load_dataset("my-dataset", split="train"))
        self.iter = 0

    async def get_next_item(self):
        item = self.dataset[self.iter % len(self.dataset)]
        self.iter += 1
        return item

    def format_prompt(self, item):
        return item["instruction"]

    async def compute_reward(self, item, result, ctx):
        # ctx gives full tool access to the rollout's sandbox
        test = ctx.terminal("pytest -v")
        return 1.0 if test["exit_code"] == 0 else 0.0

    async def evaluate(self, *args, **kwargs):
        # Periodic evaluation during training
        pass


if __name__ == "__main__":
    MyEnv.cli()
Eval-Only Benchmark
For benchmarks, follow the pattern used by TerminalBench2, TBLite, and YC-Bench:
- Create under `environments/benchmarks/your-benchmark/`
- Set eval-only config: `eval_handling=STOP_TRAIN`, `steps_per_eval=1`, `total_steps=1`
- Stub training methods: `collect_trajectories()` returns `(None, [])`, `score()` returns `None`
- Implement `rollout_and_score_eval(eval_item)` — the per-item agent loop + scoring
- Implement `evaluate()` — orchestrates all runs, computes aggregate metrics
- Add streaming JSONL for crash-safe result persistence
- Add cleanup: `KeyboardInterrupt` handling, `cleanup_all_environments()`, `_tool_executor.shutdown()`
- Run with the `evaluate` subcommand
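Putting the stubs above together, the skeleton of an eval-only benchmark might look like this. The `HermesAgentBaseEnv` here is a bare stand-in so the sketch runs standalone; in real code, import it from `environments/hermes_base_env.py` (whose exact method signatures may differ):

```python
import asyncio


class HermesAgentBaseEnv:  # stand-in for environments.hermes_base_env
    pass


class MyBenchmarkEnv(HermesAgentBaseEnv):
    name = "my-benchmark"

    async def collect_trajectories(self, item):
        # Eval-only: the training-side collection path is stubbed out.
        return None, []

    async def score(self, rollout_group_data):
        return None

    async def rollout_and_score_eval(self, eval_item):
        # Per-item agent loop + scoring would go here.
        raise NotImplementedError

    async def evaluate(self, *args, **kwargs):
        # Orchestrates all eval items, aggregates metrics,
        # and streams results to JSONL.
        raise NotImplementedError
```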
See environments/benchmarks/yc_bench/yc_bench_env.py for a clean, well-documented reference implementation.
Configuration Reference
HermesAgentEnvConfig Fields
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled_toolsets` | `List[str]` | `None` (all) | Which hermes toolsets to enable |
| `disabled_toolsets` | `List[str]` | `None` | Toolsets to filter out |
| `distribution` | `str` | `None` | Probabilistic toolset distribution name |
| `max_agent_turns` | `int` | `30` | Max LLM calls per rollout |
| `agent_temperature` | `float` | `1.0` | Sampling temperature |
| `system_prompt` | `str` | `None` | System message for the agent |
| `terminal_backend` | `str` | `"local"` | `local`, `docker`, `modal`, `daytona`, `ssh`, `singularity` |
| `terminal_timeout` | `int` | `120` | Seconds per terminal command |
| `terminal_lifetime` | `int` | `3600` | Max sandbox lifetime (seconds) |
| `dataset_name` | `str` | `None` | HuggingFace dataset identifier |
| `tool_pool_size` | `int` | `128` | Thread pool size for tool execution |
| `tool_call_parser` | `str` | `"hermes"` | Parser for Phase 2 raw output |
| `extra_body` | `Dict` | `None` | Extra params for OpenAI API (e.g., OpenRouter provider prefs) |
| `eval_handling` | `Enum` | `STOP_TRAIN` | `STOP_TRAIN`, `LIMIT_TRAIN`, `NONE` |
YAML Configuration
Environments can be configured via YAML files passed with --config:
env:
  enabled_toolsets: ["terminal", "file"]
  max_agent_turns: 60
  max_token_length: 32000
  agent_temperature: 0.8
  terminal_backend: "modal"
  terminal_timeout: 300
  dataset_name: "NousResearch/terminal-bench-2"
  tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
  use_wandb: true
  wandb_name: "my-benchmark"

openai:
  base_url: "https://openrouter.ai/api/v1"
  model_name: "anthropic/claude-sonnet-4.6"
  server_type: "openai"
  health_check: false
YAML values override config_init() defaults. CLI arguments override YAML values:
python my_env.py evaluate \
--config my_config.yaml \
--openai.model_name anthropic/claude-opus-4.6 # overrides YAML
Prerequisites
For all environments
- Python >= 3.11
- `atroposlib`: `pip install git+https://github.com/NousResearch/atropos.git`
- An LLM API key (OpenRouter, OpenAI, or self-hosted VLLM/SGLang)
For Modal-sandboxed benchmarks (TB2, TBLite)
- Modal account and CLI: `pip install "hermes-agent[modal]"`
- `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` environment variables
For YC-Bench
- `pip install "hermes-agent[yc-bench]"` (installs the yc-bench CLI + SQLAlchemy)
- No Modal needed — runs with the local terminal backend
For RL training
- `TINKER_API_KEY` — API key for the Tinker training service
- `WANDB_API_KEY` — for Weights & Biases metrics tracking
- The `tinker-atropos` submodule (at `tinker-atropos/` in the repo)
See RL Training for the agent-driven RL workflow.
Directory Structure
environments/
├── hermes_base_env.py # Abstract base class (HermesAgentBaseEnv)
├── agent_loop.py # Multi-turn agent engine (HermesAgentLoop)
├── tool_context.py # Per-rollout tool access for reward functions
├── patches.py # Async-safety patches for Modal backend
│
├── tool_call_parsers/ # Phase 2 client-side parsers
│ ├── hermes_parser.py # Hermes/ChatML <tool_call> format
│ ├── mistral_parser.py # Mistral [TOOL_CALLS] format
│ ├── llama_parser.py # Llama 3 JSON tool calling
│ ├── qwen_parser.py # Qwen format
│ ├── deepseek_v3_parser.py # DeepSeek V3 format
│ └── ... # + kimi_k2, longcat, glm45/47, etc.
│
├── terminal_test_env/ # Stack validation (inline tasks)
├── hermes_swe_env/ # SWE-bench training environment
│
└── benchmarks/ # Evaluation benchmarks
├── terminalbench_2/ # 89 terminal tasks, Modal sandboxes
├── tblite/ # 100 calibrated tasks (fast TB2 proxy)
└── yc_bench/ # Long-horizon strategic benchmark