Hermes S6 Container Supervision

Modify, debug, or extend the s6-overlay supervision tree inside the Hermes Agent Docker image — adding new services, debugging profile gateways, understanding the Architecture B main-program pattern.

Skill metadata


Source	Bundled (installed by default)
Path	`skills/software-development/hermes-s6-container-supervision`
Version	`1.0.0`
Author	Hermes Agent
License	MIT
Tags	`docker`, `s6`, `supervision`, `gateway`, `profiles`
Related skills	`hermes-agent`, `hermes-agent-dev`

Reference: full SKILL.md

info

The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.

Hermes s6-overlay Container Supervision

When to use this skill

Load this skill when you're working on:

Adding or removing a static service in the Hermes Docker image (something that should be supervised at every container start, like the dashboard)
Diagnosing why a per-profile gateway isn't starting, restarting, or surviving docker restart
Understanding why the container's CMD is /opt/hermes/docker/main-wrapper.sh and how leading-dash args reach the user's program
Modifying cont-init.d boot scripts (UID remap, volume seeding, profile reconciliation)
Changing the rendered run-script for per-profile gateways (Phase 4)

If you're just running the Hermes Agent and want to use Docker, see website/docs/user-guide/docker.md instead.

Architecture at a glance

/init                                  ← PID 1 (s6-overlay v3.2.3.0)
├── cont-init.d                        ← oneshot setup, runs as root
│   ├── 01-hermes-setup                ← docker/stage2-hook.sh
│   │   ├── UID/GID remap
│   │   ├── chown /opt/data
│   │   ├── chown /opt/data/profiles (every boot)
│   │   ├── seed .env / config.yaml / SOUL.md
│   │   └── skills_sync.py
│   └── 02-reconcile-profiles          ← hermes_cli.container_boot
│       ├── chown /run/service (hermes-writable for runtime register)
│       └── walk $HERMES_HOME/profiles/<name>/gateway_state.json
│           → recreate /run/service/gateway-<name>/
│           → auto-start only those with prior_state == "running"
│
├── s6-rc.d (static services, in /etc/s6-overlay/s6-rc.d/)
│   ├── main-hermes/run                ← exec sleep infinity (no-op slot)
│   └── dashboard/run                  ← if HERMES_DASHBOARD=1, runs `hermes dashboard`
│
├── /run/service (s6-svscan watches; tmpfs)
│   ├── gateway-coder/                 ← runtime-registered per-profile
│   │   ├── type        ("longrun")
│   │   ├── run         ("#!/command/with-contenv sh ... exec s6-setuidgid hermes hermes -p coder gateway run")
│   │   ├── down        (marker — present means "registered but don't auto-start")
│   │   └── log/run     (s6-log → $HERMES_HOME/logs/gateways/coder/current)
│   └── ...
│
└── CMD ("main program")               ← /opt/hermes/docker/main-wrapper.sh
    └── routes user args: bare exec | hermes subcommand | hermes (no args)
        — exec'd by /init with stdin/stdout/stderr inherited (TTY for --tui)

Key files

Path	Role
`Dockerfile`	s6-overlay install + cont-init.d wiring + `ENTRYPOINT ["/init", "/opt/hermes/docker/main-wrapper.sh"]`
`docker/stage2-hook.sh`	The "old entrypoint logic" — UID remap, chown, seed, skills sync. Runs as cont-init.d/01-hermes-setup.
`docker/cont-init.d/02-reconcile-profiles`	Calls `hermes_cli.container_boot` on every boot to restore profile gateway slots from the persistent volume.
`docker/main-wrapper.sh`	The container's CMD. Routes user args, drops to hermes via `s6-setuidgid`, exec's the chosen program.
`docker/s6-rc.d/main-hermes/run`	No-op `sleep infinity` — slot exists so the s6-rc user bundle is valid; main hermes runs as the CMD, not as a supervised service.
`docker/s6-rc.d/dashboard/run`	Conditional service — `exec sleep infinity` unless `HERMES_DASHBOARD` is truthy.
`docker/entrypoint.sh`	Back-compat shim that `exec`s the stage2 hook. External scripts that hard-coded the old entrypoint path still work.
`hermes_cli/service_manager.py`	`S6ServiceManager`: `register_profile_gateway`, `unregister_profile_gateway`, `start/stop/restart/is_running`, `list_profile_gateways`.
`hermes_cli/container_boot.py`	`reconcile_profile_gateways()` — walks persistent profiles, regenerates s6 slots, emits `container-boot.log`.
`hermes_cli/gateway.py::_dispatch_via_service_manager_if_s6`	Intercepts `hermes gateway start/stop/restart` and routes to s6 when running in a container.

Why Architecture B (CMD as main program, not s6-supervised)

The original plan (v1–v3) called for main hermes to run as a supervised s6-rc service. Two real s6-overlay v3 mechanics blocked that:

cont-init.d scripts receive no CMD args — so the stage2 hook can't parse docker run <image> chat -q "hi" to set HERMES_ARGS for a service run script to consume.
/run/s6/basedir/bin/halt does NOT propagate the exit code written to /run/s6-linux-init-container-results/exitcode. Containers always exit 143 (SIGTERM) regardless. Confirmed by skarnet (s6 author) in issue #477: "if you want a container shutdown, you need to either have your CMD exit, or, if you have no CMD, write the container exit code you want then call halt".

So we use the s6-overlay-native CMD pattern: ENTRYPOINT ["/init", "/opt/hermes/docker/main-wrapper.sh"]. /init prepends the wrapper to user args automatically — so docker run <image> --version becomes /init main-wrapper.sh --version, and --version doesn't get intercepted by /init's POSIX shell. The wrapper drops to hermes via s6-setuidgid, then exec's the chosen program. The program's exit code becomes the container exit code, exactly matching the pre-s6 tini contract.

Trade-off: main hermes is unsupervised under s6. That exactly matches its behavior under tini (the pre-s6 image). Dashboard supervision is the only new guarantee — and per-profile gateways under /run/service/ get full supervision.

Quick recipes

Verify s6 is PID 1 in a running container

docker exec <c> sh -c 'cat /proc/1/comm; readlink /proc/1/exe'
# Expect: s6-svscan or init / /package/admin/s6/.../s6-svscan

Inspect a profile gateway service

# /command/ isn't on docker-exec PATH — use absolute path
docker exec <c> /command/s6-svstat /run/service/gateway-<name>
# "up (pid …) … seconds"            → running
# "down (exitcode N) … seconds, normally up, want up, …" → s6 wants it up but the process keeps exiting (crash loop)
# "down … normally up, ready …"     → user stopped it

Bring a service up/down manually

docker exec <c> /command/s6-svc -u /run/service/gateway-<name>   # up
docker exec <c> /command/s6-svc -d /run/service/gateway-<name>   # down
docker exec <c> /command/s6-svc -t /run/service/gateway-<name>   # SIGTERM (restart)

Watch the cont-init reconciler log

docker exec <c> tail -n 50 /opt/data/logs/container-boot.log
# 2026-05-21T06:18:05+0000 profile=coder prior_state=running action=started
# 2026-05-21T06:18:05+0000 profile=writer prior_state=stopped action=registered

Add a new static service

Create docker/s6-rc.d/<name>/type with longrun\n and docker/s6-rc.d/<name>/run (use #!/command/with-contenv sh + # shellcheck shell=sh).
Drop to hermes via s6-setuidgid hermes at the top of run (unless you specifically need root).
Create empty docker/s6-rc.d/<name>/dependencies.d/base so it waits for the base bundle.
Create empty docker/s6-rc.d/user/contents.d/<name> so it joins the user bundle.
The COPY docker/s6-rc.d/ in the Dockerfile picks it up automatically — no other changes.

Change the per-profile gateway run command

Edit S6ServiceManager._render_run_script in hermes_cli/service_manager.py. The function is also called by hermes_cli/container_boot.py::_register_service during boot reconciliation, so it's the single source of truth. Update the corresponding assertion in tests/hermes_cli/test_service_manager.py::test_s6_register_creates_service_dir_and_triggers_scan.

Run the docker test harness

docker build -t hermes-agent-harness:latest .
HERMES_TEST_IMAGE=hermes-agent-harness:latest scripts/run_tests.sh tests/docker/ -v
# Expect 19 passed, 0 xfailed against the s6 image

The harness lives in tests/docker/ and skips when Docker isn't available. The per-test timeout is bumped to 180s (see tests/docker/conftest.py).

Common pitfalls

"command not found" via `docker exec`

/command/ (where s6-overlay puts its binaries) is on PATH only for processes spawned by the supervision tree — services, cont-init.d, main-wrapper.sh. docker exec <c> s6-svstat … will fail with "command not found"; always use the absolute path /command/s6-svstat. The hermes binary works because the Dockerfile adds /opt/hermes/.venv/bin to the runtime ENV PATH.

Profile directory ownership

The cont-init reconciler runs as hermes (s6-setuidgid hermes in 02-reconcile-profiles). If a profile dir ends up root-owned (e.g. because docker exec <c> hermes profile create … ran as root by default), the reconciler can't read SOUL.md and fails with PermissionError. Mitigation: stage2-hook.sh chowns $HERMES_HOME/profiles to hermes on every boot, idempotently. Don't remove that block.

Files written by `docker exec` are root-owned

docker exec defaults to root. Either pass --user hermes or rely on the stage2 chown sweep next reboot. Don't write files under $HERMES_HOME/profiles/<name>/ as root manually — the next reconcile pass will sweep them but in-flight operations may hit perm errors.

Service slot exists but s6-svstat says "s6-supervise not running"

The service directory is on tmpfs and was wiped on container restart. Either the cont-init reconciler hasn't run yet (give it a moment after docker restart) or it failed. Check docker logs <c> | grep '02-reconcile'.

Gateway starts then immediately exits (`down (exitcode 1)` in svstat)

Most likely the profile has no model or auth configured. The service slot is correct — the gateway itself is unconfigured. Run hermes -p <profile> setup first. The s6 supervisor will keep restarting it; that's the desired behavior (when you fix the config, the next attempt succeeds and stays up).

Reconciler skipped a profile

The reconciler keys on the presence of SOUL.md as the "real profile" marker. hermes profile create always seeds it. If a profile dir is missing SOUL.md (stray directory, partial restore, backup-in-progress), the reconciler skips it intentionally. Add a SOUL.md (even empty) to opt back in.

"Help, the container exits 143!"

Check whether something is invoking s6-svscanctl -t or /run/s6/basedir/bin/halt — both cause /init to begin stage 3 shutdown but return 143 (SIGTERM) rather than the desired exit code. This was the Phase 2 architecture pivot from A to B. For container shutdown with a real exit code, you must let the CMD (main-wrapper.sh) exit normally; do not try to control exit from a finish script.

hermes-agent-dev: General hermes-agent codebase navigation
hermes-tool-quirks: Specific Hermes-tool workarounds (sed/grep/etc.) — load when debugging the s6 stack's interaction with hermes built-in tools.

Skill metadata​

Reference: full SKILL.md​

Hermes s6-overlay Container Supervision

When to use this skill​

Architecture at a glance​

Key files​

Why Architecture B (CMD as main program, not s6-supervised)​

Quick recipes​

Verify s6 is PID 1 in a running container​

Inspect a profile gateway service​

Bring a service up/down manually​

Watch the cont-init reconciler log​

Add a new static service​

Change the per-profile gateway run command​

Run the docker test harness​

Common pitfalls​

"command not found" via docker exec​

Profile directory ownership​

Files written by docker exec are root-owned​

Service slot exists but s6-svstat says "s6-supervise not running"​

Gateway starts then immediately exits (down (exitcode 1) in svstat)​

Reconciler skipped a profile​

"Help, the container exits 143!"​

Related skills​