Skip to main content

Context Compression & Prompt Caching

Hermes manages long conversations with two complementary mechanisms:

  • prompt caching
  • context compression

Primary files:

  • agent/prompt_caching.py
  • agent/context_compressor.py
  • run_agent.py

Prompt caching

For Anthropic/native and Claude-via-OpenRouter flows, Hermes applies Anthropic-style cache markers.

Current strategy:

  • cache the system prompt
  • cache the last 3 non-system messages
  • default TTL is 5 minutes unless explicitly extended

This is implemented in agent/prompt_caching.py.

Why prompt stability matters

Prompt caching only helps when the stable prefix remains stable. That is why Hermes avoids rebuilding or mutating the core system prompt mid-session unless it has to.

Compression trigger

Hermes can compress context when conversations become large. Configuration defaults live in config.yaml, and the compressor also has runtime checks based on actual prompt token counts.

Compression algorithm

The compressor protects:

  • the first N turns
  • the last N turns

and summarizes the middle section.

It also cleans up structural issues such as orphaned tool-call/result pairs so the API never receives invalid conversation structure after compression.

Pre-compression memory flush

Before compression, Hermes can give the model one last chance to persist memory so facts are not lost when middle turns are summarized away.

Session lineage after compression

Compression can split the session into a new session ID while preserving parent lineage in the state DB.

This lets Hermes continue operating with a smaller active context while retaining a searchable ancestry chain.

Re-injected state after compression

After compression, Hermes may re-inject compact operational state such as:

  • todo snapshot
  • prior-read-files summary