Context Compression & Prompt Caching
Hermes manages long conversations with two complementary mechanisms:
- prompt caching
- context compression
Primary files:
agent/prompt_caching.pyagent/context_compressor.pyrun_agent.py
Prompt caching
For Anthropic/native and Claude-via-OpenRouter flows, Hermes applies Anthropic-style cache markers.
Current strategy:
- cache the system prompt
- cache the last 3 non-system messages
- default TTL is 5 minutes unless explicitly extended
This is implemented in agent/prompt_caching.py.
Why prompt stability matters
Prompt caching only helps when the stable prefix remains stable. That is why Hermes avoids rebuilding or mutating the core system prompt mid-session unless it has to.
Compression trigger
Hermes can compress context when conversations become large. Configuration defaults live in config.yaml, and the compressor also has runtime checks based on actual prompt token counts.
Compression algorithm
The compressor protects:
- the first N turns
- the last N turns
and summarizes the middle section.
It also cleans up structural issues such as orphaned tool-call/result pairs so the API never receives invalid conversation structure after compression.
Pre-compression memory flush
Before compression, Hermes can give the model one last chance to persist memory so facts are not lost when middle turns are summarized away.
Session lineage after compression
Compression can split the session into a new session ID while preserving parent lineage in the state DB.
This lets Hermes continue operating with a smaller active context while retaining a searchable ancestry chain.
Re-injected state after compression
After compression, Hermes may re-inject compact operational state such as:
- todo snapshot
- prior-read-files summary