Skip to main content

API Server

The API server exposes hermes-agent as an OpenAI-compatible HTTP endpoint. Any frontend that speaks the OpenAI format — Open WebUI, LobeChat, LibreChat, NextChat, ChatBox, and hundreds more — can connect to hermes-agent and use it as a backend.

Your agent handles requests with its full toolset (terminal, file operations, web search, memory, skills) and returns the final response. Tool calls execute invisibly server-side.

Quick Start

1. Enable the API server

Add to ~/.hermes/.env:

API_SERVER_ENABLED=true

2. Start the gateway

hermes gateway

You'll see:

[API Server] API server listening on http://127.0.0.1:8642

3. Connect a frontend

Point any OpenAI-compatible client at http://localhost:8642/v1:

# Test with curl
curl http://localhost:8642/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "hermes-agent", "messages": [{"role": "user", "content": "Hello!"}]}'

Or connect Open WebUI, LobeChat, or any other frontend — see the Open WebUI integration guide for step-by-step instructions.

Endpoints

POST /v1/chat/completions

Standard OpenAI Chat Completions format. Stateless — the full conversation is included in each request via the messages array.

Request:

{
"model": "hermes-agent",
"messages": [
{"role": "system", "content": "You are a Python expert."},
{"role": "user", "content": "Write a fibonacci function"}
],
"stream": false
}

Response:

{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1710000000,
"model": "hermes-agent",
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "Here's a fibonacci function..."},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 50, "completion_tokens": 200, "total_tokens": 250}
}

Streaming ("stream": true): Returns Server-Sent Events (SSE) with token-by-token response chunks. When streaming is enabled in config, tokens are emitted live as the LLM generates them. When disabled, the full response is sent as a single SSE chunk.

POST /v1/responses

OpenAI Responses API format. Supports server-side conversation state via previous_response_id — the server stores full conversation history (including tool calls and results) so multi-turn context is preserved without the client managing it.

Request:

{
"model": "hermes-agent",
"input": "What files are in my project?",
"instructions": "You are a helpful coding assistant.",
"store": true
}

Response:

{
"id": "resp_abc123",
"object": "response",
"status": "completed",
"model": "hermes-agent",
"output": [
{"type": "function_call", "name": "terminal", "arguments": "{\"command\": \"ls\"}", "call_id": "call_1"},
{"type": "function_call_output", "call_id": "call_1", "output": "README.md src/ tests/"},
{"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": "Your project has..."}]}
],
"usage": {"input_tokens": 50, "output_tokens": 200, "total_tokens": 250}
}

Multi-turn with previous_response_id

Chain responses to maintain full context (including tool calls) across turns:

{
"input": "Now show me the README",
"previous_response_id": "resp_abc123"
}

The server reconstructs the full conversation from the stored response chain — all previous tool calls and results are preserved.

Named conversations

Use the conversation parameter instead of tracking response IDs:

{"input": "Hello", "conversation": "my-project"}
{"input": "What's in src/?", "conversation": "my-project"}
{"input": "Run the tests", "conversation": "my-project"}

The server automatically chains to the latest response in that conversation. Like the /title command for gateway sessions.

GET /v1/responses/{id}

Retrieve a previously stored response by ID.

DELETE /v1/responses/{id}

Delete a stored response.

GET /v1/models

Lists hermes-agent as an available model. Required by most frontends for model discovery.

GET /health

Health check. Returns {"status": "ok"}.

System Prompt Handling

When a frontend sends a system message (Chat Completions) or instructions field (Responses API), hermes-agent layers it on top of its core system prompt. Your agent keeps all its tools, memory, and skills — the frontend's system prompt adds extra instructions.

This means you can customize behavior per-frontend without losing capabilities:

  • Open WebUI system prompt: "You are a Python expert. Always include type hints."
  • The agent still has terminal, file tools, web search, memory, etc.

Authentication

Bearer token auth via the Authorization header:

Authorization: Bearer ***

Configure the key via API_SERVER_KEY env var. If no key is set, all requests are allowed (for local-only use).

Security

The API server gives full access to hermes-agent's toolset, including terminal commands. If you change the bind address to 0.0.0.0 (network-accessible), always set API_SERVER_KEY — without it, anyone on your network can execute arbitrary commands on your machine.

The default bind address (127.0.0.1) is safe for local-only use.

Configuration

Environment Variables

VariableDefaultDescription
API_SERVER_ENABLEDfalseEnable the API server
API_SERVER_PORT8642HTTP server port
API_SERVER_HOST127.0.0.1Bind address (localhost only by default)
API_SERVER_KEY(none)Bearer token for auth

config.yaml

# Not yet supported — use environment variables.
# config.yaml support coming in a future release.

CORS

The API server includes CORS headers on all responses (Access-Control-Allow-Origin: *), so browser-based frontends can connect directly.

Compatible Frontends

Any frontend that supports the OpenAI API format works. Tested/documented integrations:

FrontendStarsConnection
Open WebUI126kFull guide available
LobeChat73kCustom provider endpoint
LibreChat34kCustom endpoint in librechat.yaml
AnythingLLM56kGeneric OpenAI provider
NextChat87kBASE_URL env var
ChatBox39kAPI Host setting
Jan26kRemote model config
HF Chat-UI8kOPENAI_BASE_URL
big-AGI7kCustom endpoint
OpenAI Python SDKOpenAI(base_url="http://localhost:8642/v1")
curlDirect HTTP requests

Limitations

  • Response storage is in-memory — stored responses (for previous_response_id) are lost on gateway restart. Max 100 stored responses (LRU eviction).
  • No file upload — vision/document analysis via uploaded files is not yet supported through the API.
  • Model field is cosmetic — the model field in requests is accepted but the actual LLM model used is configured server-side in config.yaml.