Voice & TTS

Hermes Agent supports both text-to-speech output and voice message transcription across all messaging platforms.

Text-to-Speech

Convert text to speech with three providers:

Provider | Quality | Cost | API Key
Edge TTS (default) | Good | Free | None needed
ElevenLabs | Excellent | Paid | ELEVENLABS_API_KEY
OpenAI TTS | Good | Paid | VOICE_TOOLS_OPENAI_KEY

Platform Delivery

Platform | Delivery | Format
Telegram | Voice bubble (plays inline) | Opus .ogg
Discord | Audio file attachment | MP3
WhatsApp | Audio file attachment | MP3
CLI | Saved to ~/.hermes/audio_cache/ | MP3

Configuration

# In ~/.hermes/config.yaml
tts:
  provider: "edge" # "edge" | "elevenlabs" | "openai"
  edge:
    voice: "en-US-AriaNeural" # 322 voices, 74 languages
  elevenlabs:
    voice_id: "pNInz6obpgDQGcFmaJgB" # Adam
    model_id: "eleven_multilingual_v2"
  openai:
    model: "gpt-4o-mini-tts"
    voice: "alloy" # alloy, echo, fable, onyx, nova, shimmer
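The provider table and the configuration above imply a simple rule: each paid provider needs its API key before it can be selected. The following is a minimal Python sketch of that check, not Hermes Agent's actual implementation. The function name `missing_key` and the idea of reading keys from the process environment are assumptions made for illustration; only the provider names and key names come from this page.

```python
import os

# Required environment variable per provider, taken from the provider table.
# None means the provider needs no API key.
REQUIRED_KEY = {
    "edge": None,
    "elevenlabs": "ELEVENLABS_API_KEY",
    "openai": "VOICE_TOOLS_OPENAI_KEY",
}


def missing_key(tts_config, env=os.environ):
    """Return the name of the missing API key variable, or None if nothing is missing.

    `tts_config` is the parsed `tts:` mapping from config.yaml (hypothetical helper).
    """
    # Edge TTS is documented as the default provider.
    provider = tts_config.get("provider", "edge")
    if provider not in REQUIRED_KEY:
        raise ValueError(f"unknown TTS provider: {provider!r}")
    key = REQUIRED_KEY[provider]
    if key is None or env.get(key):
        return None
    return key
```

For example, `missing_key({"provider": "openai"}, {})` returns `"VOICE_TOOLS_OPENAI_KEY"`, which a caller could use to print a setup hint before attempting synthesis.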

Telegram Voice Bubbles & ffmpeg

Telegram voice bubbles require Opus/OGG audio format:

  • OpenAI and ElevenLabs produce Opus natively, so no extra setup is needed
  • Edge TTS (default) outputs MP3 and needs ffmpeg to convert:
# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Fedora
sudo dnf install ffmpeg

Without ffmpeg, Edge TTS audio is sent as a regular audio file (playable, but shows as a rectangular player instead of a voice bubble).
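The fallback described above can be sketched in Python as follows. This illustrates the behavior and is not the agent's actual code: the function names are hypothetical, and the 64k bitrate is an assumed value. The ffmpeg options used (`-y`, `-i`, `-c:a libopus`, `-b:a`) are standard ffmpeg flags.

```python
import shutil
import subprocess
from pathlib import Path


def ffmpeg_opus_command(src, dst):
    """Build an ffmpeg command that re-encodes src as Opus in an OGG container."""
    # The 64k bitrate is an assumption chosen for speech; adjust as needed.
    return ["ffmpeg", "-y", "-i", str(src), "-c:a", "libopus", "-b:a", "64k", str(dst)]


def prepare_telegram_audio(mp3_path, which=shutil.which, run=subprocess.run):
    """Return (path, is_voice_bubble) for an Edge TTS MP3 file.

    `which` and `run` are injectable so the logic can be exercised without ffmpeg.
    """
    mp3_path = Path(mp3_path)
    if which("ffmpeg") is None:
        # No ffmpeg available: the MP3 is sent as a regular audio file.
        return mp3_path, False
    ogg_path = mp3_path.with_suffix(".ogg")
    # check=True raises CalledProcessError if the conversion fails.
    run(ffmpeg_opus_command(mp3_path, ogg_path), check=True, capture_output=True)
    return ogg_path, True
```

Checking for the binary with `shutil.which` before converting keeps the missing-ffmpeg case a normal code path rather than an exception, which matches the documented behavior of still delivering playable audio.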

Tip: If you want voice bubbles without installing ffmpeg, switch to the OpenAI or ElevenLabs provider.

Voice Message Transcription

Voice messages sent on Telegram, Discord, WhatsApp, or Slack are automatically transcribed and injected as text into the conversation. The agent sees the transcript as normal text.

Provider | Model | Quality | Cost
OpenAI Whisper | whisper-1 (default) | Good | Low
OpenAI GPT-4o | gpt-4o-mini-transcribe | Better | Medium
OpenAI GPT-4o | gpt-4o-transcribe | Best | Higher

Transcription requires VOICE_TOOLS_OPENAI_KEY to be set in ~/.hermes/.env.

Configuration

# In ~/.hermes/config.yaml
stt:
  enabled: true
  model: "whisper-1"
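To show how the `stt` settings might drive a transcription call, here is a hedged Python sketch. It assumes a client object shaped like the official OpenAI Python SDK, whose `client.audio.transcriptions.create(model=..., file=...)` call returns an object with a `text` attribute. The function name `transcribe_voice_message` and the treatment of a missing `enabled` key as true are illustrative assumptions, not documented Hermes Agent behavior.

```python
def transcribe_voice_message(audio_path, client, stt_config):
    """Return the transcript text, or None when transcription is disabled.

    `client` is expected to look like an OpenAI SDK client (assumption);
    `stt_config` is the parsed `stt:` mapping from config.yaml.
    """
    if not stt_config.get("enabled", True):
        return None
    # whisper-1 is documented as the default model.
    model = stt_config.get("model", "whisper-1")
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text
```

With the real SDK, the client would typically be built as `OpenAI(api_key=os.environ["VOICE_TOOLS_OPENAI_KEY"])` after loading ~/.hermes/.env into the environment. Passing the client in as a parameter keeps the function easy to test with a stand-in object.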