Skip to main content

Vision & Image Paste

Hermes Agent supports multimodal vision โ€” you can paste images from your clipboard directly into the CLI and ask the agent to analyze, describe, or work with them. Images are sent to the model as base64-encoded content blocks, so any vision-capable model can process them.

How It Worksโ€‹

  1. Copy an image to your clipboard (screenshot, browser image, etc.)
  2. Attach it using one of the methods below
  3. Type your question and press Enter
  4. The image appears as a [๐Ÿ“Ž Image #1] badge above the input
  5. On submit, the image is sent to the model as a vision content block

You can attach multiple images before sending โ€” each gets its own badge. Press Ctrl+C to clear all attached images.

Images are saved to ~/.hermes/images/ as PNG files with timestamped filenames.

Paste Methodsโ€‹

How you attach an image depends on your terminal environment. Not all methods work everywhere โ€” here's the full breakdown:

/paste Commandโ€‹

The most reliable method. Works everywhere.

/paste

Type /paste and press Enter. Hermes checks your clipboard for an image and attaches it. This works in every environment because it explicitly calls the clipboard backend โ€” no terminal keybinding interception to worry about.

Ctrl+V / Cmd+V (Bracketed Paste)โ€‹

When you paste text that's on the clipboard alongside an image, Hermes automatically checks for an image too. This works when:

  • Your clipboard contains both text and an image (some apps put both on the clipboard when you copy)
  • Your terminal supports bracketed paste (most modern terminals do)
warning

If your clipboard has only an image (no text), Ctrl+V does nothing in most terminals. Terminals can only paste text โ€” there's no standard mechanism to paste binary image data. Use /paste or Alt+V instead.

Alt+Vโ€‹

Alt key combinations pass through most terminal emulators (they're sent as ESC + key rather than being intercepted). Press Alt+V to check the clipboard for an image.

caution

Does not work in VSCode's integrated terminal. VSCode intercepts many Alt+key combos for its own UI. Use /paste instead.

Ctrl+V (Raw โ€” Linux Only)โ€‹

On Linux desktop terminals (GNOME Terminal, Konsole, Alacritty, etc.), Ctrl+V is not the paste shortcut โ€” Ctrl+Shift+V is. So Ctrl+V sends a raw byte to the application, and Hermes catches it to check the clipboard. This only works on Linux desktop terminals with X11 or Wayland clipboard access.

Platform Compatibilityโ€‹

Environment/pasteCtrl+V text+imageAlt+VNotes
macOS Terminal / iTerm2โœ…โœ…โœ…Best experience โ€” osascript always available
Linux X11 desktopโœ…โœ…โœ…Requires xclip (apt install xclip)
Linux Wayland desktopโœ…โœ…โœ…Requires wl-paste (apt install wl-clipboard)
WSL2 (Windows Terminal)โœ…โœ…ยนโœ…Uses powershell.exe โ€” no extra install needed
VSCode Terminal (local)โœ…โœ…ยนโŒVSCode intercepts Alt+key
VSCode Terminal (SSH)โŒยฒโŒยฒโŒRemote clipboard not accessible
SSH terminal (any)โŒยฒโŒยฒโŒยฒRemote clipboard not accessible

ยน Only when clipboard has both text and an image (image-only clipboard = nothing happens) ยฒ See SSH & Remote Sessions below

Platform-Specific Setupโ€‹

macOSโ€‹

No setup required. Hermes uses osascript (built into macOS) to read the clipboard. For faster performance, optionally install pngpaste:

brew install pngpaste

Linux (X11)โ€‹

Install xclip:

# Ubuntu/Debian
sudo apt install xclip

# Fedora
sudo dnf install xclip

# Arch
sudo pacman -S xclip

Linux (Wayland)โ€‹

Modern Linux desktops (Ubuntu 22.04+, Fedora 34+) often use Wayland by default. Install wl-clipboard:

# Ubuntu/Debian
sudo apt install wl-clipboard

# Fedora
sudo dnf install wl-clipboard

# Arch
sudo pacman -S wl-clipboard
How to check if you're on Wayland
echo $XDG_SESSION_TYPE
# "wayland" = Wayland, "x11" = X11, "tty" = no display server

WSL2โ€‹

No extra setup required. Hermes detects WSL2 automatically (via /proc/version) and uses powershell.exe to access the Windows clipboard through .NET's System.Windows.Forms.Clipboard. This is built into WSL2's Windows interop โ€” powershell.exe is available by default.

The clipboard data is transferred as base64-encoded PNG over stdout, so no file path conversion or temp files are needed.

WSLg Note

If you're running WSLg (WSL2 with GUI support), Hermes tries the PowerShell path first, then falls back to wl-paste. WSLg's clipboard bridge only supports BMP format for images โ€” Hermes auto-converts BMP to PNG using Pillow (if installed) or ImageMagick's convert command.

Verify WSL2 clipboard accessโ€‹

# 1. Check WSL detection
grep -i microsoft /proc/version

# 2. Check PowerShell is accessible
which powershell.exe

# 3. Copy an image, then check
powershell.exe -NoProfile -Command "Add-Type -AssemblyName System.Windows.Forms; [System.Windows.Forms.Clipboard]::ContainsImage()"
# Should print "True"

SSH & Remote Sessionsโ€‹

Clipboard paste does not work over SSH. When you SSH into a remote machine, the Hermes CLI runs on the remote host. All clipboard tools (xclip, wl-paste, powershell.exe, osascript) read the clipboard of the machine they run on โ€” which is the remote server, not your local machine. Your local clipboard is inaccessible from the remote side.

Workarounds for SSHโ€‹

  1. Upload the image file โ€” Save the image locally, upload it to the remote server via scp, VSCode's file explorer (drag-and-drop), or any file transfer method. Then reference it by path. (A /attach <filepath> command is planned for a future release.)

  2. Use a URL โ€” If the image is accessible online, just paste the URL in your message. The agent can use vision_analyze to look at any image URL directly.

  3. X11 forwarding โ€” Connect with ssh -X to forward X11. This lets xclip on the remote machine access your local X11 clipboard. Requires an X server running locally (XQuartz on macOS, built-in on Linux X11 desktops). Slow for large images.

  4. Use a messaging platform โ€” Send images to Hermes via Telegram, Discord, Slack, or WhatsApp. These platforms handle image upload natively and are not affected by clipboard/terminal limitations.

Why Terminals Can't Paste Imagesโ€‹

This is a common source of confusion, so here's the technical explanation:

Terminals are text-based interfaces. When you press Ctrl+V (or Cmd+V), the terminal emulator:

  1. Reads the clipboard for text content
  2. Wraps it in bracketed paste escape sequences
  3. Sends it to the application through the terminal's text stream

If the clipboard contains only an image (no text), the terminal has nothing to send. There is no standard terminal escape sequence for binary image data. The terminal simply does nothing.

This is why Hermes uses a separate clipboard check โ€” instead of receiving image data through the terminal paste event, it calls OS-level tools (osascript, powershell.exe, xclip, wl-paste) directly via subprocess to read the clipboard independently.

Supported Modelsโ€‹

Image paste works with any vision-capable model. The image is sent as a base64-encoded data URL in the OpenAI vision content format:

{
"type": "image_url",
"image_url": {
"url": "data:image/png;base64,..."
}
}

Most modern models support this format, including GPT-4 Vision, Claude (with vision), Gemini, and open-source multimodal models served through OpenRouter.