The Media Pipeline

Multi-Modal Mastery

OpenClaw isn't limited to text. The Media Pipeline allows the agent to "see" your screen, "hear" your voice memos, and process incoming documents.

How it Works

  1. Ingestion: Media is uploaded via a channel (e.g., an image sent on WhatsApp).
  2. Normalization: The gateway resizes or compresses the media to meet LLM input requirements (e.g., 512x512 for vision models).
  3. Embedding/Analysis: The media is sent to the vision/audio model, and the resulting description or transcription is fed into the main conversation loop.
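The three stages above can be sketched in Python. This is a hedged illustration only: the class and function names (`MediaItem`, `normalize`, `analyze`) and the 512-pixel cap are placeholders, not OpenClaw's actual API.

```python
from dataclasses import dataclass

MAX_SIDE = 512  # example cap taken from the vision-model input requirement


@dataclass
class MediaItem:
    kind: str              # "image", "audio", or "document"
    width: int = 0
    height: int = 0


def normalize(item: MediaItem) -> MediaItem:
    """Stage 2: downscale so the longest side fits the model's limit,
    preserving aspect ratio. Non-images pass through unchanged."""
    longest = max(item.width, item.height)
    if item.kind != "image" or longest <= MAX_SIDE:
        return item
    scale = MAX_SIDE / longest
    return MediaItem(item.kind, round(item.width * scale), round(item.height * scale))


def analyze(item: MediaItem) -> str:
    """Stage 3 (stubbed): the real pipeline calls the vision/audio model here;
    the returned text is what enters the main conversation loop."""
    return f"[{item.kind} {item.width}x{item.height}]"


photo = MediaItem("image", 2048, 1536)   # Stage 1: ingested from a channel
print(analyze(normalize(photo)))         # prints [image 512x384]
```

Note the aspect ratio is preserved: a 2048x1536 photo becomes 512x384 rather than being squashed to a square.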

Local Processing: If you are using a local model with vision support (e.g., LLaVA), the media pipeline ensures that files never leave your local workspace.
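The local-versus-hosted decision can be pictured as a simple routing step. Everything here is hypothetical: the config key `local_vision_model` and the helper names are illustrative, not OpenClaw's real configuration.

```python
def analyze_locally(path: str) -> str:
    # File is read and described on this machine; bytes never leave it.
    return f"local description of {path}"


def analyze_remotely(path: str) -> str:
    # Bytes are uploaded to a hosted vision/audio provider.
    return f"remote description of {path}"


def route(path: str, config: dict) -> str:
    """Prefer a locally configured vision model (e.g., LLaVA);
    fall back to the hosted provider otherwise."""
    if config.get("local_vision_model"):
        return analyze_locally(path)
    return analyze_remotely(path)


print(route("photo.jpg", {"local_vision_model": "llava"}))  # prints local description of photo.jpg
print(route("photo.jpg", {}))                               # prints remote description of photo.jpg
```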