The Media Pipeline – OpenClaw Global Knowledgebase

Multi-Modal Mastery

OpenClaw isn't limited to text. The Media Pipeline allows the agent to "see" your screen, "hear" your voice memos, and process incoming documents.

How it Works

Ingestion: Media is uploaded via a channel (e.g., an image sent on WhatsApp).
Normalization: The gateway resizes or compresses the media to fit the target model's input limits (e.g., downscaling an image to 512x512 for a vision model that expects that resolution).
Embedding/Analysis: The media is sent to the vision or audio model, and the resulting image description or audio transcription is fed into the main conversation loop.
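
The three stages above can be sketched in a few lines of Python. This is an illustrative sketch only, not OpenClaw's actual code: MediaItem, fit_within, run_pipeline, and the analyze callback are hypothetical names, and the 512-pixel limit is taken from the example above rather than from any real configuration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class MediaItem:
    """Hypothetical record for one piece of ingested media."""
    channel: str      # where the media arrived, e.g. "whatsapp"
    kind: str         # "image", "audio", or "document"
    width: int = 0    # pixel dimensions, only meaningful for images
    height: int = 0


def fit_within(width: int, height: int, max_side: int = 512) -> Tuple[int, int]:
    """Scale dimensions down so the longer side is at most max_side.

    Aspect ratio is preserved, and media that already fits is left alone.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def run_pipeline(item: MediaItem, analyze: Callable[[MediaItem], str]) -> str:
    """Normalize the item, analyze it, and return text for the conversation loop."""
    # Normalization: only images are resized in this sketch.
    if item.kind == "image":
        item.width, item.height = fit_within(item.width, item.height)
    # Embedding/Analysis: the callback stands in for a vision or audio model.
    description = analyze(item)
    # The returned string is what would be fed into the main conversation loop.
    return f"[{item.kind} via {item.channel}] {description}"
```

For example, a 2048x1024 image sent on WhatsApp would be scaled to 512x256 before the analyze callback sees it. Preserving the aspect ratio is a design choice of this sketch; a model that requires exactly square input would need padding or cropping instead.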

Local Processing: If you use a local model with vision support (e.g., LLaVA), the media pipeline runs the analysis on your machine, so files never leave your local workspace.
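
One way to picture the local-processing rule is as a model selection step that prefers local vision models and refuses to fall back to a remote one unless explicitly allowed. The sketch below is an assumption about how such a rule could be written, not OpenClaw's real configuration or API: ModelInfo, pick_vision_model, and the local_only flag are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelInfo:
    """Hypothetical description of one configured model."""
    name: str
    is_local: bool          # True if the model runs on the user's machine
    supports_vision: bool   # True if the model accepts image input


def pick_vision_model(models: List[ModelInfo],
                      local_only: bool = True) -> Optional[ModelInfo]:
    """Return the vision model to use, preferring local ones.

    With local_only=True, a remote model is never chosen, so media
    stays on the machine or is not analyzed at all.
    """
    candidates = [m for m in models if m.supports_vision]
    local = [m for m in candidates if m.is_local]
    if local:
        return local[0]
    if local_only:
        return None  # no local vision model: refuse rather than upload
    return candidates[0] if candidates else None
```

Returning None when no local vision model is available makes the privacy guarantee explicit: the caller has to handle the missing model rather than silently sending the file to a remote service.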