Multi-Modal Mastery
OpenClaw isn't limited to text. The Media Pipeline allows the agent to "see" your screen, "hear" your voice memos, and process incoming documents.
How it Works
- Ingestion: Media is uploaded via a channel (e.g., an image sent on WhatsApp).
- Normalization: The gateway resizes or compresses the media to meet the target model's input requirements (e.g., downscaling an image to 512x512 for a vision model that expects that size).
- Embedding/Analysis: The media is sent to the vision/audio model, and the resulting description or transcription is fed into the main conversation loop.
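The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not OpenClaw's actual implementation: the names `Media`, `fit_within`, `normalize`, and `run_pipeline` are hypothetical, the 512-pixel limit is taken from the example above, and the `analyze` callback stands in for whatever vision or audio model is configured. The sketch only computes target dimensions and does not resample pixels.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Media:
    """Hypothetical container for an ingested file (not an OpenClaw type)."""
    kind: str            # "image" or "audio"
    width: int = 0       # pixel dimensions, only meaningful for images
    height: int = 0
    payload: bytes = b""


def fit_within(width: int, height: int, max_side: int = 512) -> tuple[int, int]:
    """Scale dimensions down so the longer side is at most max_side.

    The aspect ratio is preserved, and images that already fit are unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def normalize(media: Media, max_side: int = 512) -> Media:
    """Normalization stage: compute the size an image would be resized to."""
    if media.kind != "image":
        return media  # audio and documents pass through in this sketch
    w, h = fit_within(media.width, media.height, max_side)
    return Media(media.kind, w, h, media.payload)


def run_pipeline(
    media: Media,
    analyze: Callable[[Media], str],
    conversation: list[str],
) -> None:
    """Ingested media -> normalization -> analysis -> conversation loop."""
    normalized = normalize(media)
    description = analyze(normalized)  # stand-in for the vision/audio model
    conversation.append(f"[{media.kind}] {description}")
```

With a stub analyzer, a 2048x1024 image is reduced to 512x256 before analysis, and the resulting description is appended to the conversation as plain text. This is the point where non-text input becomes something the main loop can reason about.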
Local Processing: If you use a local model with vision support (e.g., LLaVA), the media pipeline ensures that files never leave your local workspace.
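One way to picture the local-processing guarantee is as a routing rule: when a local-only policy is active, media is handed only to backends that run on your machine. The sketch below assumes a hypothetical `Backend` record and `pick_vision_backend` helper; neither is taken from OpenClaw's API, and the real configuration mechanism may look quite different.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    """Hypothetical description of a configured model backend."""
    name: str
    is_local: bool
    supports_vision: bool


def pick_vision_backend(backends: list[Backend], local_only: bool) -> Backend:
    """Return the first vision-capable backend allowed by the policy.

    With local_only=True, remote backends are never selected, so media
    is not sent off the machine. If nothing qualifies, fail loudly
    instead of silently falling back to a remote service.
    """
    for backend in backends:
        if backend.supports_vision and (backend.is_local or not local_only):
            return backend
    raise LookupError("no vision backend satisfies the current policy")


# Example configuration (names are illustrative only).
configured = [
    Backend("hosted-vision", is_local=False, supports_vision=True),
    Backend("llava", is_local=True, supports_vision=True),
]
```

Raising an error when no local backend qualifies is a deliberate choice in this sketch: a silent fallback to a hosted model would defeat the purpose of keeping files in the local workspace.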