Your harness works. The loop runs cleanly, tool calls resolve, context stays managed. Then someone asks: "Can it look at a screenshot?" And suddenly the whole thing has to change.

Not the model. Claude already knows how to see. The problem is that the harness was built for strings. A multimodal message is not a string. It's a typed content block. And once you make that shift, everything downstream moves with it: how tool results get injected, how context gets tracked, how you estimate token budgets, how you decide what to evict when things fill up.

This is post 07 in the Agent Harnesses series. Earlier posts cover what a harness is and how the loop works. This one covers what breaks when the model is multimodal and the harness isn't.

The messages array pivot

A text harness accumulates strings. The messages array is a list of role-content pairs where content is a string. That assumption runs deep: tool-call parsing, context truncation, and token counting all treat content as text.

In a multimodal harness, content is an array of typed blocks. Here's what a user message looks like once images enter the picture:

json
// Text-only message
{
  "role": "user",
  "content": "What's in this image?"
}

// Multimodal message - content is now an array of typed blocks
{
  "role": "user",
  "content": [
    {
      "type": "image",
      "source": {
        "type": "base64",
        "media_type": "image/jpeg",
        "data": "/9j/4AAQSkZJRg..."
      }
    },
    {
      "type": "text",
      "text": "What's in this image?"
    }
  ]
}

That shift from string to block array is the thing everything else depends on. Everything that touches the content field (appending to messages, constructing tool results, truncating history) needs to handle both formats. The harness can't assume strings anymore.
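
A small shim keeps the rest of the loop block-aware without a rewrite. A minimal sketch (the function name is mine):

python
def normalize_content(content) -> list:
    # Wrap legacy string content in a text block so downstream code
    # can always iterate over a list of typed blocks.
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content

Run every message through it on append, and truncation, token counting, and tool-result handling only ever see block lists.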

Tool results change too. In a text harness, a tool returns a string and you inject it into a tool_result message. In a multimodal harness, a screenshot tool returns an image block. The injection looks like this:

json
{
  "role": "user",
  "content": [
    {
      "type": "tool_result",
      "tool_use_id": "toolu_01...",
      "content": [
        {
          "type": "image",
          "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": "..."
          }
        }
      ]
    }
  ]
}

You construct this explicitly. There's no shortcut that turns a PNG into a message automatically. The harness owns this.
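
A helper keeps that construction in one place. A sketch, with a function name of my own choosing:

python
def image_tool_result(tool_use_id: str, image_block: dict) -> dict:
    # Wrap a tool's image block in the tool_result envelope shown above.
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": [image_block],
        }]
    }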

Three ways to pass images

Once you're building multimodal messages, you have three options for getting image data into the API call.

Base64 encoding is the most common. Encode the image bytes and embed them inline. Works for any image. The downside is that you resend the full bytes on every API call. In a multi-turn conversation, every turn carries the image again, so request payloads grow with conversation length.

URL reference is simpler for images already hosted somewhere. Pass the URL and Claude fetches it. Works well for publicly accessible assets. Doesn't work for local files or anything behind auth.

The Files API is the right answer when the same image appears across multiple turns or sessions. Upload once, get a file_id, reference by ID. Anthropic stores the content on their side; files can be up to 500 MB each and can be reused without resending bytes. The beta header anthropic-beta: files-api-2025-04-14 is required. For production harnesses, this is usually where you end up.
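
A sketch of the upload-once, reference-many flow for an image (file name hypothetical; requests that reference the file still need the beta flag):

python
import anthropic

client = anthropic.Anthropic()

# Upload once
with open("reference_ui.png", "rb") as f:
    uploaded = client.beta.files.upload(
        file=("reference_ui.png", f, "image/png"),
    )

# Reference by ID on later turns; no bytes resent.
# Pass betas=["files-api-2025-04-14"] on the messages.create call.
image_block = {
    "type": "image",
    "source": {"type": "file", "file_id": uploaded.id},
}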

Images as tool input

The most common pattern is a screenshot tool. The model requests the current screen state, the harness captures it, converts it to base64, and injects an image block into the tool result. The model reasons about what it sees before deciding the next action.

python
import base64
import subprocess

def screenshot_tool() -> dict:
    # macOS: screencapture -x -t png /tmp/screen.png
    # Linux: scrot /tmp/screen.png
    subprocess.run(["screencapture", "-x", "-t", "png", "/tmp/screen.png"])
    with open("/tmp/screen.png", "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": image_data,
        }
    }

The tool returns an image block. The harness puts it inside a tool_result content array and appends it to messages. Not as a string, as a block.

Image-to-code flows (Figma screenshot to React component, whiteboard diagram to architecture code, error screenshot to debug fix) don't need special harness handling. Feed the image as a user message, ask the model to generate code, get text back. The harness just passes the image block through. You only need custom handling when image capture is itself a tool call within the loop.
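
A sketch of that pass-through, reusing the read_image_tool utility defined just below (prompt and model settings are illustrative):

python
def image_to_code(client, image_path: str) -> str:
    # One-shot: image block in as a user message, generated code out as text.
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                read_image_tool(image_path),
                {"type": "text", "text": "Generate a React component matching this mockup."},
            ],
        }],
    )
    return response.content[0].text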

For reading local files, a small utility covers most cases:

python
import base64
import mimetypes

def read_image_tool(file_path: str) -> dict:
    media_type = mimetypes.guess_type(file_path)[0] or "image/jpeg"
    with open(file_path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": data
        }
    }

Supported formats: JPEG, PNG, GIF, WebP. Max 8000x8000 pixels per image. Max 30 MB. If you're passing more than 20 images in a single request, that limit drops to 2000x2000.

Image token costs

Images bill at the same per-token rate as text. No image surcharge. But a 1000x1000 image costs roughly 1,334 tokens. That's not trivial when you're thinking about context budgets.

PDF pages processed visually run 1,500 to 3,000 tokens per page. A 10-page PDF can consume 30,000 tokens before the model writes a word. Those tokens aren't just cost. They're context window. You're trading space for visual information, and that space doesn't come back once it's used.

The harness needs to account for this explicitly. Text token counters ignore image costs. A rough approximation for Claude's image token cost:

python
def estimate_image_tokens(width: int, height: int) -> int:
    # Anthropic's documented rule of thumb: tokens ~ (width * height) / 750.
    # Images over 1568 px on the long edge are scaled down before counting.
    return (width * height) // 750

Add this to your context budget tracking. If you don't, you'll hit limits in ways that are hard to debug.

Images as tool output

The model can also call a tool that generates an image. A generate_image tool calls DALL-E or Stable Diffusion, gets back image bytes, and the harness injects the result as an image block. On the next turn, the model sees the generated image and can reason about it.

python
import openai

def generate_image_tool(prompt: str) -> dict:
    client = openai.OpenAI()
    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        response_format="b64_json"
    )
    image_data = response.data[0].b64_json
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": image_data
        }
    }

For self-hosted image generation (privacy, cost control), Stable Diffusion via the AUTOMATIC1111 API follows the same pattern. The tool calls the local API, gets base64 bytes, returns an image block.

The problem with image-out workflows is accumulation. A generate-inspect-critique-regenerate loop adds a full image-sized chunk of tokens on every iteration. Without eviction, you'll hit context limits after a handful of rounds. More on this in the context management section.

PDFs

When you send Claude a PDF (via Files API or as a base64 document), it converts each page into an image and extracts text. The model sees both the text and the visual layout of each page. That's why Claude performs better than a text-only extractor on layout-sensitive documents like invoices: field positions matter and the model can see them.

The token cost is real. Each page runs 1,500 to 3,000 tokens. A 50-page PDF is 75,000 to 150,000 tokens for ingestion alone. The limits to know: max 30 MB per PDF and max 100 pages for visual analysis; past that, you're down to text extraction only.

For repeated queries against the same PDF, Files API is the right move:

python
import anthropic

client = anthropic.Anthropic()

# Upload once
with open("document.pdf", "rb") as f:
    response = client.beta.files.upload(
        file=("document.pdf", f, "application/pdf"),
    )
file_id = response.id

# Reference many times without resending bytes
message = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "file", "file_id": file_id}
            },
            {"type": "text", "text": "Summarize the key findings."}
        ]
    }],
    betas=["files-api-2025-04-14"]
)

For large documents, sending the full PDF is often the wrong call. A 200-page technical manual exceeds any context window if sent whole. Four strategies that work in practice:

A page-by-page tool works well. Expose a read_pdf_page(file, page_num) tool and let the model request only what it needs. It does better at this than you'd expect. It reads the table of contents on page one and jumps to relevant sections rather than reading linearly.
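
A sketch of that tool using PyMuPDF, rendering the page as an image so layout survives (the dpi value is a tuning choice of mine):

python
import base64
import fitz  # PyMuPDF

def read_pdf_page(file_path: str, page_num: int) -> dict:
    doc = fitz.open(file_path)
    page = doc[page_num - 1]  # the tool exposes 1-indexed pages
    pix = page.get_pixmap(dpi=150)  # higher dpi: sharper text, more tokens
    data = base64.standard_b64encode(pix.tobytes("png")).decode("utf-8")
    doc.close()
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": "image/png", "data": data},
    }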

The pdf-mcp approach wraps the PDF with chunked reading, hybrid search, and SQLite caching inside an MCP server. The model queries semantically and gets relevant chunks back. Good for documents queried repeatedly across sessions.

Pre-extraction with PyMuPDF4LLM converts the PDF to Markdown; you chunk by section headers, embed the chunks, and retrieve top-k for context. This is the right approach for RAG pipelines where the document is queried by many different agents or users. For complex multi-column layouts, MinerU handles them better, though it's slower.

Prompt caching is worth adding when the same PDF gets queried multiple times in a session. Anthropic's caching cuts token cost by 90% on cache hits. It requires the prompt prefix to stay stable across turns.
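
In practice that's one extra field on the document block from the Files API example above:

python
content = [
    {
        "type": "document",
        "source": {"type": "file", "file_id": file_id},
        # Cache the document prefix; later turns hit the cache
        # instead of re-ingesting the pages.
        "cache_control": {"type": "ephemeral"},
    },
    {"type": "text", "text": "What are the payment terms?"},
]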

Audio

Claude doesn't accept audio input natively. For audio in a harness, you have two patterns, and they make different tradeoffs.

Cascading: audio in, Whisper for speech-to-text, transcript string into the harness loop, text response out, ElevenLabs or Cartesia for text-to-speech, audio out. The harness stays text-native. Audio is only at the edges. You can inspect the transcript at any step, which makes debugging clean. The tradeoff is latency: you're chaining three models in sequence.

python
import openai

client = openai.OpenAI()

def transcribe_audio_tool(audio_file_path: str) -> str:
    with open(audio_file_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="text"
        )
    return transcript  # plain string, drops into messages array as text

The transcript is a string. It drops directly into the messages array as text. No special content block handling needed on the output side.

The second pattern is native speech-to-speech: OpenAI Realtime API or Gemini Live API. Audio in, model processes audio natively, audio out. Lower latency, preserved emotional prosody, harder to debug, and tool-call accuracy runs slightly lower. Fast synthesis models narrow the cascade's latency gap: ElevenLabs Flash v2.5 advertises roughly 75ms of synthesis latency, and co-locating transcription, reasoning, and synthesis keeps the rest of the chain tight.

For most agent harnesses, cascading is the right call. The latency hit is real but manageable, and you get a clean transcript and full control over the reasoning chain. Native speech-to-speech makes more sense when the product is the conversation itself, not the task completion underneath it.
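
For completeness, the output leg of the cascade. This sketch uses OpenAI's TTS endpoint (reusing the client from above) rather than ElevenLabs, purely to stay in one SDK; the voice and file path are arbitrary:

python
def speak_tool(text: str, out_path: str = "/tmp/reply.mp3") -> str:
    # Text in, audio file out. The harness plays the file;
    # the model never touches audio directly.
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=text,
    )
    with open(out_path, "wb") as f:
        f.write(response.content)
    return out_path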

Video

Video is frames over time. No current model accepts raw video as a stream. Every video workflow comes down to the same thing: extract frames, pass frames as images.

The math is sobering. A 60-second video at 30 fps is 1,800 frames. At roughly 1,334 tokens per frame (1000x1000), that's 2.4 million tokens. No context window handles that. You have to choose which frames to send.

The practical limit for standard context windows: 3 to 5 frames per query. Extended context models (200k+) can handle more, but not dramatically more at image token rates.

python
import subprocess, os

def extract_video_frames_tool(video_path: str, fps: float = 0.5) -> list[str]:
    """Extract frames at given FPS. Returns list of image file paths."""
    os.makedirs("/tmp/frames", exist_ok=True)
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path,
        "-vf", f"fps={fps}",
        "-q:v", "2",
        "/tmp/frames/frame_%04d.jpg",
    ], check=True)
    frames = sorted([
        f"/tmp/frames/{f}" for f in os.listdir("/tmp/frames")
        if f.endswith(".jpg")
    ])
    return frames

0.5 fps on a 60-second clip gives you 30 frames, roughly 40,000 tokens. That's a meaningful chunk of context for a single tool result. The harness controls how many frames get injected. This is a design decision that belongs in your tool implementation, not something to leave the model to figure out.
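
The cap itself can live in a small helper that subsamples evenly, so the clip's full span stays represented (the limit is a design choice, not an API constraint):

python
def frames_to_blocks(frame_paths: list[str], max_frames: int = 5) -> list[dict]:
    # Subsample evenly across the clip rather than taking the first N.
    if len(frame_paths) > max_frames:
        step = len(frame_paths) / max_frames
        frame_paths = [frame_paths[int(i * step)] for i in range(max_frames)]
    # read_image_tool is the base64 utility from earlier.
    return [read_image_tool(p) for p in frame_paths]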

For video where specific moments matter, scene detection (PySceneDetect or ffmpeg's scene filter) extracts frames only when the scene changes. That reduces frame count significantly on structured content like screen recordings and presentations.
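
With ffmpeg alone, that's one filter. A sketch (the 0.3 threshold is a common starting point, not a magic number):

python
def extract_scene_changes(video_path: str, threshold: float = 0.3) -> None:
    # select='gt(scene,t)' keeps only frames that differ from the previous
    # frame by more than the threshold; -vsync vfr drops everything else.
    os.makedirs("/tmp/frames", exist_ok=True)
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path,
        "-vf", f"select='gt(scene,{threshold})'",
        "-vsync", "vfr",
        "/tmp/frames/scene_%04d.jpg",
    ], check=True)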

Query-aware keyframe selection is worth knowing about. FOCUS selects frames relevant to the specific question being asked and reaches competitive accuracy with 40% of frames retained. Token compression approaches (STORM, FlashVLM) are moving through research toward production. For 2026, frame extraction plus scene detection is still the practical approach for most teams.

Long-form video (meeting recordings, lecture captures) needs an ingestion pipeline, not direct model feeding. The MMCTAgent pattern from Microsoft Research is a reasonable reference: transcribe the audio track, identify keyframes, chunk into semantic chapters, then run a multi-agent planner-critic loop for question answering. The model never sees raw video. It sees structured outputs from the pipeline.

Computer use

Computer use is a perception-action loop over a GUI. The model sees the screen, decides what to do, calls an action tool, the harness executes it, captures the new screen state, and the loop continues.

To be clear about one thing: Claude cannot execute actions itself. The harness implements every click, keystroke, and scroll. Claude decides what to do. The harness does it.

The beta header for current models (Opus 4.6, Sonnet 4.6) is computer-use-2025-11-24; the older header for Sonnet 3.7 and earlier is computer-use-2025-01-24. The spec defines three tools: computer for screenshot, click, type, key, and scroll; text_editor for file operations; and bash for shell commands. The latter two are optional.

python
tools = [
    {
        "type": "computer_20251022",
        "name": "computer",
        "display_width_px": 1920,
        "display_height_px": 1080,
        "display_number": 1
    }
    # text_editor and bash are optional
]

A minimal computer use loop:

python
import anthropic, base64, subprocess

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Open Firefox and go to example.com"}]

while True:
    response = client.beta.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        tools=tools,
        messages=messages,
        betas=["computer-use-2025-11-24"]
    )

    if response.stop_reason == "end_turn":
        break

    tool_results = []
    for block in response.content:
        if block.type == "tool_use" and block.name == "computer":
            action = block.input["action"]
            if action == "screenshot":
                subprocess.run(["screencapture", "-x", "/tmp/screen.png"])
                with open("/tmp/screen.png", "rb") as f:
                    data = base64.b64encode(f.read()).decode()
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [{
                        "type": "image",
                        "source": {"type": "base64", "media_type": "image/png", "data": data}
                    }]
                })
            elif action == "left_click":
                coords = block.input["coordinate"]
                subprocess.run(["cliclick", f"c:{coords[0]},{coords[1]}"])
                tool_results.append({"type": "tool_result", "tool_use_id": block.id, "content": "clicked"})
            # handle other actions (type, key, scroll) here

    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})

Anthropic publishes a reference Docker container with example tool implementations. That's the fastest way to get a working environment without building the system interaction layer from scratch.

If you're evaluating alternatives, the two worth knowing about in 2026 are OmniParser (Microsoft) and UI-TARS (ByteDance).

OmniParser converts screenshots into structured UI element descriptions before passing to the model. It uses YOLO for icon detection and Florence2 for recognition. Achieves 39.5% on the ScreenSpot Pro benchmark. Useful when you want structured grounding data alongside the raw image.

UI-TARS is a different approach entirely. It's an end-to-end vision-language model trained specifically for GUI interaction, not a pipeline wrapping a general model. UI-TARS-1.5 scores 61.6% on ScreenSpot Pro. Claude's computer use scores 27.7% on the same benchmark. That gap is real and worth noting if raw GUI navigation accuracy is your core requirement. UI-TARS-2 adds multi-turn RL with a data flywheel, so that gap may widen before it narrows.

For most product use cases, Claude's computer use integrates more cleanly into a harness that's already using the Anthropic stack. If GUI accuracy is the main thing you're selling, UI-TARS is worth a close look.

Context management

Text can be compressed. Summarize 10,000 words into 500 and the information survives. Images can't. A screenshot is always a screenshot-sized chunk of tokens. You can't compress a 1,334-token image down to 200 tokens without destroying the visual data.

This is the constraint that matters most in multimodal context management. Image-heavy harnesses hit limits faster and can't compress their way out the way text harnesses can.

A few things that actually work in production:

Keep a sliding window of the last 3 to 5 screenshots and drop the oldest each time a new one comes in. The model rarely needs to reference screen state from more than a few turns back.

Once a screenshot has been used for a click or decision, replace it in the messages array with a short text note: "Previous screen: Firefox showing the example.com login page." The visual data is gone but the reference survives. For most multi-turn tasks that's fine, and it saves the full token cost going forward.

Before injecting consecutive screenshots, check if they're nearly identical. A pixel diff or image hash catches most cases. If there's nothing new on the screen, there's no reason to pay the token cost.

Prompt caching is worth adding for images that stay stable throughout the session: a UI style guide, a reference diagram. Anthropic's caching cuts the cost by 90% on cache hits, but the prompt prefix has to stay stable across turns for it to work.

Resize before encoding. 1280x800 is plenty for most UI navigation tasks. Fewer pixels means fewer tokens.
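
A sketch combining the sliding window with the replace-with-a-note strategy from above (it assumes the block shapes used throughout this post; production code also needs to walk image blocks nested inside tool_result content):

python
def evict_old_screenshots(messages: list, keep_last: int = 3) -> None:
    # Walk newest to oldest; past the keep window, swap each image block
    # for a short text note so the reference survives but the tokens don't.
    kept = 0
    for msg in reversed(messages):
        if not isinstance(msg.get("content"), list):
            continue
        for i in reversed(range(len(msg["content"]))):
            block = msg["content"][i]
            if isinstance(block, dict) and block.get("type") == "image":
                kept += 1
                if kept > keep_last:
                    msg["content"][i] = {
                        "type": "text",
                        "text": "[screenshot evicted: earlier screen state]",
                    }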

For budget tracking across a session, a simple allocation policy works well: reserve 20% of context for model output, 30% for text tool results, and cap the remaining 50% on images with a hard limit on image count. Track the count explicitly in your messages loop and evict oldest images when the budget is exceeded.
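
A minimal tracker for that policy, using the estimate_image_tokens function from earlier (the 50% split and the image cap are the policy knobs):

python
class ContextBudget:
    def __init__(self, window: int = 200_000, max_images: int = 10):
        self.image_budget = int(window * 0.5)  # images capped at 50%
        self.max_images = max_images
        self.image_tokens = 0
        self.image_count = 0

    def admit_image(self, width: int, height: int) -> bool:
        # Returns False when the caller should evict oldest images first.
        cost = estimate_image_tokens(width, height)
        if self.image_count >= self.max_images:
            return False
        if self.image_tokens + cost > self.image_budget:
            return False
        self.image_tokens += cost
        self.image_count += 1
        return True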

💡 Build image token estimation into your context tracker separately from your text token counter. If you don't, you'll hit limits in ways that are confusing to debug because the text counter will look fine while the actual context is full.

Tool design for multimodal inputs

Tool descriptions need to be explicit about what type of data to pass. "Takes a screenshot and analyzes it" is not enough. The model reads descriptions to understand how to call the tool. Give it something concrete:

json
{
  "name": "capture_screen",
  "description": "Captures the current screen state. Returns an image content block of the screen. Call this when you need to see the current UI state before taking action.",
  "input_schema": {
    "type": "object",
    "properties": {},
    "required": []
  }
}

For tools that accept binary inputs, use file paths or file IDs in the schema. Never raw bytes. JSON can't carry binary data. The harness resolves paths to actual data internally.

json
{
  "name": "read_pdf",
  "description": "Reads a PDF file and extracts text and structure by page. Use when you need to read a document.",
  "input_schema": {
    "type": "object",
    "properties": {
      "file_path": {
        "type": "string",
        "description": "Absolute path to the PDF file"
      },
      "pages": {
        "type": "string",
        "description": "Page range to read, e.g. '1-5' or '3'. Omit for all pages."
      }
    },
    "required": ["file_path"]
  }
}

When a tool processes visual input, you have a choice: return the image or return a text description. Return the image when the model needs spatial reasoning (click coordinates, UI navigation) or when visual structure matters (charts, diagrams, handwriting). Return text when you're summarizing processed content or when the downstream task is text-based and the image itself isn't needed anymore.

A hybrid that returns both works well when you're not sure which the model will need:

python
def analyze_image_tool(file_path: str) -> list:
    # vision_model.describe stands in for whatever captioning call you use;
    # read_image_as_block is the base64 reader utility from earlier.
    description = vision_model.describe(file_path)
    image_block = read_image_as_block(file_path)
    return [
        image_block,
        {"type": "text", "text": f"Description: {description}"}
    ]

The cleanest way to handle all of this is through a protocol. Each modality (image, audio, PDF, URL) implements a shared ModalityInput protocol with a to_content_block() method. The harness loop doesn't care what type of input it's processing. It calls to_content_block() and appends the result. Adding a new modality is adding a new implementation, not changing the loop.
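
A sketch of that protocol (class names are mine; read_image_tool is the utility from earlier):

python
from typing import Protocol

class ModalityInput(Protocol):
    def to_content_block(self) -> dict: ...

class ImageFileInput:
    def __init__(self, path: str):
        self.path = path

    def to_content_block(self) -> dict:
        return read_image_tool(self.path)

class PDFFileInput:
    def __init__(self, file_id: str):
        self.file_id = file_id

    def to_content_block(self) -> dict:
        return {"type": "document", "source": {"type": "file", "file_id": self.file_id}}

# The loop stays modality-blind:
# content = [item.to_content_block() for item in inputs]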

Framework support in 2026

| Framework | Images | PDF | Audio | Video | Computer use |
| --- | --- | --- | --- | --- | --- |
| Claude (Anthropic) | Native (base64, URL, Files API) | Native, up to 100 pages visual | Not native; needs Whisper | Frame extraction only | Beta API |
| OpenAI (GPT-5.5, o3) | Native (image_url blocks) | Assistants API only | Native Realtime API | Frame extraction only | Responses API (GPT-5.4) |
| Google ADK (Gemini 3.1 Live) | Native | Native | Native bidirectional streaming | JPEG frames via send_realtime() | Not native |
| LangChain | Via MultiModalPromptTemplate | Document loaders | Integration layer | Video API integrations | Via tool plugins |
| LlamaIndex | Via document nodes | Strong; leading document processing | Limited | Limited | Limited |
| Pydantic-AI | ImageUrl, BinaryContent | DocumentUrl | Not native | Frame extraction only | Not native |

Google ADK has the most complete real-time multimodal streaming of any major framework. Over 6 trillion tokens per month processed through Gemini via ADK. If you need bidirectional audio streaming without building it yourself, ADK is ahead. The tradeoff is a strong Google Cloud dependency and more complexity to self-host.

Pydantic-AI is the cleanest option for provider-agnostic multimodal. Tools return BinaryContent and ImageUrl types directly. The framework injects them into the messages array. Version 0.0.38+ covers full image and document handling across Anthropic, OpenAI, and Gemini.

LlamaIndex is worth calling out specifically for document work. Its PDF processing pipeline is deeper than what you'd build from scratch: OCR, layout analysis, table extraction, and tight integration with PyMuPDF4LLM. If the agent's primary job is document intelligence, LlamaIndex's tooling in this area is ahead of the others.

What's still missing

No framework fully solves multimodal context management yet. A few gaps that every production harness runs into:

No framework tells you the exact token cost of an image before you send it. Anthropic's count_tokens endpoint will give you an exact number for Claude, at the cost of an extra API round trip; everywhere else you estimate. The estimates are close but not exact, and exact matters when you're managing context tightly.

Context management is still entirely manual. There's no built-in image-aware sliding window. You write the eviction logic, set the policy, handle the edge cases.

Cross-modal memory is unsolved. Remembering what was seen in an image for later reference without keeping the image in context requires summarizing and discarding the visual data. If the model needs to revisit a spatial detail from turn 10 at turn 40, you either keep the image in context the whole time or re-capture it.

Video is still not a first-class input type in any framework. All of them require frame extraction. None treat video as a stream the model can reason over continuously.

And returning audio from a tool (generated speech, for example) and having the harness play it is custom plumbing in every stack. There's no standard pattern for this yet.

These aren't complaints. They're just an honest read of where things are in mid-2026. The models have outpaced the harnesses. The harnesses are catching up.

The harness has to grow to meet the model

When Claude can see, the harness has to move images. When Claude can click, the harness has to execute clicks. When Claude can read a PDF, the harness has to understand that a 50-page document costs 150,000 tokens before any reasoning starts, and plan accordingly.

The model's capabilities aren't the bottleneck anymore. The harness is.

Next in the series: Harness Failure Modes, covering what actually breaks when the harness is poorly designed and how to catch it before it hits production.


Rahul Kashyap is CTO & Co-founder at Designare Solutions and DeepStory, based in Bangalore.