Your harness works. The loop runs cleanly, tool calls resolve, context stays managed. Then someone asks: "Can it look at a screenshot?" And suddenly the whole thing has to change.
Not the model. Claude already knows how to see. The problem is that the harness was built for strings. A multimodal message is not a string. It's a typed content block. And once you make that shift, everything downstream moves with it: how tool results get injected, how context gets tracked, how you estimate token budgets, how you decide what to evict when things fill up.
This is post 07 in the Agent Harnesses series. Earlier posts cover what a harness is and how the loop works. This one covers what breaks when the model is multimodal and the harness isn't.
The messages array pivot
A text harness accumulates strings. The messages array is a list of role-content pairs where content is a string. That assumption runs deep: tool-call parsing, context truncation, and token counting all treat content as text.
In a multimodal harness, content is an array of typed blocks. Here's what a user message looks like once images enter the picture:
```json
// Text-only message
{
  "role": "user",
  "content": "What's in this image?"
}

// Multimodal message - content is now an array of typed blocks
{
  "role": "user",
  "content": [
    {
      "type": "image",
      "source": {
        "type": "base64",
        "media_type": "image/jpeg",
        "data": "/9j/4AAQSkZJRg..."
      }
    },
    {
      "type": "text",
      "text": "What's in this image?"
    }
  ]
}
```
That shift from string to block array is the thing everything else depends on. Everything that touches the content field (appending to messages, constructing tool results, truncating history) needs to handle both formats. The harness can't assume strings anymore.
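One way to keep downstream code simple is to normalize at the boundary: a small helper (hypothetical, not from any SDK) that coerces any content value into the block-array form before anything else touches it.

```python
def normalize_content(content):
    """Coerce a message's content field into the block-array form.

    Strings become a single text block; block lists pass through
    unchanged, so downstream code only ever sees one shape.
    """
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content
```

With this in place, truncation and token-counting code can iterate over blocks unconditionally instead of branching on `isinstance(content, str)` everywhere.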
Tool results change too. In a text harness, a tool returns a string and you inject it into a tool_result message. In a multimodal harness, a screenshot tool returns an image block. The injection looks like this:
```json
{
  "role": "user",
  "content": [
    {
      "type": "tool_result",
      "tool_use_id": "toolu_01...",
      "content": [
        {
          "type": "image",
          "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": "..."
          }
        }
      ]
    }
  ]
}
```
You construct this explicitly. There's no shortcut that turns a PNG into a message automatically. The harness owns this.
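As a sketch of what "the harness owns this" means in practice, here is a hypothetical helper (the name and defaults are illustrative) that wraps a local image file in that tool_result shape:

```python
import base64

def image_tool_result_message(tool_use_id: str, image_path: str,
                              media_type: str = "image/png") -> dict:
    """Wrap a local image file in the nested tool_result message shape."""
    with open(image_path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": [{
                "type": "image",
                "source": {"type": "base64",
                           "media_type": media_type,
                           "data": data},
            }],
        }],
    }
```

Note the three levels of nesting: a user message, containing a tool_result block, containing an image block. Getting this shape wrong is one of the most common multimodal harness bugs.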
Three ways to pass images
Once you're building multimodal messages, you have three options for getting image data into the API call.
Base64 encoding is the most common. Encode the image bytes and embed them inline. Works for any image. The downside is that you resend the full bytes on every API call. In a multi-turn conversation, every turn carries the image again. At scale, that doubles your payload size per turn.
URL reference is simpler for images already hosted somewhere. Pass the URL and Claude fetches it. Works well for publicly accessible assets. Doesn't work for local files or anything behind auth.
The Files API is the right answer when the same image appears across multiple turns or sessions. Upload once, get a file_id, reference it by ID. Anthropic stores the content on their side. The beta header anthropic-beta: files-api-2025-04-14 is required. Files can be up to 500 MB each and can be reused without resending bytes. For production harnesses, this is usually where you end up.
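The upload-once pattern can be sketched as a small cache that maps local paths to file_ids, so each file is uploaded at most once per session. The class and the injected `upload_fn` are illustrative, not part of any SDK; in practice `upload_fn` would wrap the Files API upload call.

```python
class FileUploadCache:
    """Upload each local file once; reuse the returned file_id afterwards.

    `upload_fn(path)` performs the actual upload (e.g. a wrapper around
    the Files API) and must return an object with an `.id` attribute.
    """
    def __init__(self, upload_fn):
        self.upload_fn = upload_fn
        self._ids: dict[str, str] = {}

    def file_id(self, path: str) -> str:
        if path not in self._ids:
            self._ids[path] = self.upload_fn(path).id
        return self._ids[path]

    def image_block(self, path: str) -> dict:
        """Build an image block that references the upload by ID."""
        return {"type": "image",
                "source": {"type": "file", "file_id": self.file_id(path)}}
```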
Images as tool input
The most common pattern is a screenshot tool. The model requests the current screen state, the harness captures it, converts it to base64, and injects an image block into the tool result. The model reasons about what it sees before deciding the next action.
```python
import base64
import subprocess

def screenshot_tool() -> dict:
    # macOS: screencapture -x -t png /tmp/screen.png
    # Linux: scrot /tmp/screen.png
    subprocess.run(["screencapture", "-x", "-t", "png", "/tmp/screen.png"])
    with open("/tmp/screen.png", "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": image_data,
        }
    }
```
The tool returns an image block. The harness puts it inside a tool_result content array and appends it to messages. Not as a string, as a block.
Image-to-code flows (Figma screenshot to React component, whiteboard diagram to architecture code, error screenshot to debug fix) don't need special harness handling. Feed the image as a user message, ask the model to generate code, get text back. The harness just passes the image block through. You only need custom handling when image capture is itself a tool call within the loop.
For reading local files, a small utility covers most cases:
```python
import base64
import mimetypes

def read_image_tool(file_path: str) -> dict:
    media_type = mimetypes.guess_type(file_path)[0] or "image/jpeg"
    with open(file_path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": data
        }
    }
```
Supported formats: JPEG, PNG, GIF, WebP. Max 8000x8000 pixels per image. Max 30 MB. If you're passing more than 20 images in a single request, that limit drops to 2000x2000.
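Checking these limits before the API call keeps failures local instead of surfacing as opaque API errors. A minimal validator encoding the limits above (function name and error messages are illustrative):

```python
SUPPORTED = {"image/jpeg", "image/png", "image/gif", "image/webp"}

def validate_image(media_type: str, width: int, height: int,
                   image_count: int = 1) -> list[str]:
    """Return a list of limit violations (empty list means the image is fine).

    Encodes the stated limits: supported formats, 8000x8000 per image,
    and 2000x2000 when a request carries more than 20 images.
    """
    errors = []
    if media_type not in SUPPORTED:
        errors.append(f"unsupported format: {media_type}")
    max_dim = 2000 if image_count > 20 else 8000
    if width > max_dim or height > max_dim:
        errors.append(f"dimensions {width}x{height} exceed {max_dim}x{max_dim}")
    return errors
```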
Image token costs
Images bill at the same per-token rate as text. No image surcharge. But a 1000x1000 image costs roughly 1,334 tokens. That's not trivial when you're thinking about context budgets.
PDF pages processed visually run 1,500 to 3,000 tokens per page. A 10-page PDF can consume 30,000 tokens before the model writes a word. Those tokens aren't just cost. They're context window. You're trading space for visual information, and that space doesn't come back once it's used.
The harness needs to account for this explicitly. Text token counters ignore image costs. A rough approximation for Claude's image token cost:
```python
import math

def estimate_image_tokens(width: int, height: int) -> int:
    # Anthropic's published approximation: tokens ~ (width * height) / 750,
    # which puts a 1000x1000 image at roughly 1,334 tokens
    return math.ceil((width * height) / 750)
```
Add this to your context budget tracking. If you don't, you'll hit limits in ways that are hard to debug.
Images as tool output
The model can also call a tool that generates an image. A generate_image tool calls DALL-E or Stable Diffusion, gets back image bytes, and the harness injects the result as an image block. On the next turn, the model sees the generated image and can reason about it.
```python
import openai

def generate_image_tool(prompt: str) -> dict:
    client = openai.OpenAI()
    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        response_format="b64_json"
    )
    image_data = response.data[0].b64_json
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": image_data
        }
    }
```
For self-hosted image generation (privacy, cost control), Stable Diffusion via the AUTOMATIC1111 API follows the same pattern. The tool calls the local API, gets base64 bytes, returns an image block.
The problem with image-out workflows is accumulation. A generate-inspect-critique-regenerate loop adds a full image-sized chunk of tokens on every iteration. Without eviction, you'll hit context limits after a handful of rounds. More on this in the context management section.
PDFs
When you send Claude a PDF (via Files API or as a base64 document), it converts each page into an image and extracts text. The model sees both the text and the visual layout of each page. That's why Claude performs better than a text-only extractor on layout-sensitive documents like invoices: field positions matter and the model can see them.
The token cost is real. Each page runs 1,500 to 3,000 tokens. A 50-page PDF is 75,000 to 150,000 tokens for ingestion alone. The limits to know: max 30 MB per PDF, max 100 pages for visual analysis. Over 1,000 pages, visual analysis is dropped and you get text only.
For repeated queries against the same PDF, Files API is the right move:
```python
import anthropic

client = anthropic.Anthropic()

# Upload once
with open("document.pdf", "rb") as f:
    response = client.beta.files.upload(
        file=("document.pdf", f, "application/pdf"),
    )
file_id = response.id

# Reference many times without resending bytes
message = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "file", "file_id": file_id}
            },
            {"type": "text", "text": "Summarize the key findings."}
        ]
    }],
    betas=["files-api-2025-04-14"]
)
```
For large documents, sending the full PDF is often the wrong call. A 200-page technical manual exceeds any context window if sent whole. Four strategies that work in practice:
A page-by-page tool works well. Expose a read_pdf_page(file, page_num) tool and let the model request only what it needs. It does better at this than you'd expect. It reads the table of contents on page one and jumps to relevant sections rather than reading linearly.
The pdf-mcp approach wraps the PDF with chunked reading, hybrid search, and SQLite caching inside an MCP server. The model queries semantically and gets relevant chunks back. Good for documents queried repeatedly across sessions.
Pre-extraction with PyMuPDF4LLM converts the PDF to Markdown, you chunk by section headers, embed the chunks, and retrieve top-k for context. This is the right approach for RAG pipelines where the document is queried by many different agents or users. For complex multi-column layouts, MinerU handles them better, though it's slower.
Prompt caching is worth adding when the same PDF gets queried multiple times in a session. Anthropic's caching cuts token cost by 90% on cache hits. It requires the prompt prefix to stay stable across turns.
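The caching marker itself is just a field on the content block. A sketch of what that looks like on a document block, assuming the cache_control form Anthropic's prompt caching uses (the helper name is illustrative):

```python
def cached_pdf_question(file_id: str, question: str) -> list[dict]:
    """Build message content where the PDF block is marked as a cache prefix.

    The cache_control marker asks the API to cache everything up to and
    including the document block; that prefix must stay byte-identical
    across turns or the cache misses and you pay full price again.
    """
    return [
        {
            "type": "document",
            "source": {"type": "file", "file_id": file_id},
            "cache_control": {"type": "ephemeral"},
        },
        {"type": "text", "text": question},
    ]
```

The question block sits after the cached prefix, so it can change every turn without invalidating the cache.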
Audio
Claude doesn't accept audio input natively. For audio in a harness, you have two patterns, and they make different tradeoffs.
Cascading: audio in, Whisper for speech-to-text, transcript string into the harness loop, text response out, ElevenLabs or Cartesia for text-to-speech, audio out. The harness stays text-native. Audio is only at the edges. You can inspect the transcript at any step, which makes debugging clean. The tradeoff is latency: you're chaining three models in sequence.
```python
import openai

client = openai.OpenAI()

def transcribe_audio_tool(audio_file_path: str) -> str:
    with open(audio_file_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="text"
        )
    return transcript  # plain string, drops into messages array as text
The transcript is a string. It drops directly into the messages array as text. No special content block handling needed on the output side.
The second pattern is native speech-to-speech: OpenAI Realtime API or Gemini Live API. Audio in, model processes audio natively, audio out. Lower latency, preserves emotional prosody, harder to debug. Tool-call accuracy is slightly lower. ElevenLabs Flash v2.5 gets the end-to-end latency to around 75ms when you co-locate transcription, reasoning, and synthesis.
For most agent harnesses, cascading is the right call. The latency hit is real but manageable, and you get a clean transcript and full control over the reasoning chain. Native speech-to-speech makes more sense when the product is the conversation itself, not the task completion underneath it.
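The cascade is easy to express as three injected stages, which is exactly what makes it debuggable: every stage's text output is inspectable. This sketch uses stand-in callables where Whisper, the harness loop, and a TTS client would plug in; nothing here is a real API.

```python
def cascade_turn(audio_path, transcribe, run_agent, synthesize):
    """One cascading voice turn: STT -> text agent loop -> TTS.

    transcribe, run_agent, and synthesize are injected, so each stage
    can be swapped or logged independently. The transcript and reply
    text are returned alongside the audio for inspection.
    """
    transcript = transcribe(audio_path)       # e.g. Whisper
    reply_text = run_agent(transcript)        # the text-native harness loop
    reply_audio = synthesize(reply_text)      # e.g. ElevenLabs / Cartesia
    return transcript, reply_text, reply_audio
```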
Video
Video is frames over time. No current model accepts raw video as a stream. Every video workflow comes down to the same thing: extract frames, pass frames as images.
The math is sobering. A 60-second video at 30 fps is 1,800 frames. At roughly 1,334 tokens per frame (1000x1000), that's 2.4 million tokens. No context window handles that. You have to choose which frames to send.
The practical limit for standard context windows: 3 to 5 frames per query. Extended context models (200k+) can handle more, but not dramatically more at image token rates.
```python
import os
import subprocess

def extract_video_frames_tool(video_path: str, fps: float = 0.5) -> list[str]:
    """Extract frames at given FPS. Returns list of image file paths."""
    os.makedirs("/tmp/frames", exist_ok=True)
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path,  # -y must precede the output file
        "-vf", f"fps={fps}",
        "-q:v", "2",
        "/tmp/frames/frame_%04d.jpg",
    ])
    frames = sorted(
        f"/tmp/frames/{f}" for f in os.listdir("/tmp/frames")
        if f.endswith(".jpg")
    )
    return frames
```
0.5 fps on a 60-second clip gives you 30 frames, roughly 40,000 tokens. That's a meaningful chunk of context for a single tool result. The harness controls how many frames get injected. This is a design decision that belongs in your tool implementation, not something to leave the model to figure out.
For video where specific moments matter, scene detection (PySceneDetect or ffmpeg's scene filter) extracts frames only when the scene changes. That reduces frame count significantly on structured content like screen recordings and presentations.
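A sketch of the ffmpeg invocation for scene-based extraction, using the select filter's scene-change score. The 0.3 threshold is a common starting point, not a universal constant; tune it per content type.

```python
def scene_change_frames_cmd(video_path: str, out_dir: str,
                            threshold: float = 0.3) -> list[str]:
    """Build an ffmpeg command that keeps a frame only when the
    scene-change score exceeds `threshold`."""
    return [
        "ffmpeg", "-y", "-i", video_path,
        "-vf", f"select='gt(scene,{threshold})'",
        "-vsync", "vfr",                    # emit only the selected frames
        f"{out_dir}/scene_%04d.jpg",
    ]
```

Run it with `subprocess.run(scene_change_frames_cmd("talk.mp4", "/tmp/scenes"))`; on a screen recording this often cuts 30 periodic frames down to a handful of genuinely distinct ones.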
Query-aware keyframe selection is worth knowing about. FOCUS selects frames relevant to the specific question being asked and reaches competitive accuracy with 40% of frames retained. Token compression approaches (STORM, FlashVLM) are moving through research toward production. For 2026, frame extraction plus scene detection is still the practical approach for most teams.
Long-form video (meeting recordings, lecture captures) needs an ingestion pipeline, not direct model feeding. The MMCTAgent pattern from Microsoft Research is a reasonable reference: transcribe the audio track, identify keyframes, chunk into semantic chapters, then run a multi-agent planner-critic loop for question answering. The model never sees raw video. It sees structured outputs from the pipeline.
Computer use
Computer use is a perception-action loop over a GUI. The model sees the screen, decides what to do, calls an action tool, the harness executes it, captures the new screen state, and the loop continues.
To be clear about one thing: Claude cannot execute actions itself. The harness implements every click, keystroke, and scroll. Claude decides what to do. The harness does it.
The beta header for current models (Opus 4.6, Sonnet 4.6) is computer-use-2025-11-24. Older header for Sonnet 3.7 and earlier: computer-use-2025-01-24. The spec defines three tools: computer for screenshot, click, type, key, and scroll actions; text_editor for file operations; and bash for shell commands (the latter two are optional).
```python
tools = [
    {
        "type": "computer_20251022",
        "name": "computer",
        "display_width_px": 1920,
        "display_height_px": 1080,
        "display_number": 1
    }
    # text_editor and bash are optional
]
```
A minimal computer use loop:
```python
import anthropic, base64, subprocess

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Open Firefox and go to example.com"}]

while True:
    response = client.beta.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        tools=tools,
        messages=messages,
        betas=["computer-use-2025-11-24"]
    )
    if response.stop_reason == "end_turn":
        break
    tool_results = []  # collect one result per tool_use block
    for block in response.content:
        if block.type == "tool_use" and block.name == "computer":
            action = block.input["action"]
            if action == "screenshot":
                subprocess.run(["screencapture", "-x", "/tmp/screen.png"])
                with open("/tmp/screen.png", "rb") as f:
                    data = base64.b64encode(f.read()).decode()
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [{
                        "type": "image",
                        "source": {"type": "base64", "media_type": "image/png", "data": data}
                    }]
                })
            elif action == "left_click":
                coords = block.input["coordinate"]
                subprocess.run(["cliclick", f"c:{coords[0]},{coords[1]}"])
                tool_results.append({"type": "tool_result", "tool_use_id": block.id, "content": "clicked"})
            # handle other actions here
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
```
Anthropic publishes a reference Docker container with example tool implementations. That's the fastest way to get a working environment without building the system interaction layer from scratch.
If you're evaluating alternatives, the two worth knowing about in 2026 are OmniParser (Microsoft) and UI-TARS (ByteDance).
OmniParser converts screenshots into structured UI element descriptions before passing to the model. It uses YOLO for icon detection and Florence2 for recognition. Achieves 39.5% on the ScreenSpot Pro benchmark. Useful when you want structured grounding data alongside the raw image.
UI-TARS is a different approach entirely. It's an end-to-end vision-language model trained specifically for GUI interaction, not a pipeline wrapping a general model. UI-TARS-1.5 scores 61.6% on ScreenSpot Pro. Claude's computer use scores 27.7% on the same benchmark. That gap is real and worth noting if raw GUI navigation accuracy is your core requirement. UI-TARS-2 adds multi-turn RL with a data flywheel, so that gap may widen before it narrows.
For most product use cases, Claude's computer use integrates more cleanly into a harness that's already using the Anthropic stack. If GUI accuracy is the main thing you're selling, UI-TARS is worth a close look.
Context management
Text can be compressed. Summarize 10,000 words into 500 and the information survives. Images can't. A screenshot is always a screenshot-sized chunk of tokens. You can't losslessly compress a 1,334-token image into 200 tokens without losing the visual data.
This is the constraint that matters most in multimodal context management. Image-heavy harnesses hit limits faster and can't compress their way out the way text harnesses can.
A few things that actually work in production:
Keep a sliding window of the last 3 to 5 screenshots and drop the oldest each time a new one comes in. The model rarely needs to reference screen state from more than a few turns back.
Once a screenshot has been used for a click or decision, replace it in the messages array with a short text note: "Previous screen: Firefox showing the example.com login page." The visual data is gone but the reference survives. For most multi-turn tasks that's fine, and it saves the full token cost going forward.
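That demotion step can be sketched as a pure function over the message shapes shown earlier (the helper is illustrative and returns a copy rather than mutating, so the original history stays intact for logging):

```python
def demote_image_blocks(message: dict, note: str) -> dict:
    """Return a copy of a message with every image block replaced by
    a short text note, so the reference survives but the token cost
    doesn't. Recurses into tool_result blocks, which nest content."""
    def demote(blocks):
        out = []
        for b in blocks:
            if b.get("type") == "image":
                out.append({"type": "text", "text": note})
            elif isinstance(b.get("content"), list):  # e.g. tool_result
                out.append({**b, "content": demote(b["content"])})
            else:
                out.append(b)
        return out

    content = message["content"]
    if isinstance(content, str):
        return message
    return {**message, "content": demote(content)}
```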
Before injecting consecutive screenshots, check if they're nearly identical. A pixel diff or image hash catches most cases. If there's nothing new on the screen, there's no reason to pay the token cost.
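A minimal version catches exact repeats with a byte-level hash; a perceptual hash (e.g. the imagehash library's average_hash) is the upgrade path for near-duplicates. The class name is illustrative.

```python
import hashlib

class ScreenshotDeduper:
    """Skip a screenshot when its bytes match the previous capture.

    Byte-level hashing only catches identical frames, which is the
    common case when nothing on screen has changed between actions.
    """
    def __init__(self):
        self._last = None

    def is_new(self, image_bytes: bytes) -> bool:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest == self._last:
            return False
        self._last = digest
        return True
```

In the loop: capture, call `is_new`, and only build the image tool_result when it returns True; otherwise inject a one-line text note saying the screen is unchanged.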
Prompt caching is worth adding for images that stay stable throughout the session: a UI style guide, a reference diagram. Anthropic's caching cuts the cost by 90% on cache hits, but the prompt prefix has to stay stable across turns for it to work.
Resize before encoding. 1280x800 is plenty for most UI navigation tasks. Fewer pixels means fewer tokens.
For budget tracking across a session, a simple allocation policy works well: reserve 20% of context for model output, 30% for text tool results, and cap the remaining 50% on images with a hard limit on image count. Track the count explicitly in your messages loop and evict oldest images when the budget is exceeded.
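One possible shape for that tracker, with oldest-first eviction (names and the exact policy knobs are illustrative):

```python
class ContextBudget:
    """Track image token spend against a share of the context window."""

    def __init__(self, context_window: int, image_share: float = 0.5,
                 max_images: int = 5):
        self.image_budget = int(context_window * image_share)
        self.max_images = max_images
        self.images: list[tuple[int, int]] = []  # (message_index, tokens)

    def add_image(self, message_index: int, tokens: int) -> list[int]:
        """Record an image; return message indices whose images should
        be evicted (oldest first) to get back under budget."""
        self.images.append((message_index, tokens))
        evict = []
        while self.images and (
            sum(t for _, t in self.images) > self.image_budget
            or len(self.images) > self.max_images
        ):
            idx, _ = self.images.pop(0)
            evict.append(idx)
        return evict
```

The harness calls `add_image` when it injects a screenshot and applies the returned indices with whatever eviction it uses (dropping the block, or demoting it to a text note).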
Tool design for multimodal inputs
Tool descriptions need to be explicit about what type of data to pass. "Takes a screenshot and analyzes it" is not enough. The model reads descriptions to understand how to call the tool. Give it something concrete:
```json
{
  "name": "capture_screen",
  "description": "Captures the current screen state. Returns an image content block of the screen. Call this when you need to see the current UI state before taking action.",
  "input_schema": {
    "type": "object",
    "properties": {},
    "required": []
  }
}
```
For tools that accept binary inputs, use file paths or file IDs in the schema. Never raw bytes. JSON can't carry binary data. The harness resolves paths to actual data internally.
```json
{
  "name": "read_pdf",
  "description": "Reads a PDF file and extracts text and structure by page. Use when you need to read a document.",
  "input_schema": {
    "type": "object",
    "properties": {
      "file_path": {
        "type": "string",
        "description": "Absolute path to the PDF file"
      },
      "pages": {
        "type": "string",
        "description": "Page range to read, e.g. '1-5' or '3'. Omit for all pages."
      }
    },
    "required": ["file_path"]
  }
}
```
When a tool processes visual input, you have a choice: return the image or return a text description. Return the image when the model needs spatial reasoning (click coordinates, UI navigation) or when visual structure matters (charts, diagrams, handwriting). Return text when you're summarizing processed content or when the downstream task is text-based and the image itself isn't needed anymore.
A hybrid that returns both works well when you're not sure which the model will need:
```python
def analyze_image_tool(file_path: str) -> list:
    description = vision_model.describe(file_path)
    image_block = read_image_as_block(file_path)
    return [
        image_block,
        {"type": "text", "text": f"Description: {description}"}
    ]
```
The cleanest way to handle all of this is through a protocol. Each modality (image, audio, PDF, URL) implements a shared ModalityInput protocol with a to_content_block() method. The harness loop doesn't care what type of input it's processing. It calls to_content_block() and appends the result. Adding a new modality is adding a new implementation, not changing the loop.
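A sketch of that protocol in Python, with two illustrative implementations (the names are not from any framework):

```python
from typing import Protocol
import base64

class ModalityInput(Protocol):
    def to_content_block(self) -> dict: ...

class TextInput:
    def __init__(self, text: str):
        self.text = text

    def to_content_block(self) -> dict:
        return {"type": "text", "text": self.text}

class ImageFile:
    """One concrete modality: a local image file, base64-encoded."""
    def __init__(self, path: str, media_type: str = "image/png"):
        self.path, self.media_type = path, media_type

    def to_content_block(self) -> dict:
        with open(self.path, "rb") as f:
            data = base64.standard_b64encode(f.read()).decode("utf-8")
        return {"type": "image",
                "source": {"type": "base64",
                           "media_type": self.media_type,
                           "data": data}}

def append_inputs(messages: list, inputs: list[ModalityInput]) -> None:
    """The loop never branches on modality; it just collects blocks."""
    messages.append({"role": "user",
                     "content": [i.to_content_block() for i in inputs]})
```

Adding audio or PDF support then means writing one more class with a `to_content_block()` method; the loop itself stays untouched.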
Framework support in 2026
| Framework | Images | PDFs | Audio | Video | Computer use |
|---|---|---|---|---|---|
| Claude (Anthropic) | Native (base64, URL, Files API) | Native, up to 100 pages visual | Not native; needs Whisper | Frame extraction only | Beta API |
| OpenAI (GPT-5.5, o3) | Native (image_url blocks) | Assistants API only | Native Realtime API | Frame extraction only | Responses API (GPT-5.4) |
| Google ADK (Gemini 3.1 Live) | Native | Native | Native bidirectional streaming | JPEG frames via send_realtime() | Not native |
| LangChain | Via MultiModalPromptTemplate | Document loaders | Integration layer | Video API integrations | Via tool plugins |
| LlamaIndex | Via document nodes | Strong; leading document processing | Limited | Limited | Limited |
| Pydantic-AI | ImageUrl, BinaryContent | DocumentUrl | Not native | Frame extraction only | Not native |
Google ADK has the most complete real-time multimodal streaming of any major framework. Over 6 trillion tokens per month processed through Gemini via ADK. If you need bidirectional audio streaming without building it yourself, ADK is ahead. The tradeoff is a strong Google Cloud dependency and more complexity to self-host.
Pydantic-AI is the cleanest option for provider-agnostic multimodal. Tools return BinaryContent and ImageUrl types directly. The framework injects them into the messages array. Version 0.0.38+ covers full image and document handling across Anthropic, OpenAI, and Gemini.
LlamaIndex is worth calling out specifically for document work. Its PDF processing pipeline is deeper than what you'd build from scratch: OCR, layout analysis, table extraction, and tight integration with PyMuPDF4LLM. If the agent's primary job is document intelligence, LlamaIndex's tooling in this area is ahead of the others.
What's still missing
No framework fully solves multimodal context management yet. A few gaps that every production harness runs into:
No framework tells you the exact token cost of an image before you send it. You estimate. The estimates are close but not exact, and exact matters when you're managing context tightly.
Context management is still entirely manual. There's no built-in image-aware sliding window. You write the eviction logic, set the policy, handle the edge cases.
Cross-modal memory is unsolved. Remembering what was seen in an image for later reference without keeping the image in context requires summarizing and discarding the visual data. If the model needs to revisit a spatial detail from turn 10 at turn 40, you either keep the image in context the whole time or re-capture it.
Video is still not a first-class input type in any framework. All of them require frame extraction. None treat video as a stream the model can reason over continuously.
And returning audio from a tool (generated speech, for example) and having the harness play it is custom plumbing in every stack. There's no standard pattern for this yet.
These aren't complaints. They're just an honest read of where things are in mid-2026. The models have outpaced the harnesses. The harnesses are catching up.
The harness has to grow to meet the model
When Claude can see, the harness has to move images. When Claude can click, the harness has to execute clicks. When Claude can read a PDF, the harness has to understand that a 50-page document costs 150,000 tokens before any reasoning starts, and plan accordingly.
The model's capabilities aren't the bottleneck anymore. The harness is.
Next in the series: Harness Failure Modes, covering what actually breaks when the harness is poorly designed and how to catch it before it hits production.
References and sources
Anthropic documentation
OpenAI documentation
- OpenAI Vision and Images API
- OpenAI Speech to Text (Whisper)
- OpenAI Realtime API: Voice Agent Production Guide 2026
Google ADK
- ADK Streaming and Gemini Live API Toolkit
- Building a Multimodal Agent with ADK and Gemini Flash Live 3.1
GUI agents and computer use
- UI-TARS paper (ByteDance)
- UI-TARS Desktop (open source)
- OmniParser (Microsoft)
- Magma Foundation Model (Microsoft)
- Awesome GUI Agent List (showlab)
- GUI Agents Paper List (OSU NLP Group)
PDF tools and processing
- PyMuPDF4LLM documentation
- pdf-mcp: Solving Claude's large PDF limitations
- Best PDF-to-Markdown tools 2026
Pydantic-AI multimodal
Research papers
- When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression (TMLR 2026)
- FOCUS: Query-aware keyframe selection for long video
- MMCTAgent: Multimodal reasoning over large video collections (Microsoft Research)
- Token-efficient multimodal reasoning via image prompt packaging
Context management
- Managing context in long-run agentic applications (Slack Engineering)
- Context window management for multi-agent AI
Voice and audio
Python libraries and frameworks
- Python libraries for multimodal agents 2026 (GetStream)
- Best visual AI agents in 2026 (DEV Community)
Pricing and token costs
Rahul Kashyap is CTO & Co-founder at Designare Solutions and DeepStory, based in Bangalore.