Claude Code token cost on desktop automation: screenshots vs UIA
by Fireal Software · ~7 min read
I ran the same Windows automation task with Claude Code two ways: (1) screenshots-first, the way most people naturally prompt it, and (2) UIA-first, with eyehands’ SKILL.md loaded. The difference was ~4× on token cost. This post walks through the exact numbers and where they come from.
The task: open Settings, navigate to “Apps → Default apps”, find “Web browser”, and change it to Firefox. Then verify the change by reading the label. About 8 distinct UI interactions.
Approach 1: screenshots-first
This is what Claude Code defaults to if you just tell it “change my default browser to Firefox” and it has Chrome-MCP or any vision-capable tool. It:
- Takes a screenshot of the desktop
- Analyzes the pixels to find the Start menu
- Clicks at guessed coordinates
- Screenshots the new state to verify the click landed
- Analyzes the Start menu pixels to find “Settings”
- Clicks at guessed coordinates
- Screenshots to verify Settings opened
- Analyzes Settings to find “Apps” category
- … and so on
Measured token cost for the full task: ~34,000 tokens (input + output, across all interactions).
Breakdown:
- 14 screenshots × ~1500 input tokens each = ~21,000 image tokens
- ~10,000 tokens of completion across the analysis and reasoning steps
- ~3,000 tokens of other tool calls (bash, file reads, etc.)
At Sonnet 4.6 pricing, that’s roughly $0.40 per run. Not catastrophic, but it adds up over a day.
Approach 2: UIA-first with eyehands SKILL.md
With the skill installed (eyehands --install-skill), Claude Code’s priority order becomes UIA → OCR → screenshots. For the same task:
- `GET /ui/find?name=Start` → returns coordinates of the Start button
- `POST /click_at` with those coordinates
- `POST /click_and_wait` → returns `{"changed": true}` when the menu opens
- `GET /ui/find?name=Settings` → returns Settings button coordinates
- `POST /click_at` + `POST /click_and_wait`
- `GET /ui/find?name=Apps&window=Settings` → finds the Apps sidebar item
- …etc
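Under stated assumptions, the sequence above is just plain HTTP. The base URL below (port 8000) is my assumption — use whatever address eyehands prints on startup; the endpoint paths, parameters, and response shapes are the ones described in this post:

```python
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:8000"  # assumption: check the address eyehands prints

def find_url(name, window=None):
    """Build the GET /ui/find URL for an element lookup."""
    params = {"name": name}
    if window is not None:
        params["window"] = window
    return f"{BASE}/ui/find?{urllib.parse.urlencode(params)}"

def find(name, window=None):
    """Look up an element by its UIA name; returns JSON like
    {"name": ..., "rect": [...], "center": [x, y]}."""
    with urllib.request.urlopen(find_url(name, window)) as resp:
        return json.load(resp)

def click_and_wait(x, y):
    """POST /click_and_wait: click and get {"changed": true/false} back
    instead of screenshotting to verify."""
    req = urllib.request.Request(
        f"{BASE}/click_and_wait",
        data=json.dumps({"x": x, "y": y}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# The sequence above, end to end (requires a running eyehands server):
# start = find("Start");                  click_and_wait(*start["center"])
# settings = find("Settings");            click_and_wait(*settings["center"])
# apps = find("Apps", window="Settings"); click_and_wait(*apps["center"])
```

Every call in that chain moves ~80–150 tokens of JSON instead of a ~1500-token image.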
Measured token cost for the full task: ~8,000 tokens (input + output).
Breakdown:
- 2 screenshots (for the one part where UIA couldn’t find the dropdown: a list-search picker) × ~1500 = ~3,000 image tokens
- ~4,000 tokens of JSON tool call / response traffic
- ~1,000 tokens of reasoning and other tool calls
At Sonnet 4.6 pricing, that’s roughly $0.10 per run. 4× cheaper than screenshots-only.
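The two breakdowns reduce to arithmetic you can check directly:

```python
# Token totals from the two breakdowns above.
screenshots_first = 14 * 1500 + 10_000 + 3_000   # images + reasoning + other tools
uia_first = 2 * 1500 + 4_000 + 1_000             # images + JSON traffic + reasoning

print(screenshots_first, uia_first, screenshots_first / uia_first)
# 34000 8000 4.25  → the "~4×" figure
```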
Where the savings come from
1. UIA calls return JSON, not pixels. {"name": "Settings", "rect": [45, 720, 120, 770], "center": [82, 745]} is ~80 tokens. A screenshot is ~1500. Every time Claude can find an element via UIA instead of vision, it saves ~1400 tokens.
2. OCR runs locally. When UIA doesn’t work and eyehands falls back to /find?text=..., the OCR runs on your machine and only the JSON result goes back to Claude. The agent never sees the pixels. Savings: same ~1400 tokens per lookup.
3. Frame caching. eyehands caches OCR results per frame hash. If Claude calls /find five times while reasoning about the next step, only the first actually runs EasyOCR — the other four return the cached result. At ~150 tokens of JSON per call, that’s still way cheaper than five screenshots.
4. click_and_wait eliminates verification screenshots. After clicking, Claude can call POST /click_and_wait and get {"changed": true} back. No screenshot needed to “see if the click worked”. Savings: one screenshot per interaction (~1500 tokens).
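Two of the mechanisms above — frame-hash caching (point 3) and the change-detecting click (point 4) — are easy to sketch locally. These are hypothetical reimplementations of the ideas, not eyehands' actual code:

```python
import hashlib
import time

_ocr_cache = {}  # frame hash -> OCR results for that exact frame

def find_text(frame_bytes, query, run_ocr):
    """Frame-hash caching: re-run OCR only when the frame content changed."""
    key = hashlib.sha256(frame_bytes).hexdigest()
    if key not in _ocr_cache:
        _ocr_cache[key] = run_ocr(frame_bytes)  # only the first call per frame pays
    return [m for m in _ocr_cache[key] if query.lower() in m["text"].lower()]

def click_and_wait(click, grab_frame, timeout=2.0, interval=0.05):
    """Change-detecting click: compare frame hashes locally and return a
    boolean instead of shipping a verification screenshot to the model."""
    before = hashlib.sha256(grab_frame()).digest()
    click()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if hashlib.sha256(grab_frame()).digest() != before:
            return {"changed": True}
        time.sleep(interval)
    return {"changed": False}
```

Five `/find` calls against an unchanged screen hit the same cache key, so only the first pays for EasyOCR; and a click is verified with a few bytes of JSON instead of ~1500 image tokens.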
The counterintuitive result
The biggest savings don’t come from the obvious place. I expected the wins to be from “don’t screenshot at all”. What actually happened:
- Half the screenshots Claude was taking were verification: “Did my click work?” The `click_and_wait` endpoint eliminated those entirely.
- The other half were “what’s on screen now?” Most of those were answered by a single UIA tree walk that returned all elements in JSON.
Genuinely necessary mid-task screenshots — ones where the visual state couldn’t come from UIA or OCR — were rare: only 2 out of 14 in my test. The other 12 were replaceable with UIA calls or click_and_wait.
When screenshots were legitimately needed
In the one part of the task where a screenshot was unavoidable:
- There was a dropdown with a custom-rendered list that wasn’t exposed through UIA (the “Suggested apps” carousel in Settings on Windows 11)
- I had to see the actual rendered text to find the Firefox entry
Even then, /latest with frame-hash polling would have been cheaper than /screenshot because the frame buffer was already populated from the background capture thread. Using /screenshot forces a fresh on-demand capture, while /latest returns whatever the 20-fps background capture has already grabbed.
What your own numbers will look like
Your mileage will vary wildly depending on:
- How many targets have accessibility trees. Native Windows apps: mostly UIA-friendly. Electron apps: mixed. Games: not at all.
- How aggressively the agent verifies actions. Claude Code’s default is “screenshot to verify”. With `click_and_wait`, verification becomes nearly free.
- Whether you cache. Without frame-hash caching, every `/find` re-OCRs from scratch. eyehands caches; most ad-hoc wiring doesn’t.
- Pricing tier. On Haiku the per-token cost is lower, but the ratio between screenshots and JSON is the same.
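To see how those factors interact, here is a toy cost model. The per-item token counts (~1500 per screenshot, ~150 per JSON find result) come from this post; the model itself — one lookup plus one verification per UI step — is a deliberate oversimplification, not a measurement:

```python
def estimate_tokens(steps, uia_hit_rate, use_click_and_wait,
                    screenshot_tokens=1500, json_tokens=150):
    """Crude per-task estimate: each UI step costs one lookup (JSON when
    UIA resolves it, a screenshot otherwise) plus one verification
    (a screenshot, unless click_and_wait makes it a cheap boolean)."""
    lookup = steps * (uia_hit_rate * json_tokens
                      + (1 - uia_hit_rate) * screenshot_tokens)
    verify = 0 if use_click_and_wait else steps * screenshot_tokens
    return lookup + verify

# The 8-step task from this post: vision-only agent vs. UIA-first.
vision_only = estimate_tokens(8, uia_hit_rate=0.0, use_click_and_wait=False)
uia_first = estimate_tokens(8, uia_hit_rate=0.75, use_click_and_wait=True)
```

Lowering the hit rate (Electron apps, games) or losing `click_and_wait` verification narrows the gap quickly, which is why the ratio varies so much by workload.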
If you’re spending more than $0.50 per Claude Code session on Windows desktop automation, UIA-first is almost certainly a 3–5× reduction.
The install
```shell
pip install eyehands
eyehands --install-skill   # bakes the priority order into Claude Code
eyehands                   # starts the server
```
The --install-skill step is what makes the savings automatic. Without it, you have to manually prompt Claude to use UIA over screenshots on every task.
Links
- eyehands repo: https://github.com/shameindemgg/eyehands
- eyehands docs: https://eyehands.fireal.dev
*My numbers are from a single machine, a single Claude Code version, and a small sample of tasks. If your real-world measurements look different, [open an issue](https://github.com/shameindemgg/eyehands/issues) and I'll update this post — I'd rather have accurate public numbers than flattering ones.*
Give Claude eyes and hands on Windows
eyehands is a local HTTP server for screen capture, mouse control, and keyboard input. Open source with a Pro tier.