Claude Code token cost on desktop automation: screenshots vs UIA
by Fireal Software · ~7 min read
I ran the same Windows automation task with Claude Code two ways: (1) screenshots-first, the way most people naturally prompt it, and (2) UIA-first, with eyehands’ SKILL.md loaded. The difference was ~4× on token cost. This post walks through the exact numbers and where they come from.
The task: open Settings, navigate to “Apps → Default apps”, find “Web browser”, and change it to Firefox. Then verify the change by reading the label. About 8 distinct UI interactions.
Approach 1: screenshots-first
This is what Claude Code defaults to if you just tell it “change my default browser to Firefox” and it has Chrome-MCP or any vision-capable tool. It:
- Takes a screenshot of the desktop
- Analyzes the pixels to find the Start menu
- Clicks at guessed coordinates
- Screenshots the new state to verify the click landed
- Analyzes the Start menu pixels to find “Settings”
- Clicks at guessed coordinates
- Screenshots to verify Settings opened
- Analyzes Settings to find “Apps” category
- … and so on
Measured token cost for the full task: ~34,000 tokens (input + output, across all interactions).
Breakdown:
- 14 screenshots × ~1500 input tokens each = ~21,000 image tokens
- ~10,000 tokens of completion across the analysis and reasoning steps
- ~3,000 tokens of other tool calls (bash, file reads, etc.)
At Sonnet 4.6 pricing, that’s roughly $0.40 per run. Not catastrophic, but it adds up over a day.
Approach 2: UIA-first with eyehands SKILL.md
With the skill installed (eyehands --install-skill), Claude Code’s priority order becomes UIA → OCR → screenshots. For the same task:
- `GET /ui/find?name=Start` → returns coordinates of the Start button
- `POST /click_at` with those coordinates
- `POST /click_and_wait` → returns `{"changed": true}` when the menu opens
- `GET /ui/find?name=Settings` → returns Settings button coordinates
- `POST /click_at` + `POST /click_and_wait`
- `GET /ui/find?name=Apps&window=Settings` → finds the Apps sidebar item
- …etc
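Under stated assumptions, the sequence above is just plain HTTP. The base URL below (port 8000) is my assumption — use whatever address eyehands prints on startup; the endpoint paths, parameters, and response shapes are the ones described in this post:

```python
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:8000"  # assumption: check the address eyehands prints

def find_url(name, window=None):
    """Build the GET /ui/find URL for an element lookup."""
    params = {"name": name}
    if window is not None:
        params["window"] = window
    return f"{BASE}/ui/find?{urllib.parse.urlencode(params)}"

def find(name, window=None):
    """Look up an element by its UIA name; returns JSON like
    {"name": ..., "rect": [...], "center": [x, y]}."""
    with urllib.request.urlopen(find_url(name, window)) as resp:
        return json.load(resp)

def click_and_wait(x, y):
    """POST /click_and_wait: click and get {"changed": true/false} back
    instead of screenshotting to verify."""
    req = urllib.request.Request(
        f"{BASE}/click_and_wait",
        data=json.dumps({"x": x, "y": y}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# The sequence above, end to end (requires a running eyehands server):
# start = find("Start");                  click_and_wait(*start["center"])
# settings = find("Settings");            click_and_wait(*settings["center"])
# apps = find("Apps", window="Settings"); click_and_wait(*apps["center"])
```

Every call in that chain moves ~80–150 tokens of JSON instead of a ~1500-token image.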
Measured token cost for the full task: ~8,000 tokens (input + output).
Breakdown:
- 2 screenshots (for the one part where UIA couldn’t find the dropdown: a list-search picker) × ~1500 = ~3,000 image tokens
- ~4,000 tokens of JSON tool call / response traffic
- ~1,000 tokens of reasoning and other tool calls
At Sonnet 4.6 pricing, that’s roughly $0.10 per run. 4× cheaper than screenshots-only.
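The two breakdowns reduce to arithmetic you can check directly:

```python
# Token totals from the two breakdowns above.
screenshots_first = 14 * 1500 + 10_000 + 3_000   # images + reasoning + other tools
uia_first = 2 * 1500 + 4_000 + 1_000             # images + JSON traffic + reasoning

print(screenshots_first, uia_first, screenshots_first / uia_first)
# 34000 8000 4.25  → the "~4×" figure
```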
Where the savings come from
1. UIA calls return JSON, not pixels. {"name": "Settings", "rect": [45, 720, 120, 770], "center": [82, 745]} is ~80 tokens. A screenshot is ~1500. Every time Claude can find an element via UIA instead of vision, it saves ~1400 tokens.
2. OCR runs locally. When UIA doesn’t work and eyehands falls back to /find?text=..., the OCR runs on your machine and only the JSON result goes back to Claude. The agent never sees the pixels. Savings: same ~1400 tokens per lookup.
3. Frame caching. eyehands caches OCR results per frame hash. If Claude calls /find five times while reasoning about the next step, only the first actually runs EasyOCR — the other four return the cached result. At ~150 tokens of JSON per call, that’s still way cheaper than five screenshots.
4. click_and_wait eliminates verification screenshots. After clicking, Claude can call POST /click_and_wait and get {"changed": true} back. No screenshot needed to “see if the click worked”. Savings: one screenshot per interaction (~1500 tokens).
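Two of the mechanisms above — frame-hash caching (point 3) and the change-detecting click (point 4) — are easy to sketch locally. These are hypothetical reimplementations of the ideas, not eyehands' actual code:

```python
import hashlib
import time

_ocr_cache = {}  # frame hash -> OCR results for that exact frame

def find_text(frame_bytes, query, run_ocr):
    """Frame-hash caching: re-run OCR only when the frame content changed."""
    key = hashlib.sha256(frame_bytes).hexdigest()
    if key not in _ocr_cache:
        _ocr_cache[key] = run_ocr(frame_bytes)  # only the first call per frame pays
    return [m for m in _ocr_cache[key] if query.lower() in m["text"].lower()]

def click_and_wait(click, grab_frame, timeout=2.0, interval=0.05):
    """Change-detecting click: compare frame hashes locally and return a
    boolean instead of shipping a verification screenshot to the model."""
    before = hashlib.sha256(grab_frame()).digest()
    click()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if hashlib.sha256(grab_frame()).digest() != before:
            return {"changed": True}
        time.sleep(interval)
    return {"changed": False}
```

Five `/find` calls against an unchanged screen hit the same cache key, so only the first pays for EasyOCR; and a click is verified with a few bytes of JSON instead of ~1500 image tokens.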
The counterintuitive result
The biggest savings don’t come from the obvious place. I expected the wins to be from “don’t screenshot at all”. What actually happened:
- Half the screenshots Claude was taking were verification: “Did my click work?” The `click_and_wait` endpoint eliminated those entirely.
- The other half were “what’s on screen now?” Most of those were answered by a single UIA tree walk that returned all elements in JSON.
Genuinely necessary mid-task screenshots — ones where the visual state couldn’t come from UIA or OCR — were rare: only 2 out of 14 in my test. The other 12 were replaceable with UIA calls or click_and_wait.
When screenshots were legitimately needed
In the one part of the task where a screenshot was unavoidable:
- There was a dropdown with a custom-rendered list that wasn’t exposed through UIA (the “Suggested apps” carousel in Settings on Windows 11)
- I had to see the actual rendered text to find the Firefox entry
Even then, /latest with frame-hash polling would have been cheaper than /screenshot because the frame buffer was already populated from the background capture thread. Using /screenshot forces a fresh on-demand capture, while /latest returns whatever the 20-fps background capture has already grabbed.
What your own numbers will look like
Your mileage will vary wildly depending on:
- How many targets have accessibility trees. Native Windows apps: mostly UIA-friendly. Electron apps: mixed. Games: not at all.
- How aggressively the agent verifies actions. Claude Code’s default is “screenshot to verify”. With `click_and_wait`, verification becomes nearly free.
- Whether you cache. Without frame-hash caching, every `/find` re-OCRs from scratch. eyehands caches; most ad-hoc wiring doesn’t.
- Pricing tier. On Haiku the per-token cost is lower, but the ratio between screenshots and JSON is the same.
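To see how those factors interact, here is a toy cost model. The per-item token counts (~1500 per screenshot, ~150 per JSON find result) come from this post; the model itself — one lookup plus one verification per UI step — is a deliberate oversimplification, not a measurement:

```python
def estimate_tokens(steps, uia_hit_rate, use_click_and_wait,
                    screenshot_tokens=1500, json_tokens=150):
    """Crude per-task estimate: each UI step costs one lookup (JSON when
    UIA resolves it, a screenshot otherwise) plus one verification
    (a screenshot, unless click_and_wait makes it a cheap boolean)."""
    lookup = steps * (uia_hit_rate * json_tokens
                      + (1 - uia_hit_rate) * screenshot_tokens)
    verify = 0 if use_click_and_wait else steps * screenshot_tokens
    return lookup + verify

# The 8-step task from this post: vision-only agent vs. UIA-first.
vision_only = estimate_tokens(8, uia_hit_rate=0.0, use_click_and_wait=False)
uia_first = estimate_tokens(8, uia_hit_rate=0.75, use_click_and_wait=True)
```

Lowering the hit rate (Electron apps, games) or losing `click_and_wait` verification narrows the gap quickly, which is why the ratio varies so much by workload.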
If you’re spending more than $0.50 per Claude Code session on Windows desktop automation, UIA-first is almost certainly a 3–5× reduction.
The install
```shell
pip install eyehands
eyehands --install-skill   # bakes the priority order into Claude Code
eyehands                   # starts the server
```
The --install-skill step is what makes the savings automatic. Without it, you have to manually prompt Claude to use UIA over screenshots on every task.
Links
- eyehands repo: https://github.com/shameindemgg/eyehands
- eyehands docs: https://eyehands.fireal.dev
*My numbers are from a single machine, a single Claude Code version, and a small sample of tasks. If your real-world measurements look different, [open an issue](https://github.com/shameindemgg/eyehands/issues) and I'll update this post — I'd rather have accurate public numbers than flattering ones.*
Give Claude eyes and hands on Windows
eyehands is a local HTTP server for screen capture, mouse control, and keyboard input. Open source with a Pro tier.