Claude Code token cost on desktop automation: screenshots vs UIA

by Fireal Software

I ran the same Windows automation task with Claude Code two ways: (1) screenshots-first, the way most people naturally prompt it, and (2) UIA-first, with eyehands’ SKILL.md loaded. The UIA-first run used roughly a quarter of the tokens (~8,000 vs ~34,000), a ~4× difference in token cost. This post walks through the exact numbers and where they come from.

The task: open Settings, navigate to “Apps → Default apps”, find “Web browser”, and change it to Firefox. Then verify the change by reading the label. About 8 distinct UI interactions.

Approach 1: screenshots-first

This is what Claude Code defaults to if you just tell it “change my default browser to Firefox” and it has Chrome-MCP or any vision-capable tool. It:

Measured token cost for the full task: ~34,000 tokens (input + output, across all interactions).

Breakdown:

At Sonnet 4.6 pricing, that’s roughly $0.40 per run. Not catastrophic, but it adds up over a day.

Approach 2: UIA-first with eyehands SKILL.md

With the skill installed (eyehands --install-skill), Claude Code’s priority order becomes UIA → OCR → screenshots. For the same task:

Measured token cost for the full task: ~8,000 tokens (input + output).

Breakdown:

At Sonnet 4.6 pricing, that’s roughly $0.10 per run, about 4× cheaper than the screenshots-first run.
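To make the comparison explicit, here is a quick back-of-the-envelope check using only the totals quoted above. The `runs_per_day` argument is a hypothetical figure for illustration, not something measured in this post.

```python
# Totals quoted in the post for one run of the task.
SCREENSHOT_FIRST = {"tokens": 34_000, "usd": 0.40}
UIA_FIRST = {"tokens": 8_000, "usd": 0.10}

token_ratio = SCREENSHOT_FIRST["tokens"] / UIA_FIRST["tokens"]
cost_ratio = SCREENSHOT_FIRST["usd"] / UIA_FIRST["usd"]


def daily_saving_usd(runs_per_day: int) -> float:
    """Estimated dollars saved per day; runs_per_day is a hypothetical input."""
    per_run = SCREENSHOT_FIRST["usd"] - UIA_FIRST["usd"]
    return runs_per_day * per_run


print(f"token ratio: {token_ratio:.2f}x")
print(f"cost ratio:  {cost_ratio:.2f}x")
print(f"saving at 50 runs/day: ${daily_saving_usd(50):.2f}")
```

The token ratio comes out slightly above 4× because the totals are rounded; the per-run dollar figures are rounded too.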

Where the savings come from

1. UIA calls return JSON, not pixels. {"name": "Settings", "rect": [45, 720, 120, 770], "center": [82, 745]} is ~80 tokens. A screenshot is ~1500. Every time Claude can find an element via UIA instead of vision, it saves ~1400 tokens.
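The per-lookup saving is simple subtraction, and it scales linearly with the number of lookups. A minimal sketch, using the approximate token counts quoted above (both are estimates, not exact tokenizer output):

```python
# Approximate figures from the post.
SCREENSHOT_TOKENS = 1500
UIA_RESULT_TOKENS = 80


def tokens_saved(lookups: int) -> int:
    """Estimated tokens saved when `lookups` element searches
    return UIA JSON instead of a screenshot."""
    return lookups * (SCREENSHOT_TOKENS - UIA_RESULT_TOKENS)


print(tokens_saved(1))   # 1420, i.e. the "~1400" above
print(tokens_saved(12))  # 17040
```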

2. OCR runs locally. When UIA doesn’t work and eyehands falls back to /find?text=..., the OCR runs on your machine and only the JSON result goes back to Claude. The agent never sees the pixels. Savings: same ~1400 tokens per lookup.
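For illustration, a lookup against the /find endpoint might be issued like this. Only the /find?text=... path comes from the post; the host, port, and the shape of the JSON response are assumptions, so check them against your own eyehands install.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: placeholder address, not a documented eyehands default.
BASE_URL = "http://127.0.0.1:8000"


def find_url(text: str, base_url: str = BASE_URL) -> str:
    """Build the /find URL, URL-encoding the text to search for."""
    return f"{base_url}/find?{urlencode({'text': text})}"


def find_text(text: str, base_url: str = BASE_URL, timeout: float = 5.0):
    """Ask the local server to OCR the screen and return the parsed JSON.
    Only this small JSON result would end up in the model's context."""
    with urlopen(find_url(text, base_url), timeout=timeout) as resp:
        return json.load(resp)


print(find_url("Web browser"))
```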

3. Frame caching. eyehands caches OCR results per frame hash. If Claude calls /find five times while reasoning about the next step, only the first call actually runs EasyOCR; the other four return the cached result. At ~150 tokens of JSON per call, those five lookups cost ~750 tokens, versus ~7,500 for five screenshots.
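The caching idea can be sketched in a few lines. This is not eyehands' actual implementation: the class name, the SHA-256 hash, and the stand-in OCR function are all assumptions chosen to show the technique.

```python
import hashlib
from typing import Callable, Dict, List


class FrameOcrCache:
    """Cache OCR results keyed by a hash of the frame's bytes."""

    def __init__(self, run_ocr: Callable[[bytes], List[dict]]):
        self._run_ocr = run_ocr
        self._results: Dict[str, List[dict]] = {}
        self.ocr_runs = 0  # how many times the expensive OCR actually ran

    def lookup(self, frame: bytes) -> List[dict]:
        key = hashlib.sha256(frame).hexdigest()
        if key not in self._results:
            # Cache miss: run OCR once and remember the result.
            self._results[key] = self._run_ocr(frame)
            self.ocr_runs += 1
        return self._results[key]


def fake_ocr(frame: bytes) -> List[dict]:
    """Hypothetical stand-in for a real OCR engine such as EasyOCR."""
    return [{"text": "Web browser", "center": [300, 410]}]


cache = FrameOcrCache(fake_ocr)
frame = b"identical pixels"
for _ in range(5):
    cache.lookup(frame)
print(cache.ocr_runs)  # 1: only the first lookup ran OCR
```

A real cache would also need an eviction policy so it does not grow without bound as the screen changes.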

4. click_and_wait eliminates verification screenshots. After clicking, Claude can call POST /click_and_wait and get {"changed": true} back. No screenshot needed to “see if the click worked”. Savings: one screenshot per interaction (~1500 tokens).
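A request to this endpoint might look roughly like the sketch below. The POST /click_and_wait path and the {"changed": true} response come from the post; the base URL and the `x`/`y` body fields are assumptions, so treat the payload shape as hypothetical.

```python
import json
from urllib.request import Request, urlopen

# Assumption: placeholder address, not a documented eyehands default.
BASE_URL = "http://127.0.0.1:8000"


def build_click_and_wait(x: int, y: int, base_url: str = BASE_URL) -> Request:
    """Build (but do not send) the POST request.
    The "x" and "y" field names are assumed for illustration."""
    body = json.dumps({"x": x, "y": y}).encode("utf-8")
    return Request(
        f"{base_url}/click_and_wait",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def click_changed_screen(x: int, y: int, timeout: float = 10.0) -> bool:
    """Send the click and report whether the server saw the screen change."""
    with urlopen(build_click_and_wait(x, y), timeout=timeout) as resp:
        return bool(json.load(resp).get("changed", False))


req = build_click_and_wait(82, 745)
print(req.get_method(), req.full_url)
```

The point is that the agent receives a one-field JSON answer instead of a ~1500-token image.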

The counterintuitive result

The biggest savings don’t come from the obvious place. I expected the wins to be from “don’t screenshot at all”. What actually happened:

Mid-task screenshots where I genuinely needed visual state were rare: only 2 out of 14 in my test. The other 12 were replaceable with UIA calls or click_and_wait.

When screenshots were legitimately needed

In the one test where I had to screenshot:

Even then, /latest with frame-hash polling would have been cheaper than /screenshot because the frame buffer was already populated from the background capture thread. Using /screenshot forces a fresh on-demand capture, while /latest returns whatever the 20-fps background capture has already grabbed.
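Frame-hash polling can be sketched as a small loop: keep asking for the current frame's hash and fetch the image only once it changes. How /latest exposes that hash is not described here, so the `get_frame_hash` callable below is a hypothetical stand-in for whatever request returns it.

```python
import time
from typing import Callable, Optional


def wait_for_new_frame(
    get_frame_hash: Callable[[], str],
    last_hash: str,
    timeout: float = 2.0,
    interval: float = 0.05,  # 50 ms, matching a 20-fps capture rate
) -> Optional[str]:
    """Poll until the frame hash differs from last_hash.
    Returns the new hash, or None if nothing changed before the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = get_frame_hash()
        if current != last_hash:
            return current
        time.sleep(interval)
    return None


# Simulated hash source: the frame changes on the third poll.
hashes = iter(["a", "a", "b"])
print(wait_for_new_frame(lambda: next(hashes), "a", interval=0))  # b
```

Polling a short hash is cheap, and the full frame only needs to be requested when the function returns a new value.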

What your own numbers will look like

Your mileage will vary wildly depending on:

If you’re spending more than $0.50 per Claude Code session on Windows desktop automation, switching to UIA-first will most likely cut that by 3–5×.

The install

pip install eyehands
eyehands --install-skill       # bakes the priority order into Claude Code
eyehands                       # starts the server

The --install-skill step is what makes the savings automatic. Without it, you have to manually prompt Claude to use UIA over screenshots on every task.


*My numbers are from a single machine, a single Claude Code version, and a small sample of tasks. If your real-world measurements look different, [open an issue](https://github.com/shameindemgg/eyehands/issues) and I'll update this post — I'd rather have accurate public numbers than flattering ones.*

Give Claude eyes and hands on Windows

eyehands is a local HTTP server for screen capture, mouse control, and keyboard input. Open source with a Pro tier.
