Giving Claude eyes: OCR vs UI Automation vs screenshots

by Fireal Software · ~8 min read

There are three ways an AI agent can “see” what’s on a Windows desktop, and they have dramatically different cost/reliability profiles. If you’re wiring Claude Code or any LLM into desktop automation, knowing when to use each is the single biggest lever on your token bill and your success rate.

This post breaks down the trade-offs with real numbers and recommends a priority order.

The three approaches

1. UI Automation (UIA)

Windows exposes an accessibility tree through a COM interface called UI Automation. Every native control — buttons, text fields, menus, list items — is a node in this tree, with properties like Name, ControlType, BoundingRectangle, and Value. Screen readers like Narrator and NVDA use this tree to tell blind users what’s on the screen.

An agent can enumerate the tree, find a node by name or control type, and get its exact pixel coordinates — without ever looking at the pixels.
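
To make that concrete, here is a minimal sketch of reading the UIA tree directly with pywinauto, one Python wrapper over the same COM interface. This is not eyehands code; the window title and button name are placeholders for whatever app and control you're targeting.

```python
# Illustrative only: pywinauto wraps the same UIA COM interface described above.
# The window title and control name are placeholders, not eyehands internals.
from pywinauto import Desktop

# Attach to an already-open window through the UIA backend
win = Desktop(backend="uia").window(title_re=".*Notepad")

# Walk the accessibility tree: every control reports its name, type, and
# bounding rectangle, with no pixels or vision model involved
for ctrl in win.descendants():
    info = ctrl.element_info
    print(info.control_type, repr(info.name), info.rectangle)

# Or resolve one control directly and click its exact center
win.child_window(title="OK", control_type="Button").click_input()
```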

Cost: Near-zero tokens. One function call returns JSON like {"name": "OK", "rect": [380, 230, 430, 260], "center": [405, 245]}. That’s ~100 tokens of JSON total.

Reliability: Highest. Windows is telling you where the button is. No vision model, no OCR, no guessing.

Works for: Native Windows apps (Notepad, Settings, Control Panel, File Explorer), WinForms and WPF applications, most Electron apps (they expose a UIA tree via Chromium’s accessibility layer), legacy COM apps.

Fails for: Games using DirectX/OpenGL/Vulkan directly, apps that disable UIA for security reasons, canvas-rendered web apps, apps with broken accessibility (sadly common).

2. OCR

Run optical character recognition against a screenshot and return the pixel coordinates of matching text. Modern OCR (EasyOCR, Tesseract v5, PaddleOCR) is pretty good — 90%+ on printed text in most fonts.
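
As a rough illustration of that step (not the eyehands implementation), here is what finding click coordinates with EasyOCR looks like against a saved screenshot; the file name and search string are placeholders.

```python
# Illustrative sketch using EasyOCR directly; eyehands exposes this kind of
# lookup behind its /find endpoint, so details here are assumptions.
import easyocr

reader = easyocr.Reader(["en"])              # loads the English model once
results = reader.readtext("screenshot.png")  # list of (bbox, text, confidence)

# Find the word "OK" and compute the center of its bounding box
for bbox, text, conf in results:
    if text.strip() == "OK" and conf > 0.5:
        xs = [point[0] for point in bbox]
        ys = [point[1] for point in bbox]
        center = (sum(xs) / 4, sum(ys) / 4)
        print("click at", center)
```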

Cost: Low. The OCR runs on your local machine, not in the LLM. The agent sends a query string and receives back JSON coordinates — ~150 tokens total. No image goes to the LLM.

Reliability: Medium-high for clear text. OCR handles regular fonts at normal sizes well. Struggles with tiny text, decorative fonts, low contrast, and complex backgrounds.

Works for: Anything with visible, readable text. Games with text HUDs, web apps, PDF viewers, terminal emulators, dashboards, screen-reader-hostile apps that at least show text visually.

Fails for: Icon-only UIs, very small text (below ~10px), heavily antialiased text on variable backgrounds, non-Latin scripts in some OCR engines.

3. Screenshots + vision model

Send a screenshot to Claude’s (or another LLM’s) vision model and ask “where is the OK button?” The model analyzes the pixels and returns approximate coordinates.
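
A sketch of that call with the Anthropic Python SDK is below. The model name and prompt are illustrative, and note that you still have to parse coordinates out of free-form text, which is where the "approximate" part comes from.

```python
# Illustrative sketch of the vision-model path using the Anthropic SDK.
# Model name and prompt wording are placeholders; parsing the reply is up to you.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("screenshot.jpg", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-5",   # any vision-capable model works here
    max_tokens=200,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
            {"type": "text",
             "text": "Where is the OK button? Reply with x,y pixel coordinates only."},
        ],
    }],
)
print(message.content[0].text)  # e.g. "405,245" -- approximate, not guaranteed
```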

Cost: High. A 1920×1080 JPEG at quality 80 costs ~1500 input tokens. You pay this every interaction, and you often need to screenshot twice (before and after an action, to verify the state changed).

Reliability: Medium. Vision models are good at this but not perfect. Off-by-10-pixel errors are common on dense UIs. Small buttons on high-DPI displays are especially error-prone.

Works for: Anything. This is the universal fallback — if the UI has any visible elements, a vision model can theoretically find them.

Fails for: Nothing conceptually. But it fails in practice more than you’d expect — small targets, antialiased edges, themed controls, uncommon UI patterns all trip it up.

The real token cost comparison

For a typical “click OK in a dialog” interaction:

| Approach | Input tokens | Output tokens | Total |
|---|---|---|---|
| UIA (`/ui/click_element`) | ~100 | ~30 | ~130 |
| OCR (`/find` → `/click_at`) | ~200 | ~50 | ~250 |
| Screenshot + vision | ~1500 | ~200 | ~1700 |

Multiply by the number of interactions in a session and the difference becomes stark. For a 20-interaction automation session:

- UIA: ~2,600 tokens
- OCR: ~5,000 tokens
- Screenshot + vision: ~34,000 tokens

These are rough numbers and your actual costs will vary, but the ratio is real: UIA is ~13× cheaper than screenshots, OCR is ~7× cheaper.

The reliability hierarchy

In order of most-to-least reliable:

1. UI Automation: Windows itself reports the control and its coordinates.
2. OCR: solid on clear, readable text; degrades on tiny, stylized, or low-contrast text.
3. Screenshot + vision: works on anything in principle, but the coordinates it returns are approximate.

The “failure mode” matters too. UIA fails loudly — the control isn’t there. OCR fails loudly — the text isn’t found. Vision models fail quietly — they return coordinates that are wrong, and you don’t know it until the click lands on the wrong button.

The right priority order

Based on cost and reliability:

1. Try UI Automation first. It is the cheapest and the most reliable whenever the control is in the tree.
2. Fall back to OCR when there is no usable UIA node but the target has visible text you can name.
3. Fall back to a screenshot plus the vision model only when neither of the above can find the target.

This is the exact rule codified in eyehands’ packaged SKILL.md. When the skill is installed, Claude Code will default to this order without you having to remind it every session.
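
Here is a sketch of that priority order as a fallback chain against the local server. The endpoint paths come from this post, but the port, HTTP methods, parameter names, and response fields are assumptions; check the eyehands docs for the real API.

```python
# Sketch of the UIA -> OCR -> screenshot fallback chain. Endpoint paths are
# from the post; port, methods, parameter names, and response shapes are
# assumptions, not the documented eyehands API.
import requests

BASE = "http://127.0.0.1:8000"   # assumed port

def click(target: str) -> bool:
    # 1. UIA: ask Windows where the control is (cheapest, most reliable)
    r = requests.post(f"{BASE}/ui/click_element", json={"name": target})
    if r.ok:
        return True

    # 2. OCR: look for the text in a locally captured screenshot
    r = requests.get(f"{BASE}/find", params={"text": target})
    if r.ok and r.json().get("center"):
        x, y = r.json()["center"]
        return requests.post(f"{BASE}/click_at", json={"x": x, "y": y}).ok

    # 3. Vision fallback: capture a screenshot and let the model reason over it
    requests.get(f"{BASE}/screenshot")
    return False

click("OK")
```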

When screenshots are actually the right choice

There are legitimate cases for the vision-model path:

- The target has no text and no UIA node: icon-only toolbars, games, canvas-rendered UIs.
- You need to judge how something looks rather than where it is: layout, images, charts, rendering glitches.
- A UIA or OCR call failed and you need to see the screen to understand why.

For these cases, eyehands has /screenshot, /latest, and /view (a live HTML page with click-through). The key is to use them deliberately, not as the default.

The quick decision flowchart

1. Do you need to click/read a native Windows control?
   → Yes: try /ui/click_element or /ui/find first
   → No, it's non-native: go to step 2

2. Does the target have visible text you can name?
   → Yes: use /find?text=...
   → No: go to step 3

3. Do you need to see pixels to answer the question?
   → Yes: use /screenshot or /latest
   → No: you probably can't automate this without more context

Install

pip install eyehands
eyehands --install-skill   # teaches Claude the priority order
eyehands

*The UIA endpoints (`/ui/*`) are in eyehands' Pro tier ($19 one-time). The OCR endpoint (`/find`) and screenshot endpoints are free. If you want the full priority chain — including the "cheapest, most reliable" UIA path — Pro is where you'll spend the least on Claude tokens over time.*

Give Claude eyes and hands on Windows

eyehands is a local HTTP server for screen capture, mouse control, and keyboard input. Open source with a Pro tier.

Try eyehands