How to make Claude Code click buttons in Windows apps

by Fireal Software · ~8 min read

Claude Code is excellent at editing files, running scripts, and grepping a repo. It is not great at clicking a button in Notepad, reading a legacy WinForms tool’s error dialog, or operating File Explorer — because its built-in tools stop at the terminal.

This post walks through the practical setup for teaching Claude Code to control native Windows applications: what to install, what to tell Claude, and three rules that dramatically reduce token cost.

The setup

Three commands:

pip install eyehands
eyehands --install-skill
eyehands

The first installs the eyehands package from PyPI. The second writes a SKILL.md file into ~/.claude/skills/eyehands/ — this is the Claude Code skill that teaches the agent how to use the server. The third starts the HTTP server on http://127.0.0.1:7331.

That’s the whole install. Claude Code will automatically load the skill on its next session and know how to call the server.

The three rules that matter

When you give Claude the task “click the OK button”, it has three ways to do it. They differ dramatically in cost and reliability. The SKILL.md codifies them as a priority order:

Rule 1: Try UI Automation first

POST /ui/click_element with {"name": "OK"} asks Windows’ built-in accessibility tree where the OK button is and clicks it. No image processing, no OCR, no vision model — just a single call to UIAutomation.dll, a click at the returned coordinates, and a JSON response.

This works for ~80% of native Windows apps: Notepad, File Explorer, Settings, most WinForms/WPF tools, and many Electron apps.

When it fails: games, canvas-rendered web apps, DirectX overlays, apps with broken accessibility trees.
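As a sketch of what that first call looks like from Python (standard library only; the server address and the JSON shape follow the description above, and the response fields are whatever the server returns):

```python
import json
import urllib.request

EYEHANDS = "http://127.0.0.1:7331"  # default eyehands server address


def build_click_request(name: str) -> urllib.request.Request:
    """Build the POST /ui/click_element request for a control's accessibility name."""
    return urllib.request.Request(
        f"{EYEHANDS}/ui/click_element",
        data=json.dumps({"name": name}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def click_element(name: str) -> dict:
    """Send the click and return the server's JSON response."""
    with urllib.request.urlopen(build_click_request(name)) as resp:
        return json.load(resp)
```

`click_element("OK")` asks the accessibility tree for the button and clicks it; no pixels are ever shipped to the model.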

Rule 2: Fall back to OCR

GET /find?text=OK runs EasyOCR against the most recent cached screen frame and returns the pixel coordinates of any matching text.

This works whenever the element has visible text that OCR can read. Games with text HUDs, Electron apps with broken UIA, web apps, PDF viewers, dashboards. The OCR cache means the second /find call on an unchanged screen is instant — EasyOCR’s 3-second model load is paid once at startup, not on every call.

When it fails: icons with no text, very small fonts, heavily obfuscated UIs.
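A matching sketch for the OCR path (again standard library only; the query-string shape comes from the `/find?text=OK` example above):

```python
import json
import urllib.parse
import urllib.request

EYEHANDS = "http://127.0.0.1:7331"  # default eyehands server address


def build_find_url(text: str) -> str:
    """URL for GET /find: OCR lookup against the cached screen frame."""
    return f"{EYEHANDS}/find?{urllib.parse.urlencode({'text': text})}"


def find_text(text: str) -> dict:
    """Return the server's JSON answer for matching on-screen text."""
    with urllib.request.urlopen(build_find_url(text)) as resp:
        return json.load(resp)
```

`urlencode` also handles text with spaces, so `find_text("Sign in")` works without manual escaping.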

Rule 3: Screenshots are the last resort

If UIA and OCR both fail, then take a screenshot. GET /latest returns the most recent frame from the background buffer as JPEG. Claude can send that to its vision model and guess coordinates.

The reason screenshots are last: they’re 3–5× more expensive in tokens than /ui or /find, and vision-model coordinate guessing is less reliable than programmatic lookup.
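Put together, the three rules form a short fallback chain. Here is one way to sketch it (the transport is injectable so the logic can be exercised without a live server; the `x`/`y` fields in the `/find` response are assumptions):

```python
import json
import urllib.request

EYEHANDS = "http://127.0.0.1:7331"  # default eyehands server address


def _http(method, path, payload=None):
    """Minimal JSON-over-HTTP helper for the eyehands server."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(EYEHANDS + path, data=data, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def click_with_fallback(name, call=_http):
    """Rule 1: UIA by accessibility name. Rule 2: OCR, then click at the
    returned coordinates. Rule 3: hand back the screenshot URL so the
    caller can fall through to the vision model."""
    try:
        return ("uia", call("POST", "/ui/click_element", {"name": name}))
    except Exception:
        pass
    try:
        hit = call("GET", f"/find?text={name}")  # assumed to include "x" and "y"
        return ("ocr", call("POST", "/click_at", {"x": hit["x"], "y": hit["y"]}))
    except Exception:
        pass
    return ("screenshot", EYEHANDS + "/latest")
```

The `call` parameter is only there to make the priority order testable; in practice the defaults are used.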

What Claude actually does with the skill loaded

With the eyehands skill installed, here’s what a typical interaction looks like when you tell Claude “close the Settings window”:

Four JSON calls. Zero screenshots. Total token cost: roughly 400 tokens in + 200 tokens out. The equivalent “screenshot everything” approach would be ~3000–4000 tokens per interaction.
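Concretely, the four calls might look like this. Everything here is illustrative: the endpoint names come from the quick reference, but the “Close” control name and the exact sequence Claude chooses are assumptions.

```python
# One plausible four-call sequence for "close the Settings window".
CALLS = [
    ("GET",  "/ui/windows",                    None),               # which windows are open?
    ("GET",  "/ui/tree?window_title=Settings", None),               # find the Close control
    ("POST", "/ui/click_element",              {"name": "Close"}),  # click it via UIA
    ("GET",  "/ui/windows",                    None),               # confirm Settings is gone
]

for method, path, payload in CALLS:
    print(method, path, payload or "")
```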

A real example: closing all browser tabs

Here’s a realistic prompt I’ve given Claude Code: “Close every tab in Chrome except the first one.”

Without eyehands, Claude would screenshot Chrome, send the image to its vision model, try to identify the X buttons on each tab, and send click coordinates. The vision model would get ~80% of them right and miss a few on the edges.

With eyehands and the skill loaded, Claude walks the UIA tree of Chrome’s window, finds all controls of type TabItemControl, and iterates — closing each one by its accessibility name. Zero screenshots, 100% accuracy.

# What Claude generates internally (roughly, as requests-style Python):
import requests  # assumes the third-party `requests` package

base = "http://127.0.0.1:7331"
tabs = requests.get(base + "/ui/find",
                    params={"window": "Chrome", "control_type": "TabItemControl"}).json()
for tab in tabs[1:]:  # skip the first tab; middle-click closes a Chrome tab
    requests.post(base + "/ui/click_element",
                  json={"name": tab["name"], "button": "middle"})

The quick reference

| You want Claude to… | Call |
| --- | --- |
| Click a button by name | `POST /ui/click_element` with `{"name": "..."}` |
| Find text on screen | `GET /find?text=...` |
| Click at coordinates | `POST /click_at` with `{"x": ..., "y": ...}` |
| Type text | `POST /type_text` with `{"text": "..."}` |
| Press a key combo | `POST /key` with `{"vk": ..., "modifiers": [...]}` |
| List open windows | `GET /ui/windows` |
| Get the full UIA tree of a window | `GET /ui/tree?window_title=...` |
| Take a screenshot | `GET /screenshot` |
| Wait for a screen change | `POST /click_and_wait` |
What to tell Claude when you prompt it

Even with the skill loaded, it helps to be explicit in your prompts:

“Use eyehands’ /ui/click_element or /find before resorting to screenshots.”

“Don’t take a screenshot unless you’ve already tried UIA and OCR and they failed.”

“If you need to know what’s on the screen, try /ui/tree first — it’s cheaper than a screenshot.”

These reminders become unnecessary after Claude has done a few successful interactions with the skill, but they’re worth putting in a CLAUDE.md file for the project so every new session starts right.
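For example, a project CLAUDE.md might carry the reminders like this (illustrative wording, not a required format):

```markdown
## Screen control (eyehands)

- The eyehands server runs at http://127.0.0.1:7331.
- Prefer `POST /ui/click_element` (UIA) or `GET /find` (OCR) over screenshots.
- Only take a screenshot after UIA and OCR have both failed.
- To see what's on screen, start with `GET /ui/tree`; it is cheaper than an image.
```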

Install

pip install eyehands
eyehands --install-skill
eyehands

*eyehands has a free tier (screen capture, mouse, keyboard, OCR) and a $19 one-time Pro tier (composite actions + UI Automation). The UI Automation endpoints are Pro — they're what makes the "click by name" pattern work. The free tier still handles the OCR and screenshot paths if you want to try before buying.*

Give Claude eyes and hands on Windows

eyehands is a local HTTP server for screen capture, mouse control, and keyboard input. Open source with a Pro tier.
