Using Windows UI Automation from Python without the COM pain

by Fireal Software

Windows UI Automation (UIA) is the accessibility tree Windows exposes for every native app. Screen readers use it. Accessibility testing tools use it. It’s also the cleanest way to find UI elements programmatically — no OCR, no pixel matching, no vision model. Just “find me the button named OK in the Settings window” and Windows tells you where it is.

The catch: the official way to access UIA is through COM, and the COM API is deeply unpleasant to use from Python. This post is about how to use UIA without that pain, and when to prefer it over alternatives.

What UIA exposes

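Every native Windows control is a node in the UIA tree with properties you can query, such as its name, control type, automation ID, bounding rectangle, and enabled state.

Controls also expose actions you can invoke through control patterns, for example Invoke on a button, Toggle on a checkbox, or SetValue on an edit field.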

The tree is hierarchical. Desktop → Application → Window → Control → Child controls. You can walk from the desktop down, or search by name/type at any level.
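
To make the walk-or-search idea concrete, here is a minimal sketch of a depth-limited search over a nested tree. The dict shape (name, control_type, children) and the find_nodes helper are assumptions for illustration, not the actual UIA or eyehands data model.

```python
from typing import Iterator, Optional

# Hypothetical tree shape: each node is a dict with "name", "control_type",
# and an optional "children" list. Real UIA nodes carry many more properties.
Node = dict


def find_nodes(node: Node, name: Optional[str] = None,
               control_type: Optional[str] = None,
               max_depth: int = 5, _depth: int = 0) -> Iterator[Node]:
    """Yield nodes matching name and/or control_type, depth-first."""
    name_ok = name is None or node.get("name") == name
    type_ok = control_type is None or node.get("control_type") == control_type
    if name_ok and type_ok:
        yield node
    # Only descend while we are above the depth limit.
    if _depth < max_depth:
        for child in node.get("children", []):
            yield from find_nodes(child, name, control_type, max_depth, _depth + 1)


desktop = {
    "name": "Desktop", "control_type": "PaneControl", "children": [
        {"name": "Settings", "control_type": "WindowControl", "children": [
            {"name": "Apps", "control_type": "ListItemControl"},
            {"name": "OK", "control_type": "ButtonControl"},
        ]},
    ],
}

matches = list(find_nodes(desktop, name="OK", control_type="ButtonControl"))
print([m["name"] for m in matches])  # ['OK']
```

The same function works from any level: pass the Settings window node instead of the desktop to restrict the search to that window.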

The painful way: uiautomation or comtypes

The “idiomatic” Python way is the uiautomation package (https://pypi.org/project/uiautomation/), which wraps the COM API. A typical script looks like this:

import uiautomation as auto

# Find the Settings window and click "Apps"
auto.InitializeUIAutomationInCurrentThread()
window = auto.WindowControl(Name="Settings")
window.SetActive()
apps = window.ListItemControl(Name="Apps")
apps.Click(simulateMove=False)

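This works. But you have to initialize UI Automation in every thread that touches it, balance each initialization with a matching uninitialize call when the thread exits, and never use a control object outside the thread that created it.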

For a standalone Python script it’s tolerable. For a long-running server that handles requests from multiple threads, it’s a footgun minefield.
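
One way to keep that per-thread bookkeeping balanced is a small context manager. The sketch below is illustrative only: uia_thread_scope and the fake_init / fake_uninit stand-ins are hypothetical names, chosen so the example runs on any platform. On Windows you would pass auto.InitializeUIAutomationInCurrentThread (used in the script above) and its uninitialize counterpart, assumed here to be auto.UninitializeUIAutomationInCurrentThread; check the version you have installed.

```python
import threading
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def uia_thread_scope(init: Callable[[], None],
                     uninit: Callable[[], None]) -> Iterator[None]:
    """Run init on entry and always run uninit on exit, in the same thread."""
    init()
    try:
        yield
    finally:
        uninit()


# Stand-ins so the sketch runs anywhere; they only record which thread called.
calls = []
calls_lock = threading.Lock()


def fake_init() -> None:
    with calls_lock:
        calls.append(("init", threading.get_ident()))


def fake_uninit() -> None:
    with calls_lock:
        calls.append(("uninit", threading.get_ident()))


def worker() -> None:
    with uia_thread_scope(fake_init, fake_uninit):
        pass  # create and use controls here; never hand them to another thread


threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(1 for kind, _ in calls if kind == "init"))    # 3
print(sum(1 for kind, _ in calls if kind == "uninit"))  # 3
```

The try/finally is the important part: an exception inside the block still triggers the uninitialize call, so the counts stay balanced.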

The eyehands way: HTTP over UIA

eyehands wraps all of this behind HTTP endpoints. You don’t touch COM. You don’t handle thread initialization. You don’t deal with apartment models. You call GET /ui/find?name=OK and get back JSON.

TOKEN=$(cat .eyehands-token)

# Find the OK button in any window
curl -H "Authorization: Bearer $TOKEN" \
  "http://127.0.0.1:7331/ui/find?name=OK"
# {"ok": true, "matches": [{"name": "OK", "control_type": "ButtonControl",
#                           "rect": [380, 230, 430, 260],
#                           "center": [405, 245],
#                           "is_enabled": true}]}

# Click it
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "OK"}' \
  "http://127.0.0.1:7331/ui/click_element"
# {"ok": true}
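
If you are calling from Python rather than a shell, the response is easy to consume with the standard library. The sketch below parses the sample /ui/find body shown above and pulls out a click target; first_enabled_center is a hypothetical helper, and the field names are taken from that sample only.

```python
import json
from typing import Optional, Tuple

# Sample body copied from the /ui/find response shown above.
SAMPLE = '''{"ok": true, "matches": [{"name": "OK", "control_type": "ButtonControl",
  "rect": [380, 230, 430, 260], "center": [405, 245], "is_enabled": true}]}'''


def first_enabled_center(body: str) -> Optional[Tuple[int, int]]:
    """Return the center of the first enabled match, or None if there is none."""
    data = json.loads(body)
    if not data.get("ok"):
        return None
    for match in data.get("matches", []):
        if match.get("is_enabled"):
            x, y = match["center"]
            return (x, y)
    return None


print(first_enabled_center(SAMPLE))  # (405, 245)
```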

Under the hood, eyehands loads the uiautomation package lazily (first call to any /ui/* endpoint), initializes COM once in the thread that’s handling the request (via Handler.setup() / finish() per-thread hooks on the HTTP handler), and balances CoInitialize / CoUninitialize automatically.
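
The setup() / finish() pattern can be sketched with the standard library alone. This is not eyehands’ actual code: UiaScopedHandler, com_init, and com_uninit are hypothetical stand-ins that only record events. What is real is the lifecycle: socketserver.BaseRequestHandler calls setup(), then handle(), then finish() for each request, and http.server’s BaseHTTPRequestHandler inherits those hooks. Under a threading server each request runs in its own thread, so the hooks bracket that thread’s work.

```python
import socketserver
import threading

# Stand-ins for the real COM calls; they record the calling thread's name.
events = []


def com_init() -> None:
    events.append(("init", threading.current_thread().name))


def com_uninit() -> None:
    events.append(("uninit", threading.current_thread().name))


class UiaScopedHandler(socketserver.BaseRequestHandler):
    """Request handler that brackets each request with init/uninit."""

    def setup(self) -> None:
        com_init()  # runs before handle(), in the handling thread

    def handle(self) -> None:
        events.append(("handle", threading.current_thread().name))

    def finish(self) -> None:
        com_uninit()  # runs after handle(), even if handle() raised


# BaseRequestHandler.__init__ drives setup -> handle -> finish itself, so
# constructing the handler with dummy arguments exercises the lifecycle.
UiaScopedHandler(request=None, client_address=("127.0.0.1", 0), server=None)
print([kind for kind, _ in events])  # ['init', 'handle', 'uninit']
```

In a real HTTP handler you would also call super().setup() and super().finish(), because the stream handler base class uses them to open and close the request’s file objects.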

The /ui/* endpoint set

eyehands has five UIA endpoints:

GET /ui/windows — list all top-level windows. Returns an array of window objects with title, class_name, handle, rect.

GET /ui/find — search for elements. Supports name, control_type, window_title, automation_id, depth, max_results. Returns matching elements with their coordinates.

GET /ui/at?x=...&y=... — get the element at specific screen coordinates. Useful for “what’s under the cursor”.

GET /ui/tree?window_title=...&depth=5 — get the full UIA tree of a window, up to depth levels deep. Useful for exploring unknown apps. Returns a nested JSON structure.

POST /ui/click_element — find and click in one call. Body: {"name": "...", "control_type": "...", "button": "left", "double": false}.

That’s it. These five endpoints cover most of what you’d do with the raw uiautomation package, and you never have to think about COM.
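
Since the four GET endpoints differ only in path and query parameters, a tiny URL builder covers them. This is a sketch: ui_url is a hypothetical helper, the host and port come from the curl examples, and the parameter names are the ones listed above.

```python
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:7331"  # host and port taken from the curl examples


def ui_url(endpoint: str, **params: object) -> str:
    """Build a /ui/* URL, dropping any parameter left as None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{BASE_URL}/ui/{endpoint}"
    return f"{url}?{query}" if query else url


print(ui_url("windows"))
print(ui_url("find", name="OK", control_type="ButtonControl", max_results=1))
print(ui_url("at", x=405, y=245))
print(ui_url("tree", window_title="Settings", depth=5))
```

urlencode handles escaping, so window titles with spaces or punctuation are safe to pass as-is.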

When UIA is the right tool

When UIA fails

The practical workflow

For any new Windows app you’re automating:

Start with GET /ui/tree?window_title=YourApp&depth=5. This gives you a JSON dump of every control in the app. Scroll through it, find the controls you care about, note their names and types.

For each target, use GET /ui/find?name=...&control_type=.... Confirm the element is reachable.

Use POST /ui/click_element to click. Pass the same name/control_type you used to find it.

If UIA doesn’t see the element, fall back to /find?text=... (OCR). This covers custom-rendered controls that never show up in the UIA tree.

If OCR also fails, fall back to screenshots. Last resort, not first.
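
The priority order in the steps above can be sketched as a simple fallback chain. Everything here is hypothetical scaffolding: locate, the via_* functions, and their canned results stand in for real calls to /ui/find, /find?text=..., and a screenshot-based approach.

```python
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[int, int]
Locator = Callable[[str], Optional[Point]]


def locate(target: str,
           strategies: Sequence[Tuple[str, Locator]]) -> Optional[Tuple[str, Point]]:
    """Try each strategy in priority order; return (label, point) for the first hit."""
    for label, strategy in strategies:
        point = strategy(target)
        if point is not None:
            return label, point
    return None


# Hypothetical stand-ins with canned answers so the sketch runs offline.
def via_uia(target: str) -> Optional[Point]:
    return (405, 245) if target == "OK" else None


def via_ocr(target: str) -> Optional[Point]:
    return (120, 80) if target == "Custom canvas label" else None


def via_screenshot(target: str) -> Optional[Point]:
    return None


PRIORITY = [("uia", via_uia), ("ocr", via_ocr), ("screenshot", via_screenshot)]

print(locate("OK", PRIORITY))                   # ('uia', (405, 245))
print(locate("Custom canvas label", PRIORITY))  # ('ocr', (120, 80))
print(locate("Missing", PRIORITY))              # None
```

Returning the strategy label alongside the point makes it easy to log how often you are falling through to the slower methods.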

This workflow is what eyehands’ packaged SKILL.md teaches Claude Code. Once it’s installed, Claude follows the priority order without you prompting it.

A worked example: find and click “OK” in any dialog

TOKEN=$(cat .eyehands-token)

# Get all buttons named OK anywhere on screen
curl -H "Authorization: Bearer $TOKEN" \
  "http://127.0.0.1:7331/ui/find?name=OK&control_type=ButtonControl"

# Click the first one
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "OK", "control_type": "ButtonControl"}' \
  "http://127.0.0.1:7331/ui/click_element"

Three commands. No COM. No initialization. No thread management. This is the main reason eyehands exists: not to replace uiautomation, but to wrap it in a shape AI agents can actually use.
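
The click call translates to Python just as directly. The sketch below builds the POST request without sending it, so it runs without a server; click_request and the placeholder token are hypothetical, while the URL, headers, and body fields mirror the curl example and the /ui/click_element description.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "http://127.0.0.1:7331"  # host and port taken from the curl examples


def click_request(token: str, name: str, control_type: Optional[str] = None,
                  button: str = "left", double: bool = False) -> Request:
    """Build (but do not send) the POST /ui/click_element request."""
    body = {"name": name, "button": button, "double": double}
    if control_type is not None:
        body["control_type"] = control_type
    return Request(
        f"{BASE_URL}/ui/click_element",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )


req = click_request("example-token", "OK", control_type="ButtonControl")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With eyehands running, you would read the real token from .eyehands-token and pass the request to urllib.request.urlopen.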

Install

pip install "eyehands[ui]"     # includes the uiautomation dep for /ui/* endpoints; quotes keep shells like zsh from expanding the brackets
eyehands --install-skill
eyehands

Microsoft ships a tool called Inspect.exe (part of the Windows SDK) that lets you hover over any control and see its UIA properties live. If you’re exploring an unfamiliar app, it’s worth installing the SDK just for this.

Give Claude eyes and hands on Windows

eyehands is a local HTTP server for screen capture, mouse control, and keyboard input. Open source with a Pro tier.
