Using Windows UI Automation from Python without the COM pain
by Fireal Software · ~9 min read
Windows UI Automation (UIA) is the accessibility tree Windows exposes for every native app. Screen readers use it. Accessibility testing tools use it. It’s also the cleanest way to find UI elements programmatically — no OCR, no pixel matching, no vision model. Just “find me the button named OK in the Settings window” and Windows tells you where it is.
The catch: the official way to access UIA is through COM, and the COM API is deeply unpleasant to use from Python. This post is about how to use UIA without that pain, and when to prefer it over alternatives.
What UIA exposes
Every native Windows control is a node in the UIA tree with properties you can query:
- `Name`: the accessible name (“OK”, “Save”, “File menu”)
- `ControlType`: `ButtonControl`, `EditControl`, `ListControl`, `TabItemControl`, etc.
- `AutomationId`: a developer-assigned ID (often the internal control name)
- `BoundingRectangle`: the control’s position on screen
- `Value`: the current value (for text fields and sliders)
- `IsEnabled`, `IsOffscreen`, `IsKeyboardFocusable`: state
And every control has methods you can invoke:
- `Click()`: programmatic click (when supported by the control pattern)
- `SendKeys()`: typing into the control
- `SetValue()`: setting a text field
- `Select()`: selecting a list item
The tree is hierarchical. Desktop → Application → Window → Control → Child controls. You can walk from the desktop down, or search by name/type at any level.
The painful way: uiautomation-python or comtypes
The “idiomatic” Python way is the uiautomation package (https://pypi.org/project/uiautomation/), which wraps the COM API:
import uiautomation as auto
# Find the Settings window and click "Apps"
auto.InitializeUIAutomationInCurrentThread()
window = auto.WindowControl(Name="Settings")
window.SetActive()
apps = window.ListItemControl(Name="Apps")
apps.Click(simulateMove=False)
This works. But you have to:
- Initialize COM per thread. Call `InitializeUIAutomationInCurrentThread` on every thread that wants to use UIA. Miss this and you get cryptic `CoInitialize` errors.
- Uninitialize on shutdown. Call `UninitializeUIAutomationInCurrentThread` on thread exit or you’ll leak.
- Handle the COM apartment model. STA vs MTA matters. Call the wrong initialization and you get deadlocks.
- Wrap every call in try/except. COM errors surface as Python exceptions with hex error codes you have to look up.
- Deal with late binding. Many UIA methods are only available if the control supports the relevant “pattern”. `Click()` requires the control to support `InvokePattern`. If it doesn’t, the method doesn’t exist and you get `AttributeError`.
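The init/uninit balancing, at least, can be captured in a small context manager. This is a sketch, not eyehands code; on Windows the two callables would be uiautomation's `InitializeUIAutomationInCurrentThread` and `UninitializeUIAutomationInCurrentThread`:

```python
import contextlib


@contextlib.contextmanager
def com_scope(initialize, uninitialize):
    """Balance COM init/uninit on the current thread.

    On Windows, pass uiautomation.InitializeUIAutomationInCurrentThread
    and UninitializeUIAutomationInCurrentThread. The try/finally
    guarantees the uninit call runs even if the body raises.
    """
    initialize()
    try:
        yield
    finally:
        uninitialize()


# Hypothetical usage on Windows, inside any worker thread:
# with com_scope(auto.InitializeUIAutomationInCurrentThread,
#                auto.UninitializeUIAutomationInCurrentThread):
#     auto.WindowControl(Name="Settings").SetActive()
```

This only fixes the leak problem; the apartment model and pattern-support pitfalls remain yours to handle.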
For a standalone Python script it’s tolerable. For a long-running server that handles requests from multiple threads, it’s a minefield of footguns.
The eyehands way: UIA over HTTP
eyehands wraps all of this behind HTTP endpoints. You don’t touch COM. You don’t handle thread initialization. You don’t deal with apartment models. You call GET /ui/find?name=OK and get back JSON.
TOKEN=$(cat .eyehands-token)
# Find the OK button in any window
curl -H "Authorization: Bearer $TOKEN" \
"http://127.0.0.1:7331/ui/find?name=OK"
# {"ok": true, "matches": [{"name": "OK", "control_type": "ButtonControl",
# "rect": [380, 230, 430, 260],
# "center": [405, 245],
# "is_enabled": true}]}
# Click it
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "OK"}' \
"http://127.0.0.1:7331/ui/click_element"
# {"ok": true}
Under the hood, eyehands loads the uiautomation package lazily (first call to any /ui/* endpoint), initializes COM once in the thread that’s handling the request (via Handler.setup() / finish() per-thread hooks on the HTTP handler), and balances CoInitialize / CoUninitialize automatically.
The /ui/* endpoint set
eyehands has five UIA endpoints:
- `GET /ui/windows` — list all top-level windows. Returns an array of window objects with title, class_name, handle, rect.
- `GET /ui/find` — search for elements. Supports name, control_type, window_title, automation_id, depth, max_results. Returns matching elements with their coordinates.
- `GET /ui/at?x=...&y=...` — get the element at specific screen coordinates. Useful for “what’s under the cursor”.
- `GET /ui/tree?window_title=...&depth=5` — get the full UIA tree of a window, up to depth levels deep. Useful for exploring unknown apps. Returns a nested JSON structure.
- `POST /ui/click_element` — find and click in one call. Body: `{"name": "...", "control_type": "...", "button": "left", "double": false}`.
That’s it. Five endpoints cover 95% of what you’d do with the raw uiautomation package, and you never have to think about COM.
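From Python, a thin stdlib wrapper is all you need. A sketch, not an official client: the endpoint paths and query parameters come from this post, while the defaults and method names here are my own:

```python
import json
import urllib.parse
import urllib.request


class EyehandsUI:
    """Minimal stdlib client for the /ui/* endpoints described above."""

    def __init__(self, token, base="http://127.0.0.1:7331"):
        self.base = base
        self.headers = {"Authorization": f"Bearer {token}"}

    def _url(self, path, **params):
        # Drop None values so optional filters can simply be omitted.
        query = urllib.parse.urlencode(
            {k: v for k, v in params.items() if v is not None})
        return self.base + path + ("?" + query if query else "")

    def _get(self, path, **params):
        req = urllib.request.Request(self._url(path, **params),
                                     headers=self.headers)
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def windows(self):
        return self._get("/ui/windows")

    def find(self, name=None, control_type=None, window_title=None):
        return self._get("/ui/find", name=name, control_type=control_type,
                         window_title=window_title)

    def at(self, x, y):
        return self._get("/ui/at", x=x, y=y)

    def tree(self, window_title, depth=5):
        return self._get("/ui/tree", window_title=window_title, depth=depth)

    def click_element(self, **body):
        req = urllib.request.Request(
            self._url("/ui/click_element"),
            data=json.dumps(body).encode(),
            headers={**self.headers, "Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
```

With the server running, `EyehandsUI(token).click_element(name="OK")` is the whole round trip.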
When UIA is the right tool
- Native Windows apps. Notepad, File Explorer, Settings, Control Panel, most WinForms/WPF apps — all have solid UIA trees.
- Most Electron apps. Chromium exposes accessibility through UIA, so Electron apps (VS Code, Slack, Discord) can be walked.
- Browser controls. The browser chrome itself (address bar, tabs, buttons) is in the UIA tree. Web page content isn’t — that’s the DOM.
- Installer wizards. MSI installers and setup.exe apps usually expose UIA.
When UIA fails
- Games. DirectX/OpenGL/Vulkan-rendered content isn’t in the accessibility tree. Use OCR via `/find`, or screenshots.
- Canvas-rendered web apps. If a web app renders its UI into a `<canvas>` element, UIA sees a single control. Use OCR.
- Apps that deliberately disable UIA. Some banking apps and games disable accessibility to prevent automation. You can’t work around this with UIA; you’d need OCR or vision.
- Legacy Win16/GDI apps. The old, old stuff. UIA support depends on whether the app uses Windows Common Controls or its own rendering.
The practical workflow
For any new Windows app you’re automating:
Start with GET /ui/tree?window_title=YourApp&depth=5. This gives you a JSON dump of every control in the app. Scroll through it, find the controls you care about, note their names and types.
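That nested JSON is easier to scroll as an indented outline. A minimal sketch: the field names `name`, `control_type`, and `children` are assumptions about the response shape, so adjust them to what your /ui/tree actually returns:

```python
def tree_lines(node, indent=0):
    """Flatten the nested /ui/tree JSON into indented one-line entries.

    Assumes each node is a dict with optional 'name', 'control_type',
    and 'children' keys (hypothetical field names, not a documented schema).
    """
    label = f"{'  ' * indent}{node.get('control_type', '?')} {node.get('name', '')!r}"
    lines = [label]
    for child in node.get("children", []):
        lines.extend(tree_lines(child, indent + 1))
    return lines


# print("\n".join(tree_lines(tree_json)))  # one control per line, indented by depth
```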
For each target, use GET /ui/find?name=...&control_type=.... Confirm the element is reachable.
Use POST /ui/click_element to click. Pass the same name/control_type you used to find it.
If UIA doesn’t see the element, fall back to /find?text=... (OCR). This covers the custom-rendered controls.
If OCR also fails, fall back to screenshots. Last resort, not first.
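The priority order above can be expressed as a tiny orchestrator. The wiring here is hypothetical: each finder callable would wrap the corresponding eyehands request (GET /ui/find, the OCR-based GET /find?text=..., and a screenshot capture) and return a match or None; only the fallback order comes from this post:

```python
def locate(target, via_uia, via_ocr, via_screenshot):
    """Resolve a UI target: UIA first, then OCR, then raw pixels.

    Each finder takes the target description and returns a match
    (e.g. a click point) or None. The screenshot callable is the
    last resort and always "succeeds" by handing back the pixels.
    """
    for strategy, finder in (("uia", via_uia), ("ocr", via_ocr)):
        match = finder(target)
        if match is not None:
            return strategy, match
    # Last resort: raw pixels for a human or a vision model to interpret.
    return "screenshot", via_screenshot()
```

An agent loop calls `locate(...)` once per target and branches on which strategy answered, which keeps the cheap structured lookups ahead of the expensive visual ones.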
This workflow is what eyehands’ packaged SKILL.md teaches Claude Code. Once it’s installed, Claude follows the priority order without you prompting it.
A worked example: find and click “OK” in any dialog
TOKEN=$(cat .eyehands-token)
# Get all buttons named OK anywhere on screen
curl -H "Authorization: Bearer $TOKEN" \
"http://127.0.0.1:7331/ui/find?name=OK&control_type=ButtonControl"
# Click the first one
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "OK", "control_type": "ButtonControl"}' \
"http://127.0.0.1:7331/ui/click_element"
Three commands. No COM. No initialization. No thread management. This is the main reason eyehands exists — not to replace uiautomation, but to wrap it in a shape AI agents can actually use.
Install
pip install "eyehands[ui]"  # quoted so the brackets survive zsh; includes the uiautomation dep for /ui/* endpoints
eyehands --install-skill
eyehands
Links
- eyehands repo: https://github.com/shameindemgg/eyehands
- `uiautomation` Python package: https://pypi.org/project/uiautomation/
- Microsoft UI Automation reference: https://learn.microsoft.com/en-us/windows/win32/winauto/entry-uiauto-win32
- Inspect.exe (Microsoft’s UIA inspection tool): https://learn.microsoft.com/en-us/windows/win32/winauto/inspect-objects
*Microsoft ships a tool called `Inspect.exe` (part of the Windows SDK) that lets you hover over any control and see its UIA properties live. If you're exploring an unfamiliar app, it's worth installing the SDK just for this.*
Give Claude eyes and hands on Windows
eyehands is a local HTTP server for screen capture, mouse control, and keyboard input. Open source with a Pro tier.
Try eyehands