Frame-hash polling: zero-token screen watching for AI agents

Deep dives · ~8 min read · 11 April 2026

by Fireal Software

There’s a class of agent task that looks like “watch the screen and let me know when something changes”. Build finished? Tests passed? Upload done? Dialog appeared? The naive way to do this is to screenshot repeatedly and send each image to Claude for analysis. The cost adds up fast — a 2-minute wait at one screenshot per second is 120 images × ~1500 tokens each = 180,000 tokens just to sit and watch.
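The arithmetic is easy to parametrize. Here is a small sketch that takes the ~1500 tokens-per-image figure from this post as an assumption (real per-image costs vary with resolution and model); the function name is illustrative only:

```python
def naive_watch_tokens(wait_seconds: float,
                       shots_per_second: float = 1.0,
                       tokens_per_image: int = 1500) -> int:
    """Tokens spent if every screenshot taken during the wait is sent to the model."""
    images = int(wait_seconds * shots_per_second)
    return images * tokens_per_image


print(naive_watch_tokens(120))  # 2-minute wait -> 180000
print(naive_watch_tokens(180))  # 3-minute wait -> 270000
```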

There’s a better way. eyehands’ /latest endpoint supports HTTP ETag / If-None-Match semantics via an X-Frame-Hash header. An agent can poll it as fast as it wants and pay zero image tokens for every frame that hasn’t changed.

This post is about how that works and how to use it.

The HTTP pattern

Here’s the flow:

# First call — no hash yet
curl -i "http://127.0.0.1:7331/latest"
# HTTP/1.1 200 OK
# X-Frame-Hash: 1728931847123
# Content-Type: image/jpeg
# [binary JPEG data]

# Subsequent call with the hash
curl -i -H 'If-None-Match: 1728931847123' "http://127.0.0.1:7331/latest"
# HTTP/1.1 304 Not Modified
# X-Frame-Hash: 1728931847123
# [empty body]

# When the screen changes, the hash updates
curl -i -H 'If-None-Match: 1728931847123' "http://127.0.0.1:7331/latest"
# HTTP/1.1 200 OK
# X-Frame-Hash: 1728931851456    <-- new hash
# [new JPEG data]

The agent’s loop:

1. Fetch /latest, save the X-Frame-Hash
2. Loop: fetch /latest with If-None-Match: <last_hash>
   - 304 → nothing changed, keep waiting
   - 200 → screen changed, analyze the new frame
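To make the exchange concrete, here is a rough sketch of the decision a server has to make for each request. It is not taken from the eyehands source; `conditional_response` and its arguments are hypothetical names, and it only mirrors the three curl calls shown earlier.

```python
from typing import Dict, Optional, Tuple


def conditional_response(current_hash: str, frame: bytes,
                         if_none_match: Optional[str]
                         ) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET carrying an optional If-None-Match."""
    headers = {"X-Frame-Hash": current_hash}
    if if_none_match is not None and if_none_match == current_hash:
        # Client already holds this frame: send headers only, no image bytes.
        return 304, headers, b""
    headers["Content-Type"] = "image/jpeg"
    return 200, headers, frame


# First call, no hash yet -> 200 with the frame
print(conditional_response("1728931847123", b"jpeg-1", None)[0])
# Same hash sent back -> 304 with an empty body
print(conditional_response("1728931847123", b"jpeg-1", "1728931847123")[0])
# Hash has moved on -> 200 with the new frame
print(conditional_response("1728931851456", b"jpeg-2", "1728931847123")[0])
```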

Why this is zero-cost

The 304 response is a status line, a few short headers, and an empty body. The agent sends a tiny GET request and receives a tiny 304 back. No image bytes cross the wire, and no image tokens are billed to Claude. You can poll at 10 Hz for minutes and it costs effectively nothing.

Only when the screen actually changes does the agent get a real JPEG back — and at that point, it should pay the image tokens, because something new is happening and the agent needs to decide what to do.

What “change” means

The frame hash isn’t a content hash. It’s a monotonically increasing timestamp that updates every time the FrameBuffer commits a new frame. The background capture thread runs at 20 fps; every 50 ms, a new frame is grabbed from the DXGI/mss backend and the hash bumps.

So “change” in this system means “at least 50 ms have passed and the FrameBuffer got a new frame”. It’s technically not detecting visual change; it’s detecting time passing. If the screen is literally static (like a locked computer), the frame buffer will still commit new frames with new hashes. A 200 from /latest therefore tells you a new frame was committed, not that any pixel differs.
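The difference between a timestamp hash and a content hash is easiest to see side by side. The class below is a hypothetical stand-in for the FrameBuffer described above, not its actual code, and the injected clock is an assumption made so the example is deterministic.

```python
import hashlib
import itertools


class SketchFrameBuffer:
    """Hypothetical buffer whose hash is the commit timestamp, as described above."""

    def __init__(self, clock):
        self._clock = clock      # callable returning a millisecond timestamp
        self.frame = b""
        self.frame_hash = 0

    def commit(self, frame: bytes) -> None:
        self.frame = frame
        self.frame_hash = self._clock()  # bumps on every commit, even for identical pixels


def content_hash(frame: bytes) -> str:
    """A true content hash: identical bytes always give the same digest."""
    return hashlib.sha256(frame).hexdigest()


# Fake clock that advances 50 ms per commit, matching a 20 fps capture loop.
fake_clock = itertools.count(1728931847123, 50).__next__
buf = SketchFrameBuffer(fake_clock)

buf.commit(b"same pixels")
first_hash, first_digest = buf.frame_hash, content_hash(buf.frame)
buf.commit(b"same pixels")
second_hash, second_digest = buf.frame_hash, content_hash(buf.frame)

print(second_hash - first_hash)       # 50: the timestamp hash moved
print(first_digest == second_digest)  # True: the content did not
```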

There’s a way to get actual visual-change detection too: POST /click_and_wait compares the pre- and post-action pixel buffers and returns {"changed": true/false}. That’s content-level, not timestamp-level. Use it for “did my click actually do something?” style checks.
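For a sense of what content-level comparison involves, here is a deliberately simple sketch over raw pixel buffers. How /click_and_wait actually compares frames is not described in this post, so `frames_differ` and its `min_changed_bytes` threshold are illustrative assumptions only.

```python
def frames_differ(before: bytes, after: bytes, min_changed_bytes: int = 1) -> bool:
    """Report whether two raw pixel buffers differ in at least min_changed_bytes positions."""
    if len(before) != len(after):
        return True  # a resolution or format change counts as a change
    changed = sum(a != b for a, b in zip(before, after))
    return changed >= min_changed_bytes


before = bytes([0, 0, 0, 255, 255, 255])
after = bytes([0, 0, 0, 255, 0, 255])

print(frames_differ(before, before))  # False
print(frames_differ(before, after))   # True
```

A threshold above 1 is one way to ignore tiny differences such as a blinking cursor, at the cost of missing very small real changes.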

Use case: waiting for a build

Here’s a realistic agent prompt: “Run npm run build in this terminal and let me know when it finishes.”

With eyehands, the agent:

1. Types npm run build into the terminal
2. Fetches /latest and saves the hash
3. Polls /latest with If-None-Match every 2 seconds
4. When a 200 comes back, OCRs the new frame for “Build succeeded” or “Error”
5. Reports back to you

Total token cost for a 3-minute build: ~5 OCR calls (JSON only) + 1 image of the final state. Probably ~2000 tokens total. The naive “screenshot every second and send to Claude” approach would be ~270,000 tokens.
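The steps above can be sketched as a loop with the HTTP call and the OCR step injected as functions, which keeps the example runnable without a server. The `poll` and `ocr` callables are hypothetical wrappers you would write yourself (for instance around GET /latest and whatever OCR you use); nothing here is an eyehands API.

```python
import time
from typing import Callable, Iterable, Optional, Tuple

# poll(last_hash) -> (status_code, frame_hash, jpeg_bytes); hypothetical wrapper around GET /latest
Poll = Callable[[Optional[str]], Tuple[int, Optional[str], bytes]]


def wait_for_text(poll: Poll, ocr: Callable[[bytes], str], targets: Iterable[str],
                  interval: float = 2.0, timeout: float = 300.0,
                  sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Poll until OCR of a new frame contains one of targets; None on timeout."""
    targets = list(targets)
    deadline = time.monotonic() + timeout
    status, last_hash, frame = poll(None)
    while True:
        if status == 200:  # only new frames are OCRed; 304s cost nothing
            text = ocr(frame)
            match = next((t for t in targets if t in text), None)
            if match is not None:
                return match
        if time.monotonic() >= deadline:
            return None
        sleep(interval)
        status, new_hash, frame = poll(last_hash)
        if status == 200:
            last_hash = new_hash


# Scripted stand-ins so the loop can run without a server or an OCR engine.
responses = iter([
    (200, "1", b"compiling..."),
    (304, "1", b""),
    (304, "1", b""),
    (200, "2", b"Build succeeded"),
])
result = wait_for_text(
    poll=lambda last_hash: next(responses),
    ocr=lambda frame: frame.decode(),
    targets=["Build succeeded", "Error"],
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(result)  # Build succeeded
```

The timeout matters in practice: a build that hangs should not leave the agent polling forever.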

Use case: waiting for a dialog

“Open Settings, change my default browser, and click Yes if a UAC prompt appears.”

1. Agent opens Settings via /ui/click_element on the Start menu
2. Navigates to the default browser pane (all UIA, zero images)
3. Clicks “Set default” (UIA)
4. Starts polling /latest with If-None-Match
5. When the screen changes (UAC dialog appears), the agent gets a 200 and can OCR for “Yes” or “Allow”
6. Clicks the button via UIA once it’s enumerable

The polling step during the “waiting for UAC” period is effectively free. Agent just sits there sending GETs and receiving 304s until the dialog pops.

Implementing it in your own agent

If you’re wiring Claude Code or another agent to eyehands manually (not using the packaged SKILL.md), here’s the minimal loop:

import time
from pathlib import Path

import requests

BASE_URL = "http://127.0.0.1:7331"
TOKEN = Path(".eyehands-token").read_text().strip()
headers = {"Authorization": f"Bearer {TOKEN}"}

# Get the initial frame and remember its hash
r = requests.get(f"{BASE_URL}/latest", headers=headers, timeout=5)
r.raise_for_status()
current_hash = r.headers["X-Frame-Hash"]

# Poll until the server reports a new frame
while True:
    time.sleep(0.5)
    r = requests.get(
        f"{BASE_URL}/latest",
        headers={**headers, "If-None-Match": current_hash},
        timeout=5,
    )
    if r.status_code == 304:
        continue  # nothing changed, keep waiting
    r.raise_for_status()  # surface auth or server errors instead of looping forever
    current_hash = r.headers["X-Frame-Hash"]
    # Do something with r.content (the new JPEG)
    break

The eyehands SKILL.md teaches Claude Code to do this automatically when the prompt involves “wait for”, “watch for”, or “when X happens”. You don’t have to prompt the If-None-Match pattern explicitly — it’s in the skill.

The since= query parameter

As an alternative to If-None-Match, /latest?since=<hash> passes the same hash as a query parameter. This is useful for clients that find it awkward to set request headers:

curl "http://127.0.0.1:7331/latest?since=1728931847123"
# 304 if current hash matches, 200 with new image if it differs

Both paths do the same thing — pick whichever is more convenient for your HTTP client.
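If you are building the URL in code rather than by hand, a small helper keeps the two forms in one place. This is a sketch using only the standard library; `latest_url` is a hypothetical helper name, and only the since parameter shown above is assumed to exist.

```python
from typing import Optional, Union
from urllib.parse import urlencode


def latest_url(base_url: str, since: Optional[Union[int, str]] = None) -> str:
    """Build the /latest URL, adding ?since=<hash> when a previous hash is known."""
    url = f"{base_url.rstrip('/')}/latest"
    if since is not None:
        url += "?" + urlencode({"since": since})
    return url


print(latest_url("http://127.0.0.1:7331"))
# http://127.0.0.1:7331/latest
print(latest_url("http://127.0.0.1:7331", since=1728931847123))
# http://127.0.0.1:7331/latest?since=1728931847123
```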

The /latest_b64 variant

If you’re calling from a context that can’t handle binary responses (some shells, some JSON-only transports), /latest_b64 returns the same data as a JSON response with data base64-encoded. The frame_hash field in the JSON lets you pass it to the next request. This is what Claude Code tends to use when the Bash tool is the transport.

curl "http://127.0.0.1:7331/latest_b64"
# {"ok": true, "frame_hash": 1728931847123, "data": "base64...",
#  "cursor_x": 500, "cursor_y": 300, ...}
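On the client side, handling that response comes down to parsing the JSON and decoding the data field. A minimal sketch follows, assuming only the ok, frame_hash, and data fields shown above; `parse_latest_b64` is a hypothetical helper, and the sample payload is fabricated for the demo rather than captured from a real server.

```python
import base64
import json
from typing import Tuple


def parse_latest_b64(body: str) -> Tuple[int, bytes]:
    """Extract (frame_hash, jpeg_bytes) from a /latest_b64-style JSON body."""
    payload = json.loads(body)
    if not payload.get("ok"):
        raise ValueError("server reported a failure")
    return payload["frame_hash"], base64.b64decode(payload["data"])


# Fabricated payload in the shape shown above.
sample = json.dumps({
    "ok": True,
    "frame_hash": 1728931847123,
    "data": base64.b64encode(b"\xff\xd8fake-jpeg").decode("ascii"),
})
frame_hash, jpeg = parse_latest_b64(sample)
print(frame_hash)  # 1728931847123
print(jpeg[:2])    # b'\xff\xd8'
```

Keep the returned frame_hash around so the next request can send it back.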

The bottom line

Polling a screen for changes is a dumb pattern when you pay for every frame. With frame-hash polling and 304 Not Modified, it becomes nearly free — you only pay image tokens when something actually happens. This is the difference between “watching a build” costing $0.50 and costing $0.001.

Install

pip install eyehands
eyehands --install-skill
eyehands

Links

eyehands repo: https://github.com/shameindemgg/eyehands
MDN: If-None-Match header: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-None-Match
MDN: ETag: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag

Frame-hash polling is a small feature with a disproportionate cost impact. If you're building an agent that needs to wait for things to happen on screen, this is the difference between a workable product and a token-cost disaster.

Give Claude eyes and hands on Windows

eyehands is a local HTTP server for screen capture, mouse control, and keyboard input. Open source with a Pro tier.
