Building a local HTTP server for AI agent tool use

by Fireal Software · ~8 min read

When you want to give an AI agent (Claude Code, Cursor, a local LLM) the ability to do something beyond what its built-in tools support, you have a few architectural choices. You can ship a Python library the agent imports. You can spawn a subprocess for each action. You can write an MCP server. Or you can stand up a local HTTP server.

I chose the HTTP server path for eyehands, and I’d make the same choice again. This post is about why — and when it’s the wrong choice.

The four options

1. Python library the agent imports

The agent imports your module and calls functions directly. This is the most direct interface — no IPC, no serialization, no ports.

When it works: The agent is itself a Python process that can load your library. Some local LLM harnesses work this way.

When it doesn’t: Claude Code runs as a TypeScript Node.js process under the hood. It invokes tools via subprocess or through its own tool-use protocol. It can’t import a Python library. Same for most hosted agents.

2. CLI subprocess per action

The agent spawns your CLI tool, passes arguments, reads stdout, parses the result. Nearly every AI agent can do this (bash is the universal tool).

When it works: Actions are self-contained and stateless. Think ls, grep, curl.

When it doesn’t: Anything with startup cost. eyehands needs to load EasyOCR (~3 seconds), allocate a frame buffer, start DXGI capture, boot a Python interpreter. Doing that per-action would make every /find call take 3.5 seconds. For an agent making hundreds of calls in a session, this is unusable.

3. MCP server

Anthropic’s Model Context Protocol is the official answer to “how do I give Claude Code custom tools”. You write an MCP server that speaks JSON-RPC over stdio or HTTP, and Claude loads it into the session.

When it works: You’re on the Claude/Anthropic stack and your tool fits the MCP action model.

When it doesn’t: You want any agent to be able to use the tool, not just MCP-aware hosts. MCP is a good fit inside the Claude ecosystem, but the host has to implement the protocol. Anything that speaks HTTP can use eyehands: Claude Code, Cursor, a Python script, a bash one-liner, a Go CLI.

4. Local HTTP server

A process that runs on 127.0.0.1:PORT and exposes a REST API. Any agent that can make HTTP calls can use it.

When it works: You have persistent state, you want language-agnostic access, and you’re willing to run a separate process.

When it doesn’t: You want zero-install, zero-port, zero-config. An HTTP server is an extra process to launch and a port to manage, and that overhead never fully goes away.
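To make the shape concrete, here is a minimal sketch of a loopback-only tool server using Python's standard library. The /ping route is the one eyehands actually exposes; everything else here (the handler class, the JSON body) is illustrative, not eyehands' code.

```python
# Minimal sketch of a local tool server: binds to loopback only and
# answers /ping with JSON. Illustrative, not eyehands' implementation.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class ToolHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ping":
            body = json.dumps({"ok": True}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo quiet

def make_server(port=0):
    # 127.0.0.1 only: never expose the tool API beyond this machine
    return HTTPServer(("127.0.0.1", port), ToolHandler)
```

Binding to 127.0.0.1 rather than 0.0.0.0 is the first line of defense: the server is simply unreachable from other machines.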

Why HTTP won for eyehands

Three reasons.

Persistent state

eyehands has a background frame buffer capturing the screen at 20 fps. The first call to /find?text=OK triggers EasyOCR’s 3-second model load, caches it in memory, and returns a result in ~300ms. The second call on the same frame returns in <10ms because the OCR result is memoized against the frame hash.

There’s no way to get that with a subprocess-per-call model. Every invocation would pay the 3-second EasyOCR load. A long-running server amortizes that cost across the whole session.
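The memoization pattern is simple to sketch. Here `run_ocr` is a stand-in for the real EasyOCR call, and the cache key is a hash of the frame bytes, as described above:

```python
# Sketch of frame-hash memoization. run_ocr stands in for the real
# EasyOCR call; a real result would be OCR boxes, not a string.
import hashlib

_cache = {}            # frame hash -> cached OCR result
calls = {"ocr": 0}     # counts how often the expensive path runs

def run_ocr(frame_bytes: bytes) -> str:
    calls["ocr"] += 1
    return f"ocr-result-for-{len(frame_bytes)}-bytes"

def find_text(frame_bytes: bytes) -> str:
    key = hashlib.sha256(frame_bytes).hexdigest()
    if key not in _cache:      # pay the OCR cost once per distinct frame
        _cache[key] = run_ocr(frame_bytes)
    return _cache[key]
```

Two calls against the same frame pay for one OCR pass; only a changed frame triggers another.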

Language-agnostic access

Claude Code talks to eyehands with curl. So does Cursor. So does a Node.js script. So does a bash one-liner in a CI pipeline. So does my own Python debugging tools. They all hit http://127.0.0.1:7331 and everything just works.

If eyehands were a Python library, only Python agents could use it. If it were an MCP server, only MCP-aware hosts could use it. HTTP is the universal interface.
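As an example of how little a client needs, here is a minimal Python caller. The port, endpoint, and token file name come from this post; the helper names are mine:

```python
# Sketch of a client. Port, endpoint, and token file name are from the
# post; build_find_request and find are illustrative helpers.
import json
import pathlib
import urllib.parse
import urllib.request

def build_find_request(text, token, base="http://127.0.0.1:7331"):
    # A plain bearer-token GET; curl, Node, or Go would send the same thing.
    return urllib.request.Request(
        f"{base}/find?text={urllib.parse.quote(text)}",
        headers={"Authorization": f"Bearer {token}"},
    )

def find(text, base="http://127.0.0.1:7331"):
    # The server writes the token to .eyehands-token on first run.
    token = pathlib.Path(".eyehands-token").read_text().strip()
    with urllib.request.urlopen(build_find_request(text, token, base)) as resp:
        return json.load(resp)
```

The curl equivalent is one line; every language's version is equally boring, which is the point.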

Natural tool shape for agents

Agents already think in terms of HTTP calls. Claude Code has an HTTP request primitive. Every agent framework has one. “Make a POST to this URL with this JSON body” is a shape every agent already knows. There’s no new protocol to learn, no new tool definition to write — just REST.

The trade-offs

You’re running an extra process

The user has to run eyehands in a terminal. If the server isn’t running, the agent can’t use it. I considered having Claude auto-start it, but that felt wrong — the user should explicitly know what’s running on their machine.

Port conflicts

eyehands uses 7331. If something else is on that port, eyehands won’t start. I handled this with a --port flag and with an auto-kill-previous-instance behavior (if another eyehands is already running, it kills that one and takes the port).

Authentication

A local HTTP server is a privilege-escalation surface — any process on the same machine can send requests to it if you don’t authenticate. eyehands generates a random 32-byte bearer token on first run, saves it to .eyehands-token, and requires it on every endpoint except /ping. This prevents a random subprocess from accessing the API.
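A sketch of that provisioning-and-checking flow. The file name and the /ping exemption are from the post; the function names and details are my assumptions:

```python
# Sketch of token provisioning and per-request checking.
import secrets
from pathlib import Path

def load_or_create_token(path: Path) -> str:
    if not path.exists():
        path.write_text(secrets.token_hex(32))  # 32 random bytes, hex-encoded
    return path.read_text().strip()

def authorized(request_path: str, auth_header, token: str) -> bool:
    if request_path == "/ping":        # the one unauthenticated endpoint
        return True
    # constant-time comparison avoids leaking the token via timing
    return secrets.compare_digest(auth_header or "", f"Bearer {token}")
```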

There’s also a DNS rebinding attack surface: a malicious website can point its own domain at 127.0.0.1, after which the victim’s browser happily sends what it believes are same-origin requests to http://127.0.0.1:7331. eyehands validates the Host header on every request and rejects anything that isn’t exact-match loopback, which defeats the trick because the attacker’s hostname arrives in the Host header.

Single instance

You probably only want one eyehands running at a time — otherwise the frame buffer duplicates memory and you get confused about which instance has the token. eyehands enforces single-instance via a PID file and kills any previous instance on startup.
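Single-instance enforcement via PID file can be sketched like this; the signal choice and error handling are illustrative, not necessarily what eyehands does:

```python
# Sketch of PID-file single-instance takeover.
import os
import signal
from pathlib import Path

def claim_instance(pid_file: Path) -> None:
    if pid_file.exists():
        try:
            previous = int(pid_file.read_text())
            os.kill(previous, signal.SIGTERM)   # ask the old instance to exit
        except (ValueError, ProcessLookupError, PermissionError):
            pass                                # stale or unreadable PID file
    pid_file.write_text(str(os.getpid()))       # we own the instance now
```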

Design decisions I’m glad I made

Bearer token auth from day one

Skipping auth “for now” with a plan to “add it later” is a trap. I added bearer token auth in 1.0 and I’ve never regretted it. The attack surface is real: any local process can talk to an unauthenticated HTTP server.

Token to .eyehands-token file

I debated CLI flags vs env vars vs a token file. The file won: it survives restarts, any local client can read it without extra setup, and it never leaks into process listings the way a CLI flag does.

/ping exempt from auth

One unauthenticated endpoint for health checks. Claude can call /ping to check that the server is up before reading the token and making real calls. No security implications: /ping doesn’t expose anything useful.

Host header validation

A cheap defense against DNS rebinding attacks. One function: _host_header_ok(host) that exact-matches loopback literals. Took 30 minutes to write and covers a real attack class.

Frame-hash ETag on /latest

HTTP standards for the win. If-None-Match is something every HTTP client already knows how to send. Serving the frame hash as an X-Frame-Hash header and honoring standard ETag semantics means every language gets change polling for free.
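From the client side, conditional polling is a few lines in any language. A Python sketch, with the /latest endpoint from the post and the helper name mine:

```python
# Sketch of conditional polling against /latest: a 304 response means the
# frame hasn't changed since the ETag we sent.
import urllib.error
import urllib.request

def poll_latest(base, etag=None):
    req = urllib.request.Request(f"{base}/latest")
    if etag:
        req.add_header("If-None-Match", etag)   # "only if the frame changed"
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("ETag")   # new frame
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None, etag                   # unchanged since last poll
        raise
```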

Design decisions I’d revisit

Binding to a random port

I hardcoded 7331 because it’s memorable. But a random port with the number stored in .eyehands-port would be more robust against conflicts. Still considering this for 1.7.
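The random-port variant is cheap to sketch: bind to port 0, let the OS pick, and record the result where clients can find it. The .eyehands-port file name follows the post's suggestion; the rest is illustrative:

```python
# Sketch of the random-port alternative: port 0 lets the OS choose, and
# the chosen port is recorded for clients to discover.
import socket
from pathlib import Path

def bind_random_port(port_file: Path):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))        # port 0: OS assigns any free port
    port = sock.getsockname()[1]
    port_file.write_text(str(port))    # clients read this instead of hardcoding
    return sock, port
```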

Per-endpoint request schemas

I’m using ad-hoc JSON schemas per endpoint. A formal OpenAPI spec would be nicer for tooling but it’s a lot of ceremony for a small surface.

When HTTP is the wrong shape

You need <1ms latency. HTTP has 1-5ms overhead even on localhost. A Python library would be faster.

You’re on a platform without sockets. Not a real concern for desktop Windows but could be for embedded contexts.

You don’t have state to share across calls. Then a CLI is simpler and you don’t need the server.

You’re only targeting one agent framework. Then a library or native tool for that framework is probably better than a separate process.

For AI-agent desktop automation on Windows, HTTP wins on every dimension. I’d make the same call again.

Install

pip install eyehands
eyehands

*eyehands has been running as an HTTP server in my own setup for about six months. The architecture is boring in the best way — it just works, it's easy to reason about, and it plays nicely with every agent framework I've tried.*
