
Letting an AI Agent Modernize a Frozen Legacy Site, Safely

June 5, 2026 · ~1080 words · Safe AI agents on a live legacy site

You can hand a live, client-owned website to an AI coding agent, let it refactor the markup at speed, and have the client never see a regression you did not intend. The trick is not a smarter agent. It is six boundaries that make the agent's reach safe by construction, so "an agent edited my production site" goes from reckless to routine. I built this while modernizing coachscall.org, a real client site.

The job: a website nobody can safely read

coachscall.org is a frozen front end: a Gatsby static export of compiled HTML, CSS, and JavaScript, with the source long gone. It still needs changes, it is owned by a non-technical client, and it is served edit-in-place from the live document root, with no build step and no staging by default. Each page is 440 to 500 KB of inlined, machine-generated markup, full of sharp edges: en-spaces masquerading as regular spaces, the `&nbsp;` entity next to a literal U+00A0, drifting indentation, duplicated IDs, attribute casing the browser silently rewrites. Hand-editing it is archaeology, and every structural change risks a visual regression you will not notice until the client does.
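To make those sharp edges concrete, here is a minimal sketch of a scan that counts a few of them before any edit is attempted. The function name, the list of look-alike spaces, and the regex are illustrative assumptions, not the tooling used on the site, and the ID pattern only handles double-quoted attributes.

```python
import re
from collections import Counter

# Characters that render like an ordinary space but are different code points.
# This list is an assumption for illustration; extend it for your own markup.
LOOKALIKE_SPACES = {
    "\u00a0": "NO-BREAK SPACE (literal)",
    "\u2002": "EN SPACE",
    "\u2003": "EM SPACE",
    "\u2009": "THIN SPACE",
}

# Matches id="..." but not data-id="..." or similar suffixed attribute names.
ID_ATTR = re.compile(r'(?<![\w-])id\s*=\s*"([^"]*)"', re.IGNORECASE)


def scan_markup(html: str) -> dict:
    """Count hazards that make exact-string edits miss or misfire."""
    report = {name: html.count(ch) for ch, name in LOOKALIKE_SPACES.items()}
    # The entity and the literal character render the same but differ as bytes.
    report["&nbsp; entity"] = html.count("&nbsp;")
    ids = Counter(ID_ATTR.findall(html))
    report["duplicated ids"] = sorted(i for i, n in ids.items() if n > 1)
    return report
```

A non-zero count is not an error in itself. It is a warning that a search string typed with a plain space may silently fail to match what is actually in the file.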

Now add an AI coding agent doing the edits. That multiplies throughput, and it multiplies the blast radius by the same factor. So the question I had to answer before letting it touch anything was specific: what boundaries make it safe to let an agent modernize a live, client-owned legacy site, fast, without the client ever eating a regression I did not intend?

Six boundaries, not trust

Safety here does not come from asking the agent nicely. It comes from the shape of the system around it. None of these six boundaries is novel on its own; the point is that all six together change the risk profile. The first five are the standard agent-safety checklist. The sixth is the one that tames opaque, frozen markup.

| # | Boundary | How it works on coachscall.org |
|---|---|---|
| 1 | Isolation | Work happens in a `git worktree` on a `refactor/*` branch, served from a separate staging subdomain (`coachscall.stephens.page`, its own vhost and cert, `noindex`). Production stays on `master` in its own document root and is never touched until promotion. |
| 2 | Data | Secrets and runtime state are gitignored and unreachable: `private/` is `Deny from all`; the editor's writable data directory is setgid, group-owned by the web group, least privilege; staging gets its own environment, so internal links stay on staging. |
| 3 | Command and scope | The agent only edits files in the worktree. Edits are surgical and byte-aware (exact-string and verified-range replacements), never a broad sed over everything. No destructive operation touches prod. |
| 4 | Human gate | Promotion is a human decision: the client reviews the staging URL and approves; then a `git merge` of the branch into `master` is the deploy. The non-technical owner is the merge gate. |
| 5 | Audit | Every change is an atomic, signed commit with a plain-English message. The entire modernization is a reviewable diff history, not a mystery. |
| 6 | Rollback and visual-regression gate | Before promotion: delete the branch, and prod never moved. After: one `git revert`. And the gate that makes opaque-markup refactors safe: pixel-diff every page, staging versus prod, with a target of AE 0, before a human is ever asked to look. |
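Boundary 3 is the easiest one to show in code. The following is a sketch under stated assumptions, not the actual edit tooling: `inside_worktree`, `replace_exact_once`, and `surgical_edit` are hypothetical names, and "verified" is interpreted here as "the target bytes occur exactly once".

```python
from pathlib import Path


def inside_worktree(path: Path, worktree: Path) -> bool:
    """True only if path resolves to a location under the worktree root."""
    try:
        path.resolve().relative_to(worktree.resolve())
        return True
    except ValueError:
        return False


def replace_exact_once(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace old with new only when old occurs exactly once in data."""
    if not old:
        raise ValueError("refusing to match an empty byte string")
    count = data.count(old)
    if count != 1:
        # Zero matches usually means a look-alike character; several
        # matches means the edit is ambiguous. Both are refused.
        raise ValueError(f"expected exactly 1 match, found {count}")
    return data.replace(old, new, 1)


def surgical_edit(path: Path, worktree: Path, old: bytes, new: bytes) -> None:
    """Apply one byte-exact edit to a file, refusing anything out of scope."""
    if not inside_worktree(path, worktree):
        raise PermissionError(f"{path} is outside the worktree")
    original = path.read_bytes()
    path.write_bytes(replace_exact_once(original, old, new))
```

Working on bytes rather than decoded text is deliberate: the edit either matches what is on disk or fails loudly, which is the opposite of a broad pattern replace.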

The boundary that does the real work: measure pixels, not bytes

Refactoring frozen, machine-generated HTML is dangerous for one reason: you cannot read it well enough to be sure a change is inert. A diff that looks alarming often renders identically, and a diff that looks harmless can shift the layout. So I stopped certifying changes by reading them and measured the result instead. Render each page on production and on staging at fixed viewports, compare with an absolute-error pixel metric, and treat AE 0 as the contract: pixel-identical to what is live at those viewports. Only after the pixel diff is clean does a human get asked to approve.
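The metric itself is small. This sketch assumes AE means the count of pixels that differ in any channel, and that both screenshots have already been decoded to raw RGB buffers of the same dimensions; rendering and decoding are left out, and the function names are illustrative.

```python
def absolute_error(a: bytes, b: bytes, width: int, height: int, channels: int = 3) -> int:
    """Count pixels that differ in any channel between two raw pixel buffers."""
    expected = width * height * channels
    if len(a) != expected or len(b) != expected:
        # A size mismatch is itself a finding, not something to crop away.
        raise ValueError("buffer size does not match the stated dimensions")
    differing = 0
    for i in range(0, expected, channels):
        if a[i:i + channels] != b[i:i + channels]:
            differing += 1
    return differing


def gate(pages: dict[str, tuple[bytes, bytes]], width: int, height: int) -> dict[str, int]:
    """Return pages whose AE is non-zero; an empty dict means the gate passes."""
    failures = {}
    for name, (prod, staging) in pages.items():
        ae = absolute_error(prod, staging, width, height)
        if ae != 0:
            failures[name] = ae
    return failures
```

There is no tolerance parameter on purpose. A threshold would turn "identical" into "close enough", and the contract only works because zero is unambiguous.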

This caught two things during the coachscall.org refactor that reading the bytes would have missed.

Extracting the duplicated footer into one shared partial produced a non-zero diff on the home page only. The pixels showed why: the home page's copy of the footer carried a stray space (Call </a> instead of Call</a>) that the other five pages did not, nudging the separators about 1 pixel. The "regression" was the refactor removing a pre-existing inconsistency, exactly the kind of call you want a human to make on evidence, not a guess.
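Once the pixels point at a page, finding the byte-level cause can be mechanical. Here is a rough sketch, assuming each page's copy of the footer has already been extracted as a string; `group_copies` and `explain_outlier` are hypothetical helpers, and the sample markup in the usage note is made up to mirror the stray-space case.

```python
import difflib
import hashlib
from collections import defaultdict


def group_copies(copies: dict[str, str]) -> list[list[str]]:
    """Group page names by the exact bytes of their copy, largest group first."""
    groups = defaultdict(list)
    for page, fragment in copies.items():
        digest = hashlib.sha256(fragment.encode("utf-8")).hexdigest()
        groups[digest].append(page)
    return sorted(groups.values(), key=len, reverse=True)


def explain_outlier(majority: str, outlier: str) -> list[str]:
    """Character-level differences, with whitespace made visible via repr()."""
    matcher = difflib.SequenceMatcher(None, majority, outlier, autojunk=False)
    return [
        f"{tag}: {majority[i1:i2]!r} -> {outlier[j1:j2]!r}"
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Given five identical copies and one with `Call </a>`, the grouping isolates the odd page and the diff reports a single inserted space. That is the evidence a human needs to decide whether the inconsistency was intentional.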

A 4,000-pixel difference on the About page turned out to be a webfont-load race in the screenshotter, not a content change. Re-shot with the fonts settled, it went to AE 0. The gate tells a real change apart from measurement noise. The per-page active-nav markers (aria-current="page") were preserved by parameterizing the shared partial, and each page was verified to still emit byte-for-byte what it did before, which is a check on the markup rather than the pixels, since an attribute does not render.
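One way to keep a race like the webfont one out of the measurement is to refuse any screenshot that has not settled. The sketch below is an assumption about how that could be done, not a description of the screenshotter used here: `capture` is a hypothetical callable that returns one screenshot as bytes.

```python
from typing import Callable


def settled_capture(capture: Callable[[], bytes], max_attempts: int = 5) -> bytes:
    """Re-capture until two consecutive shots are byte-identical, then return that shot."""
    previous = capture()
    for _ in range(max_attempts - 1):
        current = capture()
        if current == previous:
            return current
        previous = current
    # A page that never stabilizes (animation, carousel, clock) needs a
    # different strategy; failing loudly is safer than diffing noise.
    raise RuntimeError(f"capture did not settle in {max_attempts} attempts")
```

Two matching shots do not prove the page has finished loading, only that it stopped changing between captures, so this reduces noise rather than eliminating it.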

What it costs, and when it is worth it

Costs, plainly: standing up the staging mirror and the diff harness is about an hour of one-time infrastructure per site. For a single throwaway edit, that is overkill. For ongoing modernization of a site that carries revenue or reputation, it pays for itself the first time the gate catches something. The honest limit is that the pixel gate proves visual equality and nothing else. A dropped aria-current, a broken link, a change in JavaScript behavior, none of those show up as a pixel. Pixels are necessary, not sufficient; semantics still need their own checks.
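A semantic check can sit next to the pixel gate and cover part of that gap. This is a minimal sketch using only the standard library; the choice of what to fingerprint (aria-current values, link targets, script sources) is an assumption for illustration, and it says nothing about JavaScript behavior.

```python
from html.parser import HTMLParser


class SemanticFingerprint(HTMLParser):
    """Collect a few non-visual facts: aria-current values, link targets, script sources."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.aria_current: list[str] = []
        self.hrefs: list[str] = []
        self.scripts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this is called.
        attributes = dict(attrs)
        if "aria-current" in attributes:
            self.aria_current.append(attributes["aria-current"] or "")
        if tag == "a" and attributes.get("href") is not None:
            self.hrefs.append(attributes["href"])
        if tag == "script" and attributes.get("src") is not None:
            self.scripts.append(attributes["src"])


def fingerprint(html: str) -> dict[str, list[str]]:
    """Parse one page and return its collected facts in document order."""
    parser = SemanticFingerprint()
    parser.feed(html)
    parser.close()
    return {
        "aria-current": parser.aria_current,
        "hrefs": parser.hrefs,
        "scripts": parser.scripts,
    }


def semantic_diff(prod_html: str, staging_html: str) -> dict:
    """Return only the fingerprint fields that differ between prod and staging."""
    before, after = fingerprint(prod_html), fingerprint(staging_html)
    return {k: (before[k], after[k]) for k in before if before[k] != after[k]}
```

If staging rewrites absolute internal links to its own host, the href lists would need normalizing before comparison; relative links compare as-is.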

What you buy is the thing that matters. The owner can move fast without trusting the agent blindly: they approve a visible, isolated preview, and never absorb an unintended regression. Reversibility is total and cheap at every stage. And the work is more legible than most human-run deploys, because signed atomic commits plus a visual-diff record is a better audit trail than "I changed some HTML and it looked fine to me."

The checklist, if you want to steal it

Before letting an agent modernize a live legacy site:

Isolate. Branch, worktree, staging host; prod untouched.
Wall off data. Secrets gitignored, runtime directory least-privilege, staging gets its own config.
Keep edits surgical. Exact, verifiable changes; no blast-radius commands.
Gate promotion on a human reviewing an isolated preview.
Make every change an atomic, signed, plain-English commit.
Pixel-diff staging versus prod to AE 0 before a human is asked to look; treat any non-zero result as a finding to explain, not a number to wave through.

Promotion is a merge. Rollback is a revert. The client sees nothing until it is proven.

I keep the decision-and-tradeoffs version of the sixth boundary as an architecture decision record, ADR 0010, the pixel-equality gate.

The open edge I am still working on: this scheme certifies that nothing visible changed, the right contract for modernizing frozen markup, but says nothing about whether the change was a good idea. A refactor can be pixel-perfect and still make the next change harder. Measuring the pixels was the easy part. Knowing which refactors are worth doing to markup you have decided to keep alive is the part I do not have a metric for yet.

If you are sitting on a frozen site that still has to change and you would rather not gamble on an agent, this is the kind of work I do. You can reach me at jacob@stephens.page.