Human-in-the-Loop AI Agents for Legacy Business Systems
Over the past year I gave non-technical managers at a 40-person travel operator direct AI-agent access to a 25-year-old PHP/MySQL ERP that processes reservations, payments, and expense flows - and 14 manager-prototyped features have shipped to production with zero agent-caused incidents. The interesting part is not the AI. It is the set of boundaries that made it safe to say yes.
We have two engineers, and the infrastructure - servers, deploy pipeline, database, and all the agent access described here - is mine. The stakes are personal: if an agent corrupts the reservations table, I restore backups at 2 a.m., and the people whose paychecks route through these systems are my coworkers. No platform team, real money, one owner's attention: that is the constraint most small businesses face when they let AI touch their systems of record. This post is the pattern I built under it, written so you can copy it.
The dangerous combination
Legacy business systems and AI agents are individually manageable and jointly dangerous. The legacy system has no test coverage to speak of, no staging that mirrors production, and decades of implicit behavior - the kind where a column named Status means four different things depending on a second column. The agent is confident, fast, and tireless. Give it production credentials and it will cheerfully "fix" things at a rate no human can review.
Both naive responses are wrong. "Never let AI near the ERP" forfeits real value. "Trust the model, it's good now" mistakes capability for alignment with your definition of safe. The model is very good. It is also not the one who answers to your coworkers when payroll is wrong.
The correct response is architectural: assume the agent will eventually do something wrong, and build the system so that the wrong thing is bounded, visible, and reversible.
Six boundaries, one diagram
Every agent in this system operates inside six concentric boundaries. A seventh, which is about people rather than processes, gets its own section below.
┌────────────────────────────────────────────────────────┐
│ 6 ROLLBACK  deploy gate + healthcheck +                │
│             revert path + kill switch                  │
│ ┌────────────────────────────────────────────────────┐ │
│ │ 5 AUDIT  every query attributable to               │ │
│ │          an identity + host + time                 │ │
│ │ ┌────────────────────────────────────────────────┐ │ │
│ │ │ 4 CODE PROMOTION  agent proposes,              │ │ │
│ │ │                   human merges - always        │ │ │
│ │ │ ┌────────────────────────────────────────────┐ │ │ │
│ │ │ │ 3 COMMAND SURFACE  default-deny            │ │ │ │
│ │ │ │ ┌────────────────────────────────────────┐ │ │ │ │
│ │ │ │ │ 2 DATA  snapshot copy,                 │ │ │ │ │
│ │ │ │ │         least-privilege DB user        │ │ │ │ │
│ │ │ │ │ ┌────────────────────────────────────┐ │ │ │ │ │
│ │ │ │ │ │ 1 ISOLATION  one agent,            │ │ │ │ │ │
│ │ │ │ │ │              one container, one    │ │ │ │ │ │
│ │ │ │ │ │              role, resource limits │ │ │ │ │ │
│ │ │ │ │ └────────────────────────────────────┘ │ │ │ │ │
│ │ │ │ └────────────────────────────────────────┘ │ │ │ │
│ │ │ └────────────────────────────────────────────┘ │ │ │
│ │ └────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
1 - Isolation. Each manager gets their own container running their own agent session against their own copy of the application: one agent, one container, one human role. CPU and memory limits mean a runaway loop degrades one sandbox, not the host. This is the cheapest boundary to build and the one that makes every other boundary enforceable, because "the agent" is now a specific OS-level identity you can constrain, meter, and kill.
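As a rough illustration of the isolation boundary, the sketch below builds a container launch command with CPU, memory, and process-count limits. Docker is an assumption here (the setup above does not name a container runtime), and the image name, the `sandbox-<manager>` naming scheme, and the default limits are all hypothetical.

```python
import shlex


def sandbox_run_command(manager, image, cpus=1.0, memory="1g", pids=256):
    """Build a `docker run` argv for one manager's sandbox (hypothetical naming)."""
    if not manager.isalnum():
        # Keep container names predictable so they are easy to meter and stop.
        raise ValueError("manager name must be alphanumeric")
    return [
        "docker", "run", "--detach",
        "--name", f"sandbox-{manager}",  # one container per human role
        "--cpus", str(cpus),             # cap CPU so a runaway loop stays local
        "--memory", memory,              # cap memory for the same reason
        "--pids-limit", str(pids),       # bound process count (fork storms)
        image,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; this sketch has no side effects.
    print(shlex.join(sandbox_run_command("alice", "erp-sandbox:latest")))
```

With a predictable name per sandbox, stopping one agent under this assumption is `docker stop sandbox-alice`.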
2 - Data. Agents never see the production database. Each sandbox points at a snapshot copy, refreshed every few hours, through a least-privilege database user whose grants are pinned to the sandbox host. A leaked credential buys read access to an hours-old copy of the data on one machine: a contained, survivable failure.
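One way to picture a host-pinned, least-privilege account is the sketch below, which generates the MySQL statements for a sandbox user. The account, host, and database names are hypothetical, `IDENTIFIED BY RANDOM PASSWORD` assumes MySQL 8.0.18 or later, and the read-only grant is an assumption; a sandbox that needs to write to its own copy would need more.

```python
import re

_USER = re.compile(r"[A-Za-z0-9_]+")
# No '%' or '_' in hosts: MySQL treats both as wildcards in account host names.
_HOST = re.compile(r"[A-Za-z0-9.\-]+")
# No '_' in the database name: it is a wildcard in database-level grants.
_DATABASE = re.compile(r"[A-Za-z0-9]+")


def sandbox_grant_sql(user, host, database):
    """Return SQL for a read-only account that can connect only from `host`."""
    if not _USER.fullmatch(user):
        raise ValueError("user must be a plain identifier")
    if not _HOST.fullmatch(host):
        raise ValueError("host must be a literal name or address, not a wildcard")
    if not _DATABASE.fullmatch(database):
        raise ValueError("database must be alphanumeric")
    account = f"'{user}'@'{host}'"
    return [
        f"CREATE USER {account} IDENTIFIED BY RANDOM PASSWORD;",
        f"GRANT SELECT ON `{database}`.* TO {account};",
    ]


if __name__ == "__main__":
    for statement in sandbox_grant_sql("agent_alice", "10.0.0.12", "erpcopy"):
        print(statement)
```

Rejecting wildcard hosts in the generator is the point of the sketch: the pinning cannot be loosened by accident.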
3 - Command surface. Default-deny. The agent's shell allowlist names the commands it may run; everything else is refused. It cannot reach package managers, open arbitrary outbound connections, or SSH onward to other hosts. When a new legitimate need appears, a human adds the specific command - the surface only grows by deliberate decision, never by agent initiative.
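A default-deny check can be very small. The sketch below is a minimal illustration, not the allowlist described above: the command names in `ALLOWED` are hypothetical, and a real allowlist would likely constrain arguments as well as command names.

```python
import shlex

# Hypothetical allowlist; it grows only when a human edits this set.
ALLOWED = frozenset({"ls", "cat", "grep", "git", "php"})

# Refuse anything that could chain, substitute, or redirect commands.
SHELL_META = frozenset(";|&`$<>()\n")


def is_allowed(command_line, allowed=ALLOWED):
    """Default-deny: True only if the command is explicitly allowlisted."""
    if any(ch in SHELL_META for ch in command_line):
        return False
    try:
        argv = shlex.split(command_line)
    except ValueError:
        # Unbalanced quotes and similar parse errors are refused, not guessed at.
        return False
    if not argv:
        return False
    # Exact name match: "/usr/bin/ls" or "./ls" are refused unless listed.
    return argv[0] in allowed


if __name__ == "__main__":
    for line in ("git status", "curl http://example.com", "ls; rm -rf /"):
        print(line, "->", "allowed" if is_allowed(line) else "refused")
```

Every failure path returns a refusal, so an input the checker does not understand is denied by construction.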
4 - Code promotion. The load-bearing boundary. Agents propose changes as pull requests; a human reviews and merges - always, no exceptions for "trivial" changes, because the agent does not get to decide what is trivial. Code reaches production only through the same gated, healthchecked pipeline a human uses. The agent's work product is a proposal, never an action on the system of record.
5 - Audit. Every database connection is attributable: which identity, which host, what time. Agent transcripts persist, so "what did the agent do and why" is answerable after the fact. One rule has paid for itself: make the automated identity connect remotely under its own user even when a local socket would work, because local sockets inherit ambient privileges and blur the attribution you will want during an incident.
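The attribution rule can be sketched as a record type that refuses local connections for automated identities. The field names, the JSON-lines output, and the list of local host names are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

# Hosts that would blur attribution for an automated identity.
LOCAL_HOSTS = frozenset({"", "localhost", "127.0.0.1", "::1"})


@dataclass(frozen=True)
class AuditRecord:
    identity: str  # which database user or agent account
    host: str      # which machine the connection came from
    at: str        # ISO 8601 timestamp in UTC
    action: str    # what was attempted


def audit_record(identity, host, action, now: Optional[datetime] = None):
    """Build an attributable record; refuse local connections outright."""
    if host in LOCAL_HOSTS:
        raise ValueError("automated identities must connect from a named remote host")
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return AuditRecord(identity=identity, host=host, at=stamp, action=action)


def to_json_line(record):
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    print(to_json_line(audit_record("agent_alice", "sandbox-alice", "SELECT")))
```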
6 - Rollback. Deploys run through a script with a post-deploy healthcheck and a documented revert path. The kill switch for any agent is one command: stop the container. Schema changes use online DDL with the rollback statement written down before the change is applied. If you cannot say in one sentence how you would undo an automation, it is not ready.
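The deploy-then-verify-then-revert shape fits in a few lines. This is a minimal sketch with hypothetical callables standing in for the real deploy script, healthcheck, and revert path, none of which are shown in this post.

```python
from typing import Callable


def gated_deploy(deploy: Callable[[], None],
                 healthcheck: Callable[[], bool],
                 revert: Callable[[], None]) -> bool:
    """Run deploy, then healthcheck; revert on any failure. Returns health."""
    try:
        deploy()
        healthy = healthcheck()
    except Exception:
        # A crash during deploy or healthcheck counts as unhealthy.
        healthy = False
    if not healthy:
        revert()
    return healthy


if __name__ == "__main__":
    log = []
    ok = gated_deploy(
        deploy=lambda: log.append("deploy"),
        healthcheck=lambda: False,  # simulate a failed post-deploy check
        revert=lambda: log.append("revert"),
    )
    print(ok, log)
```

Requiring `revert` as an argument mirrors the one-sentence rule: the function cannot be called by someone who has not yet written down how to undo the change.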
The "do not automate" line
One rule survives every iteration, and I would adopt it before any of the technology: no agent approves, merges, deploys, pays, hires, fires, disciplines, denies service, or materially alters revenue-critical data without a named human gate. Write the list down and publish it where the people affected can read it. It converts "we use AI responsibly" from a slogan into a checkable claim: anyone can ask "who was the named human on this change?" and the audit boundary means there is always an answer.
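To show how the list becomes a checkable claim, the sketch below refuses any gated action that lacks a named human approver. The action names are a hypothetical encoding of the list above, and the `humans` set stands in for however an organization identifies its accountable people.

```python
from typing import Optional

# Hypothetical encoding of the "do not automate" list.
GATED_ACTIONS = frozenset({
    "approve", "merge", "deploy", "pay", "hire", "fire",
    "discipline", "deny_service", "alter_revenue_data",
})


def named_human_for(action: str, approver: Optional[str],
                    humans: frozenset) -> Optional[str]:
    """Return the accountable human for a gated action, or raise.

    Ungated actions return None: no named human is required for them.
    """
    if action not in GATED_ACTIONS:
        return None
    if approver is None or approver not in humans:
        # Agent identities are never in `humans`, so they cannot self-approve.
        raise PermissionError(f"{action!r} requires a named human approver")
    return approver


if __name__ == "__main__":
    humans = frozenset({"dana"})
    print(named_human_for("merge", "dana", humans))
    print(named_human_for("open_pull_request", None, humans))
```

Storing the returned name alongside each change is what lets anyone later ask who the named human was.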
What this actually shipped
Managers who used to queue feature requests through me now prototype directly: they describe what they want, the agent builds it against their sandbox, they iterate until it behaves, and I review a working pull request instead of a wish-list email. Fourteen of those features have shipped, and the quality of my review queue went up - a PR already exercised against realistic data beats a spec written from memory.
The boundaries earned their keep in the boring way: not by preventing a dramatic incident, but by making whole categories of incident structurally impossible, so that my review attention - the scarcest resource in a shop with no platform team - goes to "is this feature right" instead of "could this destroy something."
The hard tradeoff: the human gate is a bottleneck, on purpose
Honesty requires naming the cost. Boundary 4 routes every change through one person's review, and that person is me. The gate caps throughput: a manager whose prototype works on Tuesday may wait until Thursday for production. During my vacation, the queue simply holds. Fourteen features in a year is the number with that latency included.
I keep the gate anyway, for a reason that is easy to underrate: review-at-promotion is the only boundary that checks intent, not just blast radius. Boundaries 1-3 limit what a wrong action can damage; 5 tells you what happened; 6 lets you undo it. Only the human gate asks "should this change exist at all" - and where the data is people's reservations and money, I want that asked by someone who can be embarrassed at the staff meeting. When the latency genuinely hurts, the fix is a second reviewer, not a softer gate.
When I'd choose differently
This pattern is tuned for a specific situation: revenue-critical legacy state, sensitive data, and a team too small for dedicated platform engineering. Change the situation and the design should change with it.
- Greenfield SaaS with real test coverage and staging: let agents run more autonomously against ephemeral environments and lean on CI and evals instead of per-change human review. The merge gate earns its cost from the absence of a safety net; if you have the net, the gate can loosen.
- Non-sensitive data: the snapshot-copy boundary matters much less. Live read replicas are simpler and fresher.
- A real platform team: graduated autonomy - agents earn wider permissions per-task-class as eval evidence accumulates - beats my static allowlists.
- No one available to review at all: then do not deploy agents against the system of record, full stop. The pattern's floor is one accountable human; below that floor, the honest answer is "not yet."
The labor note
First, downstream: this automation changed my coworkers' work, by design. Managers went from supplicants in a feature queue to builders of their own tools - more agency, not less. Nothing here surveils, scores, or paces an employee. If your version does otherwise, you have built a different thing and should say so.
Second, upstream: the agents run on frontier models trained and tuned with substantial human labor - data annotation and safety work whose conditions and wages I cannot verify from vendor disclosures. "Cheap magic" is not an honest description of what powers these systems, and operators adopting them should at least know what they cannot currently know.
The copyable checklist
Before any AI agent touches a business system you are responsible for:
- Isolation - one agent, one container, one human role; resource limits set.
- Data - copy or replica, never production; least-privilege DB user; credentials pinned to host.
- Command surface - default-deny allowlist; grows only by human decision.
- Code promotion - agent proposes, named human merges; no exceptions.
- Audit - every action attributable to identity + host + time; transcripts kept.
- Rollback - healthchecked deploys; revert path written before the change; one-command kill switch.
- People - the "do not automate" list published; downstream work changed toward agency, not surveillance; upstream labor opacity acknowledged.
If you run a small business on an old system and the AI vendors are telling you it is all upside, the checklist above is what the upside actually costs to capture safely. In my experience the cost is modest, mostly paid once, and dramatically smaller than one bad incident.
What I have not solved: the human gate does not scale past a reviewer or two, my allowlists are maintained by hand, and I still lack a principled way to decide when an agent has earned a wider surface rather than just not having failed yet. If you have run graduated autonomy against a legacy system with real consequences, I would genuinely like to hear what works.
A shorter companion note on the sandbox layer specifically is at Sandboxing AI Agents per Business Role.
The individual boundaries are each written up with their trade-offs as decision records: per-tenant isolation (0001), snapshot-fed sandboxes (0003), a scoped agent identity (0005), default-deny data access (0009), and the pixel-equality gate for generated markup (0010).