ETA Platform - Case Study | Jacob Stephens

System overview

How the pieces connect. (Architecture shown at a deliberately high level; hostnames, addresses, and vendor specifics omitted.)

The arc

Inherited legacy, modernized in production, then extended.

The business runs on Tourbot - a full-lifecycle group-travel ERP handling the sales pipeline, itinerary building, reservations, passenger logistics, payments, communications, and 70-plus reports, plus three customer- and contractor-facing portals on the same core. When I took ownership it was a PHP 5.6 / CentOS 7 / MySQL 5.7 system whose support runway was running out, and every other part of the business depended on it staying up.

Over four years I led the modernization in production, without a freeze: an enterprise-wide migration to PHP 8, MySQL 8.4, and Rocky Linux 9 across the codebase and server fleet, while still shipping features. That baseline unblocked a second wave that wasn't possible on the old stack - containerized per-manager AI sandboxes, a Python agent-orchestration VM, a fleet-wide Prometheus/Grafana/Alertmanager metrics stack, and ProspectForge (a built, review-ready in-house replacement for the CivicIQ prospecting SaaS, pending management approval) - alongside the unglamorous habits that keep a single-engineer platform safe: a guarded server-side deploy pipeline, a PHPUnit/CI suite across four PHP versions, and a file-based cron registry with a drift monitor watching 100-plus scheduled jobs.

The throughline is an engineer who inherited a legacy system and systematically turned it into a platform - more common and more credible than a string of greenfield builds.

By the numbers

Selected, verifiable outcomes from the work below.

2,650 → 183SQL statements per manifest render

~1smanifest load, down from 5-7s

4user-facing portals on one shared core

14manager-prototyped features shipped to production

77field guides on the guides PWA this season

3,300+traveler SMS sent from the field

~$154Kguide expenses reconciled this season

100+scheduled jobs under drift monitoring

14hosts instrumented with Prometheus exporters

~80%of measured DB query time removed by three indexes

Engineering highlights

Five representative pieces, each problem → approach → outcome.

Manifest performance: 2,650 → 183 SQL statements/request

Problem. The group-manifest page - printed before every trip - took 5-7s and fired ~2,650 SQL statements per render from per-row constructor fan-out. Approach. Six profile-driven rounds: batched in-request prefetch caches, one shared DB handle, and a targeted history-table column + composite index - each verified byte-identical against the original HTML. Outcome. ~1s loads, statements per request down ~93%, validated on a 244-sub-reservation outlier group.

PCI-conscious payments

Problem. Aging card integrations and bank-file workflows needed modernizing while customers kept paying daily. Approach. Rebuilt the card flow across merchant processors with bank-account encryption, and modernized the ACH/NACHA bank-file pipeline over SFTP - kept steady through the platform migration. Outcome. Current PCI posture and reliable money movement, with continuity for in-flight registrations and payments.

Multi-tenant AI assistants for managers

Problem. Non-technical managers wanted to query business data and prototype changes safely, without waiting on developer time. Approach. Per-manager sandboxed Tourbot instances (Docker, Traefik) with isolated databases refreshed every few hours from a production mirror, per-container resource limits, allowlist-constrained (default-deny) agent command execution, and AI work isolated on per-role branches with a manual promotion step. Outcome. Managers explore data and draft workflow changes through conversation; isolation and a human gate keep production safe.

Observability first, then a measured 80% database win

Problem. The bespoke status prober answered "is it up?" but not "why is database CPU at 65%?" - no metrics history, paging logic embedded in a dashboard, and performance work that was anecdotal rather than measured. Approach. Stood up Prometheus, Grafana, and Alertmanager on a Terraform-provisioned droplet - node, mysqld, and blackbox exporters across all 14 hosts (host metrics, replication health, uptime, TLS expiry), with the rule-to-SMS paging chain verified live before cutting over the old pager. Then used performance_schema to rank 27 days of production queries by total time: three full-table-scan hot queries dominated, fixed with three composite indexes applied as online DDL (INPLACE, LOCK=NONE), EXPLAIN-verified, rollback written first. Outcome. One paging pipeline, capacity questions answered from dashboards, and ~80% of measured query time (~404K of ~500K query-seconds) gone - with the remaining load diagnosed as application-level N+1 volume rather than papered over with hardware.

Platform migration, no downtime

Problem. PHP 5.6, MySQL 5.7, and CentOS 7 were all past or near end-of-life under a live business. Approach. Codebase-wide PHP 5→8 migration, a MySQL 5.7→8.4 plan with dual-running databases during cutover, and a CentOS 7→Rocky 9 fleet migration with per-server runbooks. Outcome. A current, supported stack across the fleet - completed without an outage staff or customers noticed.

Stack & why

The reasoning, not just the list.

PHP 8 / MySQL 8 - the platform's core; modernized in place rather than rewritten, because a rewrite would have frozen a business that can't stop.

Docker + Traefik - per-manager AI sandboxes need hard filesystem and database isolation; containers give each tenant its own world cheaply.

Python / Flask + a scoped system user - the agent-orchestration VM runs an autonomous coding agent with least-privilege, auditable credentials it never handles directly.

FastAPI / Next.js / PostgreSQL - ProspectForge is greenfield, so it uses a typed, modern stack independent of the legacy MySQL core.

Server-side deploy script + PHPUnit/CI - for a single-server PHP app, a guarded shell pipeline (dirty-tree, flock, ancestor-only, healthcheck, one-command rollback) beats a hosted runner; CI runs the suite across PHP 8.1-8.4.

Prometheus / Grafana / Alertmanager - capacity questions ("why is DB CPU at 65%?") need metrics history, not a ping dashboard; a dedicated Terraform-provisioned droplet keeps paging out of the app as the single source of paging truth.

File-based cron registry + drift monitor - 100-plus scheduled jobs become legible and tamper-evident instead of a black box.

Open source: I upstreamed the MySQL 8.4 caching_sha2_password auth-handshake fix this stack needed into php-mysql-replication (PR #148).

Architecture decisions: the non-obvious trade-offs - tenant isolation, the data tier, the deploy pipeline, agent identity - are written up as sanitized ADRs in infrastructure-patterns.

Confidentiality: the repositories are private per employer agreement. Architecture is described here at a high level and omits hostnames, network addresses, credential mechanisms, and vendor specifics.

Read further

The reasoning, the résumé, and the rest of the work.

Architecture decisions (ADRs) → Résumé → Full portfolio → Contact →