jacob@stephens.page
Decision Record

A pull-based metrics stack over a bespoke prober, with the data plane kept private

ADR 0011 · Accepted · in production · ~537 words

Context

ADR 0007 chose a single bespoke service that pulls health from the fleet and renders one page, and named when that would stop being enough: "rich per-process metrics, high-resolution history, or real alerting SLAs tip the balance toward push-based agents and a proper time-series database."

All three arrived at once. A new revenue-adjacent service needed per-request metrics (rate, errors, latency by route), the fleet outgrew "is it up?", and "find out by looking at the page" needed to become "get paged." The bespoke prober had reached its design limit.
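To make "per-request metrics (rate, errors, latency by route)" concrete, here is a minimal standard-library sketch of a request counter and latency histogram rendered in the Prometheus text exposition format. The `RouteMetrics` class, the metric names, and the bucket bounds are assumptions chosen for illustration, not taken from this ADR; a real service would more likely use an official client library.

```python
from collections import defaultdict

# Hypothetical histogram bucket upper bounds, in seconds.
BUCKETS = (0.1, 0.5, 1.0)


class RouteMetrics:
    """In-memory per-route request counts and a latency histogram."""

    def __init__(self):
        self.requests = defaultdict(int)  # (route, status) -> count
        self.bucket_counts = defaultdict(lambda: [0] * len(BUCKETS))
        self.latency_sum = defaultdict(float)
        self.latency_count = defaultdict(int)

    def observe(self, route, status, seconds):
        """Record one finished request."""
        self.requests[(route, str(status))] += 1
        # Buckets are cumulative: a request counts in every bucket whose
        # upper bound is at least its latency.
        for i, bound in enumerate(BUCKETS):
            if seconds <= bound:
                self.bucket_counts[route][i] += 1
        self.latency_sum[route] += seconds
        self.latency_count[route] += 1

    def render(self):
        """Return the metrics as Prometheus text exposition format."""
        lines = ["# TYPE http_requests_total counter"]
        for (route, status), n in sorted(self.requests.items()):
            lines.append(
                f'http_requests_total{{route="{route}",status="{status}"}} {n}'
            )
        name = "http_request_duration_seconds"
        lines.append(f"# TYPE {name} histogram")
        for route in sorted(self.latency_count):
            for bound, n in zip(BUCKETS, self.bucket_counts[route]):
                lines.append(f'{name}_bucket{{route="{route}",le="{bound}"}} {n}')
            total = self.latency_count[route]
            lines.append(f'{name}_bucket{{route="{route}",le="+Inf"}} {total}')
            lines.append(f'{name}_sum{{route="{route}"}} {self.latency_sum[route]}')
            lines.append(f'{name}_count{{route="{route}"}} {total}')
        return "\n".join(lines) + "\n"
```

Rate and error ratio are then derived at query time from `http_requests_total`, and latency quantiles from the `_bucket` series, which is why the service only needs to expose monotonically increasing counts.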

The off-the-shelf answer is a metrics stack: Prometheus to scrape and store, Grafana to visualize, Alertmanager to route notifications, plus exporters for host and black-box signals. Prometheus, Alertmanager, and the exporters ship with no authentication enabled by default; Grafana has a login, but it starts with a well-known default admin password. On a host without a network firewall, the default bind-to-all-interfaces means anyone who finds the port can read every metric you collect, and the Alertmanager API lets them silence or forge your alerts. Adopting the stack is half the decision; containing it is the other half.
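One way to audit the exposure described above is to classify each component's listen address as loopback-only or not. The sketch below is a minimal, standard-library check. The `LISTEN` map is hypothetical (only the port numbers are the components' usual defaults), and treating `localhost` as loopback is an assumption about the host's resolver.

```python
import ipaddress


def is_loopback_only(listen_address: str) -> bool:
    """Return True if a host:port listen address binds only to loopback."""
    host, _, _port = listen_address.rpartition(":")
    host = host.strip("[]")  # allow bracketed IPv6 such as [::1]:9090
    if host == "":
        return False  # ":9090" means every interface
    if host == "localhost":
        return True  # assumption: localhost resolves to loopback
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # some other hostname: treat it as exposed


# Hypothetical configuration; ports are the usual defaults for each component.
LISTEN = {
    "prometheus": "0.0.0.0:9090",
    "alertmanager": ":9093",
    "grafana": "127.0.0.1:3000",
    "node_exporter": "[::1]:9100",
}


def exposed(listen):
    """Names of components whose listen address is not loopback-only."""
    return sorted(
        name for name, addr in listen.items() if not is_loopback_only(addr)
    )


print(exposed(LISTEN))  # → ['alertmanager', 'prometheus']
```

In this sketch, anything that fails the check is a candidate for rebinding, for example through the `--web.listen-address` flag that Prometheus, Alertmanager, and the exporters accept.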

Decision

Adopt the stack, but keep the pull discipline and lock the data plane to loopback:

Consequences

When I'd revisit

A multi-node fleet where one Prometheus can't scrape everything tips toward federation or remote-write. A compliance or team-size requirement to browse the internal UIs without tunnels tips toward an authenticating proxy rather than loopback. And needing reachability from several network vantage points tips the black-box probing toward distributed agents, the same boundary ADR 0007 named.

Narrative writeup: What It Costs to Watch Your AI Coding Agents. One of a set of architecture decision records. Source markdown lives in the infrastructure-patterns repo, which is the canonical copy.