← All decisions jacob@stephens.page
Decision Record

Pull-based health probing over push agents for a small fleet

ADR 0007 · Accepted ยท in production · ~300 words

Context

A modest fleet - a web host, a database, a cron/worker box, a handful of app hosts - needs at-a-glance health visibility. The heavyweight default is a monitoring stack with an agent installed on every node, each pushing metrics to a central time-series store. For a fleet this size that's a lot of moving parts

failure domain - to answer a question that is mostly "is each service up and within thresholds?"

Decision

A single small service on the coordination host that pulls: it probes each server's health endpoints and ports on a fixed interval and renders one page. No agent runs on the monitored hosts - the probes use the network access the coordination host already has. The little state it keeps is local and single-node (see ADR 0008).

Consequences

When I'd revisit

Rich per-process metrics, high-resolution history, or real alerting SLAs tip the balance toward push-based agents and a proper time-series database. Needing to see health from many network vantage points tips it toward distributed probing. At this fleet size and with one trusted operator host, neither yet pays for itself.

One of a set of architecture decision records. Source markdown lives in the infrastructure-patterns repo, which is the canonical copy.