jacob@stephens.page
Decision Record

A shared OpenTelemetry collector seam over direct backend wiring, metrics-only for Claude Code

ADR 0021 · Accepted · in production (Phases 1-3 live across the fleet; Phase 4, PHP app tracing, prototyped on a staging host only) · ~1186 words

Context

ADR 0011 stood up the standard metrics stack (Prometheus, Alertmanager, Grafana, per-node exporters) for infrastructure health: hosts, databases, endpoints, certificates. That answers "is the fleet healthy," not what the fleet actually does.

The largest unmeasured activity is the AI coding agent itself. Engineers run it interactively in tmux on roughly ten hosts, and an automation host runs its own long-lived agent sessions. None of it was instrumented: no view of token consumption, session counts, model mix, or relative cost across hosts and people. The runs are subscription-based rather than API-billed, so there is no invoice to read either. Separately, the in-house apps emit no traces, so when a request is slow we infer rather than observe.

OpenTelemetry closes both gaps with one vendor-neutral pipeline: the agent emits OTEL metrics natively, and the apps take OTEL SDKs. This ADR settles how to receive, store, and surface that telemetry without rebuilding what ADR 0011 provides.

Decision

Introduce a single OpenTelemetry Collector on the existing monitoring host as the shared ingestion seam, feeding the Prometheus and Grafana stack already there. The collector decouples what emits telemetry from what stores it, so new sources (the agent now, apps later) and new backends (a traces store later) are configuration changes, not new architecture.

For application tracing, the same collector gains a traces pipeline feeding a traces store (Grafana Tempo, the natural fit alongside Grafana); apps point their OTEL SDKs at that collector.
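To make "configuration changes, not new architecture" concrete, here is a minimal sketch of the collector configuration modelled as plain Python data. The component names (`otlp`, `batch`, `prometheus`, `otlp/tempo`) follow the collector's usual conventions, but the listen addresses, the Tempo port, and the helper functions are assumptions for illustration, not the deployed configuration.

```python
import copy
import json

# Hypothetical addresses; the ADR does not state ports or hostnames.
PROMETHEUS_SCRAPE_ADDR = "0.0.0.0:8889"  # where Prometheus would scrape the collector
TEMPO_OTLP_ADDR = "127.0.0.1:14317"      # assumed Tempo OTLP listener on the same host


def base_config():
    """Metrics-only collector config: OTLP in, Prometheus out."""
    return {
        "receivers": {
            "otlp": {
                "protocols": {
                    "grpc": {"endpoint": "0.0.0.0:4317"},
                    "http": {"endpoint": "0.0.0.0:4318"},
                }
            }
        },
        "processors": {"batch": {}},
        "exporters": {"prometheus": {"endpoint": PROMETHEUS_SCRAPE_ADDR}},
        "service": {
            "pipelines": {
                "metrics": {
                    "receivers": ["otlp"],
                    "processors": ["batch"],
                    "exporters": ["prometheus"],
                }
            }
        },
    }


def add_traces_pipeline(config):
    """Return a copy with a traces pipeline added; the input is left unchanged."""
    new = copy.deepcopy(config)
    new["exporters"]["otlp/tempo"] = {
        "endpoint": TEMPO_OTLP_ADDR,
        "tls": {"insecure": True},  # assumes plain-text traffic on the local host
    }
    new["service"]["pipelines"]["traces"] = {
        "receivers": ["otlp"],
        "processors": ["batch"],
        "exporters": ["otlp/tempo"],
    }
    return new


def undefined_references(config):
    """List pipeline components that are referenced but never defined."""
    missing = []
    for name, pipeline in config["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            for component in pipeline.get(kind, []):
                if component not in config.get(kind, {}):
                    missing.append(f"{name}: {kind}/{component}")
    return missing


if __name__ == "__main__":
    # JSON is a subset of YAML, so this output is usable as a collector config file.
    print(json.dumps(add_traces_pipeline(base_config()), indent=2))
```

Adding the traces backend touches one exporter entry and one pipeline entry; the receiver that the agent and the apps already point at is unchanged.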

Phased rollout

  1. Collector plus agent metrics on the automation host. Verified end to end: session, token, cost, and active-time series in Prometheus, labelled by host and model.
  2. Agent metrics across the fleet. Managed settings deployed to all ten agent-running hosts; OTLP reachability confirmed from each; a starter Grafana dashboard is live.
  3. Application tracing. The monitoring host was resized in place (no IP change) to give Tempo headroom. Tempo runs monolithic with local-disk blocks and 7-day retention, fed OTLP traces by the collector and wired as a Grafana datasource. Two FastAPI services are auto-instrumented via opentelemetry-instrument, and an LLM proxy in front of one emits spans through its built-in otel callback. Traces are confirmed flowing and queryable.
  4. Legacy PHP app tracing. Prototyped on a staging host that mirrors production. The opentelemetry PECL extension is built and loaded; the OTEL PHP SDK, OTLP exporter, and PDO auto-instrumentation come in via Composer; a fail-safe auto_prepend_file bootstrap opens a SERVER root span per request; php-fpm passes the OTEL env (http/protobuf, since there is no grpc PHP extension). Request-level traces are confirmed.
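The two Phase 2 steps (deploy managed settings, confirm OTLP reachability) can be sketched as below. The environment variable names are the ones Claude Code documents for telemetry export as far as we know. The collector hostname, the port, and the choice of gRPC are assumptions, and the actual settings file deployed to the fleet may differ. Leaving the logs exporter unset is what keeps the agent metrics-only.

```python
import json
import socket

# Hypothetical collector address; the ADR does not name the monitoring host.
COLLECTOR_HOST = "monitoring.example.internal"
COLLECTOR_GRPC_PORT = 4317  # conventional OTLP/gRPC port, assumed here


def managed_settings(collector_host, port=COLLECTOR_GRPC_PORT):
    """Build a managed-settings document that enables metrics export only."""
    return {
        "env": {
            "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
            "OTEL_METRICS_EXPORTER": "otlp",
            # OTEL_LOGS_EXPORTER is deliberately left unset: metrics only.
            "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
            "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{collector_host}:{port}",
        }
    }


def otlp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to the collector port succeeds.

    This only proves the port is open from this host; it does not prove
    that the collector accepts or stores the metrics.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Print the settings document; deploying it to each host is out of scope here.
    print(json.dumps(managed_settings(COLLECTOR_HOST), indent=2))
```

Running `otlp_reachable(COLLECTOR_HOST, COLLECTOR_GRPC_PORT)` on each agent host gives a quick go/no-go before checking that series actually arrive in Prometheus.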

Key finding for the production decision: the app is overwhelmingly mysqli (roughly four times as many call sites as PDO), which has no official auto-instrumentation, so out of the box we get DB children only on PDO paths; full query-level visibility would need custom mysqli hooks via the extension's hook API, or migrating those call sites. The prepend load also adds per-request overhead that matters more in production than on staging. Both make the production rollout a separate, deliberate decision, not a copy-paste.
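A ratio like the mysqli-to-PDO one above can be approximated with a textual census of the codebase. The sketch below is a crude heuristic, not the method used for this ADR: it counts procedural `mysqli_*` calls and `new mysqli(...)` against `new PDO(...)` and `PDO::` references, and it cannot attribute object-style calls such as `$db->query()` to either driver. The function names and patterns are hypothetical.

```python
import re
from pathlib import Path

# Textual markers only; object-method calls are ambiguous and are not counted.
MYSQLI_MARKER = re.compile(r"\bmysqli_\w+\s*\(|\bnew\s+mysqli\s*\(", re.IGNORECASE)
PDO_MARKER = re.compile(r"\bnew\s+PDO\s*\(|\bPDO::", re.IGNORECASE)


def count_markers(source):
    """Count mysqli and PDO markers in one PHP source string."""
    return {
        "mysqli": len(MYSQLI_MARKER.findall(source)),
        "pdo": len(PDO_MARKER.findall(source)),
    }


def census(root):
    """Sum marker counts over every .php file under root."""
    totals = {"mysqli": 0, "pdo": 0}
    for path in Path(root).rglob("*.php"):
        counts = count_markers(path.read_text(encoding="utf-8", errors="replace"))
        for driver in totals:
            totals[driver] += counts[driver]
    return totals


if __name__ == "__main__":
    sample = '<?php $r = mysqli_query($c, "SELECT 1"); $p = new PDO("sqlite::memory:");'
    print(count_markers(sample))
```

The counts are best read as a sizing signal for how many call sites would need custom hooks or migration, not as an exact inventory.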

Alternatives considered

Consequences

Follow-ups

Narrative writeup: What It Costs to Watch Your AI Coding Agents. One of a set of architecture decision records. Source markdown lives in the infrastructure-patterns repo, which is the canonical copy.