Notes · Observability

What It Costs to Watch Your AI Coding Agents

June 16, 2026 · ~1,400 words · Instrumenting a fleet of AI coding agents with OpenTelemetry

We run Claude Code across about fifteen Linux hosts, and until last week I had no idea how much we used it. This post is how I closed that gap with OpenTelemetry, and the three surprises along the way: the cost number is a polite fiction, the dashboard lied about being empty, and the metric I most wanted to count was uncountable by default.

The short version: one OpenTelemetry collector feeding the Prometheus and Grafana stack we already ran, plus a Tempo trace store for the apps. The interesting part is the places the obvious setup is quietly wrong.

The setup, in one paragraph

Claude Code emits OpenTelemetry metrics natively - set a handful of environment variables and it ships token counts, session counts, cost, and active time over OTLP. I pointed all of it at a single OpenTelemetry Collector on our monitoring host. The collector re-exposes those metrics on a localhost endpoint the existing Prometheus scrapes, so the pull-based stack just gained one scrape target. For application traces I added Grafana Tempo behind the same collector. Total new infrastructure: one collector process and one trace store.

The reason to put a collector in the middle, rather than wire each emitter straight at Prometheus, is that it is a seam. The day you add a second signal or source, it is a config change, not a re-architecture. That bet paid off within the same week, and dredged up an older question I thought I had buried.

Surprise 1: the cost metric is notional

Claude Code reports a cost.usage metric in US dollars, and the first instinct is to treat it as a bill. It is not. Our runs are subscription-based, not API-billed, but the metric is computed from the vendor's API list prices regardless. That number is an estimate of what the tokens would have cost, not money anyone is charged.

This is genuinely useful - the right unit for comparing host against host, model against model, week against week - but only if nobody mistakes it for accounting. I labelled the metric and every dashboard panel "notional." Build an alert or finance report on it without that caveat and you will eventually have a confusing conversation with someone holding the actual invoice.

Surprise 2: the dashboard looked broken, and was right to

I rolled the telemetry config out to every host, opened the dashboard, used Claude Code, and saw nothing. Zero tokens, zero sessions, "No data." The pipeline was fine. Two separate things were hiding it.

The first is how the agent reads its config: the telemetry environment is read once, at process startup. Every session already running when I deployed the settings kept running without it; new sessions exported within a minute. Worth saying out loud any time you enable telemetry on long-lived processes, because the natural conclusion - "it's broken" - is exactly wrong.

The second was mine. The collector dropped any metric series five minutes after it stopped receiving it. For a fleet used in bursts, the series evaporated between sessions and the panels read zero whenever no one was mid-session. A working pipeline that looks dead most of the day trains you to ignore it. The fix was to extend the series expiration to seven days, which also matters for the next surprise.

Surprise 3: the metric I wanted to count was uncountable

I wanted the simplest possible number: how many sessions, by host, by person. Claude Code has a session.count metric, so this looked free.

It was zero. Always zero, no matter how many sessions ran.

The cause was a labelling decision I had made for a good reason that turned out exactly wrong. To keep Prometheus cardinality bounded, I had disabled the per-session id label. Without it, every short-lived session process reports the value "1" onto the same collapsed time series. The counter never accumulates, so summing it or taking its rate gives nothing. Tokens survived only because their values differ between sessions, and Prometheus's increase() reconstructs the total through what it reads as counter resets. A counter that is always 1 has no resets and no increases. It is, for counting purposes, mute.

The fix was to turn the session id back on, made safe by the seven-day series expiration I had just added: it bounds the cardinality to a rolling seven-day window of sessions, which for our fleet is small. One change solved two problems, and the lesson is the uncomfortable kind: the "responsible" optimization silently broke the headline metric. Count sessions as distinct session ids, never as a counter increase.

The legacy app fought back

Metrics were the easy half. Traces were where the apps differed.

The modern services were nearly free. A FastAPI app takes one wrapper - opentelemetry-instrument in front of the start command - and you get request, database, and outbound HTTP spans with no code change. An LLM proxy we run had a built-in OpenTelemetry callback; one config line and its per-call spans appeared in Tempo next to the app that called it.

The legacy PHP application was the hard case. PHP needs a C extension built before any of this works, and with no background process to batch exports, each request pays the export cost at shutdown. I got it working on a staging host that mirrors production: extension built, the SDK and an OTLP exporter installed, a small fail-safe bootstrap wrapped so telemetry can never break a page. Request-level traces showed up.

Then I looked inside the traces, and there was almost nothing. One span: the request. The reason is a number I had to go count: the app uses the older mysqli database interface in 278 files and the newer PDO interface in 69. There is auto-instrumentation for PDO and none for mysqli. So out of the box the dominant database layer - four call sites in five - is invisible. Full visibility there would mean writing custom mysqli hooks or migrating the call sites, and the per-request export overhead matters far more in production than on staging. That is its own decision, and I have left it as one.

What I would tell myself a week ago

Put the collector in the middle even with one source today; the second arrives faster than you think. Treat any vendor cost metric as notional until proven otherwise. Expect long-lived processes to ignore config they were not started with. And be suspicious of your own cardinality cleverness - the label you drop to be responsible may be the one the headline number depends on.

One caveat I still cannot resolve cleanly: the cost metric is the single number an executive will want, and the one I trust least. I can caption it "notional" on a dashboard I control, but a number travels, quoted without its caption. I have no good answer for shipping a figure that is useful and wrong at once, beyond repeating the caveat until it sticks.

I tried to answer this once, with Google Analytics

Years ago I pasted the Google Analytics tag across a lot of our internal app's pages, because I wanted to know how people used the app and where the traffic went. Then I never looked. The data piled up for years behind a login I almost never opened. So when the traces started flowing last week, the honest question was "did I finally get the thing I wanted back then?"

Half of it. Tracing and product analytics overlap on exactly one thing: raw page popularity. Because the app renders on the server, the count of spans per route is a fair proxy for pageviews, and the span timing tells me which pages are slow, which Google Analytics never did at all. So Tempo hands me the "where is the traffic, and which pages drag" half of my old intent for free, a side effect of instrumenting for performance.

The half I actually said I cared about, tracing cannot give. A trace has no user, no session, no journey, no referrer. That is product analytics, a different layer, and the honest tool is something self-hosted and cookieless like Plausible or Matomo. I have not stood one up.

But the tooling was never the problem. I did not leave the Google Analytics data unread because it was the wrong tool. I left it unread because I never had a question I was going to act on. Telemetry does not fix that. A graph no one has decided to act on is the same unread data in a brighter font, and last week I built a lot of brighter fonts. The extension compiles, the spans flow, the dashboards are green. Whether any of it changes one thing I do is still up to me, and that is the one part of this stack with no package to install.

The decisions underneath this are recorded as ADR 0021, the shared collector seam, building on ADR 0011, the pull-based metrics stack.

Drafted from my working notes and the actual rollout with help from Claude. The decisions and the work are mine; the prose is collaborative.