If you've instrumented an LLM system recently, you're likely already aware of how much telemetry the observability stack can offer you: latency per completion call, token throughput, retrieval latency from your vector store, and error rates across every service boundary.
The instrumentation works, the data flows, the dashboards populate, and then…your system hallucinates confidently on every third request, with all of the metrics above remaining stubbornly silent about it.
This is the measurement problem that traditional observability engineering doesn't solve because it was designed for a different question. Traditional observability asks: is the system up, is it fast, is it throwing errors? Those are threshold questions against deterministic behavior. AI systems require a different question entirely: is the output any good? That's much more subtle than monitoring a threshold, requiring continuous, context-dependent judgment that doesn't map onto latency percentiles or 5xx rates.
The previous post in this series covered how OTel collects and propagates signals through AI systems. This one covers what to measure, and why those specific things should be your focus.
One thing before we go any further: the argument here isn't that infrastructure signals are now irrelevant. GPU utilization, memory pressure, service latency, error rates, and the panoply of other numbers you track remain both valid and necessary. What changes is that those signals are not enough, on their own. AI observability extends traditional observability with new signal categories that have no real precedent in deterministic software.
With that in mind, let’s get into the metrics side of AI observability engineering, including cost attribution, tracking issues like prompt regression, AI agent monitoring, and all the rest.
People have different ways of defining the canonical observability signal types, but metrics, logs, and traces are a standard way of doing it. These continue to work as designed in AI agent systems, with traces capturing every span from user request to completion response, logs recording the documents your retrieval pipeline returned, metrics tracking your inference endpoint's p99 latency, and so on.
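To make that concrete, here is a minimal sketch of what a completion span might record. It deliberately avoids the OpenTelemetry SDK and just builds a span-like dict; the `gen_ai.*` attribute names follow OTel's generative-AI semantic conventions, while `model_call` and the whitespace token counting are placeholders:

```python
import time

def traced_completion(prompt: str, model_call) -> dict:
    """Record a completion call as a span-like dict. Attribute names follow
    the OTel gen_ai semantic conventions; `model_call` stands in for a real
    provider client, and whitespace token counts are a placeholder."""
    start = time.time()
    response = model_call(prompt)
    return {
        "span_name": "llm.completion",
        "duration_ms": (time.time() - start) * 1000,
        "attributes": {
            "gen_ai.usage.input_tokens": len(prompt.split()),
            "gen_ai.usage.output_tokens": len(response.split()),
        },
        "response": response,
    }
```

Notice that every field here captures mechanism: duration, token counts, payloads. Nothing in the span says whether the answer was correct.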
These mechanisms are sound, but they lack the resolution to detect the quality of reasoning that occurred between input and output. Consider, for example, a RAG pipeline whose trace shows every span completing quickly and without errors while the response confidently cites a document the retriever never returned.
This is a structural gap; traditional signals measure the mechanism of computation (did the function execute, how long did it take, what did it return), but AI signals have to measure the output of reasoning (was the response correct, was it grounded, was it appropriate). These are different problems, and they require different instrumentation.
This is why evaluation has emerged as a distinct layer in the AI observability stack. It’s not a replacement for tracing, it’s a required complement to it. More on that shortly.
In traditional software, compute cost is generally a billing artifact. You pay at the end of the month, review it quarterly, and optimize when this particular budget item gets uncomfortable. In LLM systems, cost is a per-request operational signal, and that distinction matters in ways that might not be obvious at first.
Recall that LLM providers charge per token. Every completion call has immediate cost attribution (usually prompt tokens plus completion tokens), which is priced per model, per provider, per request. At low volume, this is a footnote, but at production scale, it balloons into a first-class operational concern requiring the same instrumentation discipline you'd apply to any other critical metric.
The signals worth tracking include:

- Token counts per request, split into prompt and completion tokens and attributed per model and per provider
- Cost per request, per user, and per feature, so spend maps back to the workloads driving it
- Cost anomalies: requests, sessions, or agent runs whose spend deviates sharply from baseline
That last point reflects something practitioners discover quickly in production: cost anomalies are often failure signals before quality metrics catch them. An agent loop that runs longer than expected — thereby consuming more tokens across more steps than it should — might indicate a reasoning problem upstream. The cost spike is the early warning; the quality degradation follows. Treating cost as a pure billing concern means missing this diagnostic signal entirely.
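Treating cost as that kind of early-warning signal can be sketched as flagging per-request spend against a rolling baseline; the window size and z-score threshold below are arbitrary placeholders, not recommendations:

```python
from collections import deque
from statistics import mean, stdev

class CostAnomalyDetector:
    """Flag per-request costs that deviate sharply from a rolling baseline."""

    def __init__(self, window: int = 100, threshold: float = 3.0):
        self.costs = deque(maxlen=window)  # rolling window of recent costs
        self.threshold = threshold         # z-score cutoff for "anomalous"

    def observe(self, cost_usd: float) -> bool:
        """Record one request's cost; return True if it looks anomalous."""
        anomalous = False
        if len(self.costs) >= 10:  # wait for a minimal baseline
            mu, sigma = mean(self.costs), stdev(self.costs)
            # A runaway agent loop shows up here as a cost spike,
            # often before any quality metric moves.
            anomalous = sigma > 0 and (cost_usd - mu) / sigma > self.threshold
        self.costs.append(cost_usd)
        return anomalous
```

A detector like this runs inline with request handling, which is exactly what "per-request operational signal" means in practice.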
For teams running multi-model setups, a normalized cost layer across providers simplifies attribution considerably — otherwise you're reconciling per-token pricing across different provider schemas, which is not much fun.
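A normalized cost layer can be as simple as a shared pricing table keyed by provider and model. The prices below are made-up placeholders, since real per-token rates vary by provider and change frequently:

```python
# Hypothetical per-1K-token prices in USD; real rates differ and change often.
PRICING = {
    ("openai", "gpt-4o"):        {"input": 0.0025, "output": 0.0100},
    ("anthropic", "claude-3-5"): {"input": 0.0030, "output": 0.0150},
}

def request_cost_usd(provider: str, model: str,
                     input_tokens: int, output_tokens: int) -> float:
    """Normalize per-token pricing into a single per-request cost figure,
    regardless of which provider served the request."""
    rates = PRICING[(provider, model)]
    return (input_tokens / 1000) * rates["input"] \
         + (output_tokens / 1000) * rates["output"]
```

With one function producing one number per request, attribution across a multi-model fleet reduces to tagging and aggregation rather than schema reconciliation.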
Quality, of course, is harder to measure than cost. Cost is a number; quality is a judgment. But "harder to measure" doesn't mean "impossible to operationalize," and the patterns for doing so have become fairly well established across system types.
Quality signals vary by what kind of system you're instrumenting:

- RAG pipelines: groundedness and faithfulness, meaning whether the response is actually supported by the documents the retriever returned
- Open-ended generation: relevance, coherence, and instruction adherence, typically scored by an LLM-as-judge
- Structured tasks such as classification or extraction: accuracy against reference labels where ground truth exists
AI hallucination deserves specific treatment because it's not a uniform failure mode. A fabricated proper noun in a customer-facing response is categorically different from a reasonable extrapolation in an analytical context; a confident misattribution in a regulated domain is different again. Detection approaches — retrieval-grounded scoring where you have reference documents, LLM-as-judge evaluation for open-ended output, reference comparisons where ground truth exists — have different applicability depending on which failure mode you're most exposed to.
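As an illustration of the retrieval-grounded end of that spectrum, here is a deliberately crude lexical grounding check. Production systems would use entailment models or embedding similarity instead, but the shape of the signal is the same:

```python
def grounding_score(response: str, retrieved_docs: list[str]) -> float:
    """Crude lexical grounding check: the fraction of response tokens that
    appear somewhere in the retrieved context. Real detectors use
    sentence-level entailment or embedding similarity instead."""
    context = set(" ".join(retrieved_docs).lower().split())
    tokens = response.lower().split()
    if not tokens:
        return 0.0
    grounded = sum(1 for t in tokens if t in context)
    return grounded / len(tokens)
```

A low score doesn't prove hallucination, and a high one doesn't rule it out; it's a cheap per-request signal you can threshold, trend, and route for deeper evaluation.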
Semantic drift and model degradation make all of this trickier still, because they mean that output quality isn't static. Models receive updates, prompts age against shifting user behavior, fine-tuned models drift from their evaluation distributions over time, etc. Quality metrics tracked as point-in-time snapshots will miss gradual degradation that becomes obvious only in retrospect. This is where the monitoring-to-observability distinction becomes concrete: monitoring tells you today's quality score; observability lets you see the trend, identify when it changed, and trace it back to what caused it.
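One low-tech way to see the trend rather than the snapshot is to smooth quality scores over time. This sketch uses an exponentially weighted moving average; the alpha and quality-floor values are placeholder assumptions, not tuned recommendations:

```python
class QualityTrend:
    """Track an exponentially weighted moving average of quality scores
    so gradual degradation is visible, not just today's number."""

    def __init__(self, alpha=0.05, baseline=None):
        self.alpha = alpha    # weight given to the newest observation
        self.ewma = baseline  # smoothed score; None until first update

    def update(self, score: float) -> float:
        if self.ewma is None:
            self.ewma = score
        else:
            self.ewma = self.alpha * score + (1 - self.alpha) * self.ewma
        return self.ewma

    def degraded(self, floor: float) -> bool:
        """True once the smoothed score has drifted below a quality floor."""
        return self.ewma is not None and self.ewma < floor
```

The point isn't the smoothing algorithm; it's that quality becomes a time series with a baseline, so "it changed around the model update on the 12th" becomes an answerable question.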
This points to a measurement principle that practitioners working across AI observability tend to converge on independently: performance, cost, and quality must be tracked together, because they trade against each other (i.e., switching to a smaller model to reduce latency typically degrades output quality). The signals are interdependent, and the instrumentation has to treat them that way.
Safety signals are distinct from quality signals in a way that's worth making explicit. Quality asks whether an output is good; safety asks whether it's appropriate, unbiased, and traceable. These aren't the same question, and they don't share the same instrumentation.
The signals required to engineer that trust include:

- Content safety: screening output for toxicity, policy violations, and contextually inappropriate responses
- Bias detection: checking whether output quality or tone varies systematically across user groups
- Auditability: complete, retrievable records of what the system was asked, what it produced, and which checks it passed
The upshot of all this is that teams deploying AI in production are navigating a genuine trust gap. The path from "human reviews everything" to graduated, policy-based trust runs through auditability because you can't extend trust to a system whose behavior you can't retrospectively verify. Closing that gap is an observability problem.
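As a sketch of the auditability piece: a minimal, append-only audit record per request, hashing content rather than storing it raw. The field names and the hashing choice are illustrative assumptions, not any standard's schema:

```python
import json, hashlib, time

def audit_record(request_id, model, prompt, response, quality_score, flags):
    """Build one structured audit record as a JSON line, so behavior can be
    retrospectively verified without replaying the system."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "model": model,
        # Hash content instead of storing it raw: the trail stays
        # verifiable while limiting sensitive-data sprawl.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_sha256": hashlib.sha256(response.encode()).hexdigest(),
        "quality_score": quality_score,
        "safety_flags": flags,
    }
    return json.dumps(record)
```

Appending these lines to durable storage is what makes graduated, policy-based trust possible: you can always reconstruct what the system did and whether its checks fired.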
Traditional software correctness is binary: the function returns the right value or it doesn't. The test suite tells you whether it passed. AI output correctness is probabilistic and context-dependent; the test suite model doesn't transfer.
AI evaluation frameworks such as Ragas, LLM-as-judge patterns, and hand-rolled scoring pipelines have emerged as the practical response. The pattern revolves around defining evaluation criteria appropriate to your use case, running assessments continuously against live or sampled output, and feeding those results back as operational signals alongside the infrastructure metrics you're already collecting.
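A single LLM-as-judge scoring step might look like the following; the rubric text and the pluggable `judge` callable are illustrative assumptions, not any particular framework's API:

```python
import re

RUBRIC = (
    "Rate the RESPONSE for factual accuracy against the CONTEXT "
    "on a scale of 1-5. Reply with a single integer.\n"
    "CONTEXT: {context}\nRESPONSE: {response}\nSCORE:"
)

def judge_score(context: str, response: str, judge) -> int:
    """Run one LLM-as-judge evaluation. `judge` is any callable that takes
    a prompt string and returns the judge model's raw text reply."""
    reply = judge(RUBRIC.format(context=context, response=response))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())
```

Run continuously over sampled production traffic, scores like this become just another time series, sitting next to latency and cost on the same dashboards.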
This also points to what makes agentic systems the hardest observability problem in the current stack — evaluation has to occur at the “reasoning” level, meaning across chains of decisions with no single point of failure to isolate. That's the subject of Part 5.
Bringing these three signal categories together — cost, quality, safety — doesn't just make AI observability wider than traditional observability (more layers to instrument), it also makes it deeper: new signal types at every layer, including signals that require interpretation rather than measurement.
What this means is that teams approaching AI observability as "add dashboards to existing monitoring" might well end up with fast, low-error systems that hallucinate reliably, accumulate unpredictable costs, and produce outputs they can't retrospectively audit. Teams that instrument cost, quality, and safety as first-class operational signals have something more useful: the foundation for knowing which problems actually deserve attention.
That's where the instrumentation story leads. Once you have these signals, the next question is prioritization: you can't investigate every quality degradation event or cost anomaly simultaneously, so you have to decide what surfaces first, what can wait, and what the cost of waiting is.
That problem has a shape, and it's what the rest of this series is built around.