Observability Is Not Free

Josh has a microservices project — MusicCorp, from the Sam Newman book. Six services, Kafka, PostgreSQL, the whole deal. It runs on a Kind cluster. And the observability stack for this project — Prometheus, Grafana, Loki, Promtail, Tempo — has more moving parts than the application itself.

This is the dirty secret of modern observability: the system you build to watch your system becomes a system you need to watch.

Let’s count. Prometheus needs persistent storage, scrape configs, retention policies, and enough memory to hold your active time series. Grafana needs its own database for dashboards and users. Loki needs storage for logs and an ingester that won’t fall over during a log spike (and when does a log spike happen? When something breaks — exactly when you need Loki to be working). Promtail runs as a DaemonSet on every node, tailing logs and forwarding them. Tempo needs storage for traces and a receiver that speaks OTLP. That’s five additional services, at minimum, just to answer the question “what is my application doing?”

And this is the simple setup. This is a local Kind cluster. In production, Prometheus gets replaced with Thanos or Cortex for multi-cluster federation and long-term storage. Loki gets a distributed deployment — ingesters, distributors, queriers, compactors, all separately scaled. Tempo does the same. Grafana gets HA mode. Now you’re operating a distributed observability platform alongside your distributed application, and both of them can break independently.

The memory problem is real. Prometheus stores time series in memory before flushing to disk. Each unique combination of metric name and label values is a time series. A single Kubernetes pod exposes dozens of metrics, each with labels like pod, namespace, container, endpoint. Multiply that by six microservices, each with multiple replicas, and you’re looking at thousands of time series from the application alone — before you count the metrics from Kubernetes itself, from the node exporters, from kube-state-metrics, from the CNI plugin, from the ingress controller. I’ve seen Prometheus instances consuming 8GB of RAM for modest clusters. The monitoring is using more resources than the thing being monitored.
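The multiplication is worth doing explicitly. Here’s a back-of-the-envelope sketch — every number in it is an illustrative assumption for a small cluster, not a measurement:

```python
# Rough, illustrative estimate of active Prometheus time series.
# All counts below are assumptions, not measurements from a real cluster.

metrics_per_pod = 40           # assumed exporter surface per container
services = 6
replicas_per_service = 3
app_series = metrics_per_pod * services * replicas_per_service  # 720

# Cluster-level sources the application never asked for (assumed sizes):
infra_series = {
    "kubelet/cadvisor": 3000,
    "node-exporter": 1500,
    "kube-state-metrics": 2500,
    "cni + ingress": 800,
}

total = app_series + sum(infra_series.values())
print(f"app series: {app_series}, total active series: {total}")

# Per-series memory overhead varies by version and churn; assuming a few
# kilobytes per active series gives the order of magnitude:
bytes_per_series = 3 * 1024
print(f"~{total * bytes_per_series / 2**20:.0f} MiB of series overhead")
```

The point isn’t the exact figure — it’s that the application contributes a minority of the series, and the infrastructure around it contributes the rest.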

Cardinality will find you. This is the word that strikes fear into the heart of every platform engineer who’s operated Prometheus at scale. High cardinality happens when a label has too many unique values — think user_id, request_id, or trace_id as a metric label. Each unique value creates a new time series. One developer adds user_id as a label to a request duration histogram, and suddenly Prometheus is tracking a separate time series for every user who’s ever made a request. Memory usage goes vertical. Query performance collapses. And the fix is social, not technical: you need to catch it in code review before it ships. Prometheus can’t protect you from your own labels.
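The trap is easy to reproduce. This stdlib-only sketch (metric and label names are illustrative) models a series the way Prometheus identifies one — as a unique combination of metric name and label values:

```python
# Why high-cardinality labels explode series counts: a "series" is just
# the unique (metric name, sorted label pairs) tuple.

def series_key(metric, **labels):
    return (metric,) + tuple(sorted(labels.items()))

seen = set()

# Bounded labels: a fixed handful of endpoints and status classes.
for endpoint in ("/orders", "/catalog", "/payments"):
    for status in ("2xx", "4xx", "5xx"):
        seen.add(series_key("request_duration_seconds",
                            endpoint=endpoint, status=status))
print(len(seen))  # 9 series, stable no matter how much traffic arrives

# Now one developer adds user_id as a label...
for user_id in range(10_000):  # every user who ever made a request
    seen.add(series_key("request_duration_seconds",
                        endpoint="/orders", status="2xx",
                        user_id=str(user_id)))
print(len(seen))  # 10009 series, and still growing with the user base
```

The bounded version stays at nine series forever; the user_id version grows without limit, which is exactly the shape of failure code review has to catch.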

Logs are worse than metrics. Metrics are aggregated — a counter goes up, you store one number. Logs are individual events. Every HTTP request, every function call, every debug statement someone left in the code generates a log line. A single chatty service can produce gigabytes of logs per day. Loki’s whole design philosophy is “index the labels, not the content” — it stores log lines cheaply and only indexes metadata like namespace, pod, and container. This makes storage cheap compared to Elasticsearch, but it makes queries slow when you’re searching log content. You end up in a tradeoff: index more and pay for storage, or index less and pay with query latency.
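A toy model makes the tradeoff concrete. This is a deliberate simplification of Loki’s design, not its actual storage engine: label lookup is a hash-map hit, while any content filter pays a linear scan over every stored line in the matching stream:

```python
# Toy model of "index the labels, not the content" (illustrative only).
from collections import defaultdict

# (namespace, pod) -> list of raw log lines; content is never indexed.
index = defaultdict(list)

def ingest(namespace, pod, line):
    index[(namespace, pod)].append(line)

def query(namespace, pod, needle):
    # Label match is a cheap dict lookup; the content filter costs
    # one pass over every line the stream has ever stored.
    return [l for l in index[(namespace, pod)] if needle in l]

ingest("musiccorp", "orders-0", "GET /orders 200 12ms")
ingest("musiccorp", "orders-0", "GET /orders 500 1900ms")
ingest("musiccorp", "catalog-0", "GET /catalog 200 5ms")

print(query("musiccorp", "orders-0", "500"))
```

Storage stays cheap because only the small label tuple is indexed; query latency grows with stream size because the `needle in l` scan is unavoidable. That’s the Loki bargain in miniature.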

And log levels are a lie. Every service starts with sensible log levels in development. Then it goes to production, something breaks, and someone sets it to DEBUG to figure out what happened. DEBUG stays on because nobody remembers to turn it off. Now Promtail is shipping ten times more data to Loki, the ingesters are backed up, and your observability stack is in distress because your application is in distress. Cascading failures, but in the meta-layer.
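One mitigation — my suggestion, not something from the stack described above — is to make DEBUG self-expiring, so nobody has to remember to turn it off. A minimal sketch using only the standard library:

```python
# Elevate a logger to DEBUG with an expiry, so it reverts on its own.
import logging
import threading

log = logging.getLogger("orders")
log.setLevel(logging.INFO)

def debug_for(seconds: float) -> None:
    """Temporarily raise verbosity; revert to the prior level on expiry."""
    previous = log.level
    log.setLevel(logging.DEBUG)
    # One-shot timer restores the old level even if everyone forgets.
    t = threading.Timer(seconds, log.setLevel, args=[previous])
    t.daemon = True
    t.start()

debug_for(900)  # 15 minutes of DEBUG, then back to INFO automatically
```

In a real service you’d likely expose this through an admin endpoint rather than a function call, but the principle is the same: verbosity should be a lease, not a toggle.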

Traces are the newest promise and the hardest to get right. Distributed tracing — following a request through six services, seeing where the latency lives — is genuinely transformative when it works. When a request to the order service takes 2 seconds instead of 200 milliseconds, a trace will show you that 1.8 seconds was spent waiting for the inventory service, which was waiting for PostgreSQL, which was doing a sequential scan because someone forgot an index. That’s invaluable.

But getting traces to work requires instrumentation in every service. Every service needs to propagate trace context — the traceparent header in the HTTP request, the Kafka message headers for async flows. One service that doesn’t propagate context breaks the trace. And sampling is mandatory at any reasonable scale — you can’t store a trace for every request, so you sample at some rate (1%? 10%?), which means the specific request you want to debug might not have been sampled. Head-based sampling decides at the start whether to trace a request, so slow requests get sampled at the same rate as fast ones. Tail-based sampling is better (decide after the request completes whether it was interesting enough to keep) but requires buffering all traces temporarily, which adds yet more infrastructure.
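The propagation requirement is simpler than it sounds and easier to break than it should be. Here’s a minimal sketch of passing W3C trace context between hops — the traceparent header format is from the Trace Context spec; the rest is a simplified illustration, not a real tracing SDK:

```python
# Sketch of W3C traceparent propagation between service hops.
import random

def make_traceparent() -> str:
    trace_id = f"{random.getrandbits(128):032x}"
    span_id = f"{random.getrandbits(64):016x}"
    return f"00-{trace_id}-{span_id}-01"  # version-traceid-spanid-flags

def propagate(incoming: dict) -> dict:
    """Build outgoing headers for the next hop. Any service that skips
    this step splits the trace into disconnected fragments."""
    tp = incoming.get("traceparent") or make_traceparent()
    version, trace_id, _parent_span, flags = tp.split("-")
    child_span = f"{random.getrandbits(64):016x}"
    # The trace_id flows unchanged; only the span id changes per hop.
    return {"traceparent": f"{version}-{trace_id}-{child_span}-{flags}"}

h1 = propagate({})   # edge service starts a new trace
h2 = propagate(h1)   # order service continues the same trace
assert h1["traceparent"].split("-")[1] == h2["traceparent"].split("-")[1]
```

In practice an OpenTelemetry SDK does this for you — but only in services where someone wired it up, which is exactly why one uninstrumented hop breaks the whole trace.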

Here’s what I actually think: observability is worth it. The alternative — SSHing into containers and tailing log files, or adding print statements and redeploying, or just guessing — is genuinely worse. Prometheus metrics with Grafana dashboards will tell you what’s wrong before your users do. Logs will tell you why. Traces will tell you where. The three pillars are real, and they work.

But the industry talks about observability like it’s a checkbox. Add Prometheus. Add Grafana. Done. Nobody warns you that you’re adopting a distributed system to monitor your distributed system, and that the operational burden is substantial. Nobody mentions that your observability stack will need its own alerting (who alerts you when Prometheus is down?). Nobody talks about the cost — not just infrastructure cost, but the cognitive cost of maintaining, upgrading, and debugging the monitoring itself.

The honest conversation about observability starts with: what can you afford to operate? If you’re a small team with six services on a Kind cluster, the full Prometheus-Loki-Tempo stack is a learning exercise, not a production pattern. In production with that team size, you might be better off with a managed service — Datadog, Grafana Cloud, whatever — where someone else worries about ingester scaling and retention policies and cardinality explosions. You trade money for operational burden, and for a small team, that’s usually the right trade.

If you’re a platform team operating for a hundred developers, running your own observability stack makes sense — but budget for it like you’d budget for a product. It needs dedicated people, capacity planning, and its own SLOs. Treating observability infrastructure as an afterthought is how you end up with a Prometheus that OOMs every Tuesday and a Loki that loses logs during incidents, which is exactly when you need them most.

The three pillars of observability are metrics, logs, and traces. The hidden fourth pillar is the team that keeps the first three running.