ClickHouse ran an ad this week bragging about Netflix’s logging system. Five petabytes per day. Ten million events per second. Forty thousand microservices. Three hundred million subscribers.

The ad wants you to be impressed. You should be concerned.

The Engineering Feat That Serves Itself

Netflix employs some of the best distributed systems engineers on the planet. The ad mentions “key optimizations in fingerprinting, serialization, and queries.” These are real problems requiring genuine expertise. The engineering is impressive.

And it exists entirely to look at logs.

However unevenly that volume is distributed across services, the aggregate tells the same story: this is not insight, it is uncertainty. When a system generates that much telemetry, it is confessing that it cannot predict its own behavior.
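The averages are worth a back-of-envelope pass. This sketch just runs the ad's own figures, assuming decimal units and steady rates; the real distribution is certainly lumpier.

```python
# Back-of-envelope on the ad's figures only. Assumes decimal petabytes
# and steady daily rates; real traffic is bursty and uneven.
PB = 10**15  # bytes

daily_bytes = 5 * PB           # 5 PB/day
events_per_sec = 10_000_000    # 10M events/s
services = 40_000

bytes_per_sec = daily_bytes / 86_400               # sustained ingest rate
bytes_per_event = bytes_per_sec / events_per_sec   # implied average event size
bytes_per_service_day = daily_bytes / services     # implied per-service volume

print(f"{bytes_per_sec / 1e9:.1f} GB/s sustained ingest")
print(f"{bytes_per_event / 1e3:.1f} KB per event on average")
print(f"{bytes_per_service_day / 1e9:.0f} GB of logs per service per day")
```

Even as averages, roughly 125 GB of logs per service per day describes a system that has to be watched, not one that can be read.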

The logging infrastructure exists to monitor the architectural choices that created the logging problem. This is the observability ouroboros: elite engineers building systems to watch systems, forever. The snake eating its own tail.

High observability volume is a lagging indicator of architectural indecision.

The Opportunity Cost Nobody Measures

Organizations measure their observability spend in two ways: infrastructure cost and tooling licenses. Both miss the real number.

The engineers optimizing fingerprinting algorithms and query performance are not building product. Every hour spent making the logging system handle the complexity is an hour not spent on features customers pay for. The talent bill dwarfs the infrastructure bill, but it never appears on a dashboard.

Netflix can afford this. They have the margins to employ world-class engineers on internal tooling. They have the scale where these optimizations produce measurable returns. They are the canonical example of a company whose infrastructure challenges are genuinely unique.

You are not Netflix.

The danger is not that Netflix built this. The danger is the company with 200 engineers and 150 microservices watching this ad and thinking “we should scale our observability too.” They copy the solution without examining whether they have the same disease.

Complexity Begets Complexity

Forty thousand microservices is not a success metric. It is forty thousand potential failure points. Forty thousand things to deploy, monitor, coordinate, and debug. Forty thousand sources of “why is this broken?”

At that fragmentation level, petabyte-scale logging is not a choice. It is a requirement. You need that much telemetry because you have that much surface area. The observability stack is not solving the complexity. It is a symptom of it.

Every architectural decision that increases system surface area creates downstream costs in monitoring, debugging, and coordination. The original decision shows up in a sprint planning meeting. The downstream costs show up everywhere else, forever, until someone pays to reverse them. And the real sin is that most of these decisions are effectively irreversible: the political and operational cost of unwinding them exceeds the pain of carrying them forward.

Most organizations never connect these dots. They see the observability bill as the cost of doing business rather than the interest payment on architectural debt.

What Boring Systems Do Not Need

A system built for predictability does not generate petabytes of uncertainty.

When components are few and well-understood, you do not need machine learning to correlate logs. When failure modes are designed and documented, you do not need distributed tracing to understand what broke. When the architecture fits in someone’s head, you do not need a query engine that handles ten million events per second.

Boring systems are legible. They tell you what they are doing without requiring an engineering team to translate. The logging exists to confirm expected behavior, not to search for unexpected behavior in a haystack of telemetry.
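For contrast, here is a minimal sketch of what confirmation-style logging looks like. It is an illustration, not anyone's production practice, and the checks are hypothetical placeholders: the point is that every failure mode is enumerated up front, so every log line maps to a designed state.

```python
# A sketch of "logging to confirm expected behavior": the failure modes
# are enumerated and documented, so nothing has to be discovered later.
# The specific checks are hypothetical placeholders.
import logging
import shutil

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("boring-service")

def disk_headroom_ok() -> bool:
    usage = shutil.disk_usage("/")
    return usage.used / usage.total < 0.80  # known, documented threshold

# The whole failure surface fits in one dict that fits in one head.
KNOWN_FAILURE_MODES = {
    "disk_headroom": disk_headroom_ok,
}

def health_sweep() -> None:
    for name, check in KNOWN_FAILURE_MODES.items():
        if check():
            log.info("ok: %s", name)              # confirms expected behavior
        else:
            log.error("known failure: %s", name)  # maps to a runbook entry

if __name__ == "__main__":
    health_sweep()
```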

When observability shifts from confirming known failure modes to reconstructing unknown ones, you are no longer operating the system. You are excavating it.

The Vendor Incentive

ClickHouse sells database performance. Their incentive is to celebrate the customers with the biggest, hardest problems. The ad is not advice. It is a case study designed to make you feel like your problems should be bigger.

Vendor marketing reveals what they optimize for. ClickHouse optimizes for ingestion volume and query speed. Those are real capabilities. They are also precisely the capabilities you need when your architecture has created a five-petabyte-per-day problem.

The question before buying a faster shovel is whether you should be digging the hole. ClickHouse will sell you the shovel. You pay for the engineers to swing it.

When The Rosy Path Ends

Netflix built this infrastructure during the growth era. Subscriber counts went up and to the right. The engineering org scaled for a future that justified the investment.

That future is no longer certain. Password crackdowns. Ad tiers. Price hikes. Content budgets slashed. Shows canceled after one season to avoid paying creators.

But infrastructure does not know the business model changed. It just keeps ingesting. Five petabytes today. Five petabytes tomorrow. The S3 bill does not care that subscribers are leaving. When growth was assumed, the logging costs were a rounding error on future revenue. When growth stalls, those costs become fixed overhead.
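The fixed-overhead arithmetic is easy to sketch. The retention window and the storage rate below are illustrative assumptions, and the math ignores compression and tiering, which any real deployment uses. The point is the steady-state shape, not the dollar figure.

```python
# Illustrative only: flat assumed storage rate, no compression, no
# tiering, no deletion. Steady-state cost is fixed by ingest x retention;
# it does not shrink when growth does.
PB_IN_GB = 1_000_000

ingest_pb_per_day = 5
retention_days = 90          # assumed retention window
price_per_gb_month = 0.02    # assumed illustrative $/GB-month

resident_pb = ingest_pb_per_day * retention_days   # 450 PB resident
monthly_cost = resident_pb * PB_IN_GB * price_per_gb_month

print(f"{resident_pb} PB resident at steady state")
print(f"${monthly_cost:,.0f} per month at the assumed rate")
```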

At some point, someone in finance will ask why they are paying to store every HTTP 200 from a service that supports a show they already canceled. The answer will be: because turning it off requires an engineer who understands the system, and we laid off the team that built it.

Spotify already lived this. Glenn McDonald built the systems behind Wrapped’s most beloved features. Laid off December 2023. One year later, Wrapped 2024 launched without those features. McDonald posted: “Touching to have Spotify celebrate the 1-year anniversary of my layoff by demonstrating that they could not, in fact, still do Wrapped just as well without me.”

The infrastructure outlives the assumptions that justified it, but nobody has the institutional knowledge or political capital to simplify it. The “you are not Netflix” warning cuts both ways: you should not copy their architecture because you lack their scale, and they built for a scale they may not keep.

The Boring Alternative

The alternative to petabyte-scale observability is not ignorance. It is architecture that does not require petabyte-scale observation.

Fewer services mean fewer boundaries to trace. Simpler systems mean smaller blast radii. Predictable behavior means alerts that actually indicate problems rather than noise that trains everyone to ignore them.

This is not glamorous. It will not get you a vendor case study. It will get you engineers who ship product instead of engineers who maintain the machinery that watches the machinery.

The goal is not to be impressive. The goal is to be boring.


boring (adj.): A system state where observability confirms expected behavior rather than searching for unexpected behavior, because the architecture is simple enough to have expectations.